[llvm-dev] RFC: System (cache, etc.) model for LLVM
David Greene via llvm-dev
llvm-dev at lists.llvm.org
Thu Nov 8 08:35:55 PST 2018
Michael Kruse <llvmdev at meinersbur.de> writes:
> Thank you for the detailed explanation. We could use a notion of
> "sustainable stream", i.e. the maximum number of (consecutive?)
> read/write streams that a processor can support before a
> disproportional loss in performance happens. This is oblivious to the
> reason why that performance loss happens, be it write combining
> buffers or prefetch streams. If there are multiple such bottlenecks, it
> would be the minimum of such streams. At the moment I cannot think of
> an optimization where the difference matters (which doesn't mean there
> isn't a case where it does).
What about load prefetching vs. non-temporal stores on X86? There's a
limited number of write-combining buffers, but prefetches "just" use the
regular load paths. Yes, there's a limited number of load buffers, but I
would expect the number of independent prefetch streams one would want
to differ substantially from the number of independent non-temporal
store streams one would want, and you wouldn't want the minimum of the
two to constrain both.
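For concreteness, here's a quick sketch (mine, not part of the RFC) of
the two kinds of stream on x86. The prefetches ride the ordinary load
path, while each streaming store ties up a write-combining buffer until
a full line is assembled; the prefetch distance of 8 vectors is
arbitrary, purely for illustration:

#include <immintrin.h>
#include <cstddef>

// Copy two source streams to two destination streams.
void copy2(const __m128i *a, const __m128i *b,
           __m128i *da, __m128i *db, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    // Two independent load-side prefetch streams.
    _mm_prefetch(reinterpret_cast<const char *>(a + i + 8), _MM_HINT_T0);
    _mm_prefetch(reinterpret_cast<const char *>(b + i + 8), _MM_HINT_T0);
    // Two independent non-temporal store streams (write-combining).
    _mm_stream_si128(da + i, _mm_load_si128(a + i));
    _mm_stream_si128(db + i, _mm_load_si128(b + i));
  }
  _mm_sfence(); // make the streaming stores globally visible
}

A model that reported only the minimum of the two stream counts would
needlessly constrain whichever side has more headroom.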
I like the idea of abstracting the hardware resources for the
compiler's needs, though I think we will in general want multiple such
abstractions. Maybe one for loads and one for stores to start? For more
hardware-y things like llvm-mca, more detail may be desired.
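For the load/store split, something like this, say (all names
hypothetical, purely to make the shape concrete):

struct StreamLimits {
  // Independent load/prefetch streams the target sustains before
  // performance drops off disproportionally.
  unsigned MaxLoadStreams;
  // Independent store streams, e.g. bounded by write-combining
  // buffers for non-temporal stores.
  unsigned MaxStoreStreams;
};

// A transform creating NL load streams and NS store streams checks
// each limit separately rather than the minimum of the two.
inline bool fitsStreamLimits(const StreamLimits &SL,
                             unsigned NL, unsigned NS) {
  return NL <= SL.MaxLoadStreams && NS <= SL.MaxStoreStreams;
}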
>> That seems ok to me. As I understand it, L1P is a little awkward in
>> that L2 data doesn't get moved to L1P, it gets moved to L1. L1P is
>> really a prefetch buffer, right? One wouldn't do, say, cache blocking
>> for L1P. In that sense maybe modeling it as a cache level isn't the
>> right thing.
>
> The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking
> indeed makes no sense.
>
> But when optimizing for it, I could not just ignore it. However, maybe
> we should leave it out of our API considerations. The Blue Gene/Q is
> being phased out, and I know of no other architecture that has such a
> cache hierarchy.
Ok. See more below.
>> How does software make use of L1P? I understand compilers can insert
>> data prefetches and the data resides in L1P, presumably until it is
>> accessed and then it moves to L1. I suppose the size of L1P could
>> determine how aggressively compilers prefetch. Is that the idea or are
>> you thinking of something else?
>
> I declared streams for the CPU to prefetch (which 'run' at different
> speeds over the memory) and which, at some point in time, I can assume
> be in the L1P cache. Using the dcbt instruction, the cache line can be
> lifted from the L1P to the L1 cache, a fixed number of cycles in
> advance. If the cache line had to be prefetched from L2, the
> prefetch/access latency would be longer (24 cycles vs 82 cycles).
Ok, I understand better now, thanks. L1P really is a prefetch buffer,
but there's software control to move its contents into the faster L1
cache if desired.
Should we model it as part of the prefetching API?
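If we do, I'd imagine something along these lines (names are mine,
purely a sketch), carrying the buffer size and the two latencies you
quote so a pass can derive how far ahead to issue dcbt:

struct SoftwarePrefetchBuffer {
  unsigned SizeBytes;         // 4 KiB on Blue Gene/Q
  unsigned HitLatencyCycles;  // dcbt lift from L1P to L1: ~24 cycles
  unsigned MissLatencyCycles; // line must still come from L2: ~82 cycles
};

// Convert the lift latency into a prefetch distance in loop
// iterations, rounding up, given an estimated cycles-per-iteration.
inline unsigned prefetchDistance(const SoftwarePrefetchBuffer &B,
                                 unsigned CyclesPerIter) {
  return (B.HitLatencyCycles + CyclesPerIter - 1) / CyclesPerIter;
}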
-David