[llvm-dev] RFC: System (cache, etc.) model for LLVM
David Greene via llvm-dev
llvm-dev at lists.llvm.org
Thu Nov 8 08:35:55 PST 2018
Michael Kruse <llvmdev at meinersbur.de> writes:
> Thank you for the detailed explanation. We could use a notion of
> "sustainable stream", i.e. the maximum number of (consecutive?)
> read/write streams that a processor can support before a
> disproportional loss in performance happens. This is oblivious to the
> reason why that performance loss happens, be it write combining
> buffers or prefetch streams. If there are multiple such bottlenecks, it
> would be the minimum of such streams. At the moment I cannot think of
> an optimization where the difference matters (which doesn't mean there
> isn't a case where it does).
What about load prefetching vs. non-temporal stores on X86? There's a
limited number of write-combining buffers, but prefetches "just" use the
regular load paths. Yes, there's a limited number of load buffers, but I
would expect the number of independent prefetch streams one would want
to differ substantially from the number of independent non-temporal
store streams one would want, and you wouldn't want the minimum of the
two to constrain both.
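For concreteness, here's a quick sketch (mine, not part of the RFC) of
the two kinds of stream on x86. The prefetches ride the ordinary load
path, while each streaming store ties up a write-combining buffer until
a full line is assembled; the prefetch distance of 8 vectors is
arbitrary, purely for illustration:

#include <immintrin.h>
#include <cstddef>

// Copy two source streams to two destination streams.
void copy2(const __m128i *a, const __m128i *b,
           __m128i *da, __m128i *db, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    // Two independent load-side prefetch streams.
    _mm_prefetch(reinterpret_cast<const char *>(a + i + 8), _MM_HINT_T0);
    _mm_prefetch(reinterpret_cast<const char *>(b + i + 8), _MM_HINT_T0);
    // Two independent non-temporal store streams (write-combining).
    _mm_stream_si128(da + i, _mm_load_si128(a + i));
    _mm_stream_si128(db + i, _mm_load_si128(b + i));
  }
  _mm_sfence(); // make the streaming stores globally visible
}

A model that reported only the minimum of the two stream counts would
needlessly constrain whichever side has more headroom.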
I like the idea of abstracting the hardware resources for the
compiler's needs, though I think we will in general want multiple such
abstractions. Maybe one for loads and one for stores to start? For more
hardware-y things like llvm-mca, more detail may be desired.
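For the load/store split, something like this, say (all names
hypothetical, purely to make the shape concrete):

struct StreamLimits {
  // Independent load/prefetch streams the target sustains before
  // performance drops off disproportionally.
  unsigned MaxLoadStreams;
  // Independent store streams, e.g. bounded by write-combining
  // buffers for non-temporal stores.
  unsigned MaxStoreStreams;
};

// A transform creating NL load streams and NS store streams checks
// each limit separately rather than the minimum of the two.
inline bool fitsStreamLimits(const StreamLimits &SL,
                             unsigned NL, unsigned NS) {
  return NL <= SL.MaxLoadStreams && NS <= SL.MaxStoreStreams;
}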
>> That seems ok to me. As I understand it, L1P is a little awkward in
>> that L2 data doesn't get moved to L1P, it gets moved to L1. L1P is
>> really a prefetch buffer, right? One wouldn't do, say, cache blocking
>> for L1P. In that sense maybe modeling it as a cache level isn't the
>> right thing.
>
> The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking
> indeed makes no sense.
>
> But when optimizing for it, I could not just ignore it. However, maybe
> we should leave it out of our API considerations. The Blue Gene/Q is
> being phased out, and I know of no other architecture that has such a
> cache hierarchy.
Ok. See more below.
>> How does software make use of L1P? I understand compilers can insert
>> data prefetches and the data resides in L1P, presumably until it is
>> accessed and then it moves to L1. I suppose the size of L1P could
>> determine how aggressively compilers prefetch. Is that the idea or are
>> you thinking of something else?
>
> I declared streams for the CPU to prefetch (which 'run' at different
> speeds over the memory) and which, at some point in time, I can assume
> be in the L1P cache. Using the dcbt instruction, the cache line can be
> lifted from the L1P to the L1 cache, a fixed number of cycles in
> advance. If the cache line had to be prefetched from L2, the
> prefetch/access latency would be longer (24 cycles vs 82 cycles).
Ok, I understand better now, thanks. L1P really is a prefetch buffer,
but there's software control to move its contents into the faster L1
cache if desired.
Should we model it as part of the prefetching API?
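If we do, I'd imagine something along these lines (names are mine,
purely a sketch), carrying the buffer size and the two latencies you
quote so a pass can derive how far ahead to issue dcbt:

struct SoftwarePrefetchBuffer {
  unsigned SizeBytes;         // 4 KiB on Blue Gene/Q
  unsigned HitLatencyCycles;  // dcbt lift from L1P to L1: ~24 cycles
  unsigned MissLatencyCycles; // line must still come from L2: ~82 cycles
};

// Convert the lift latency into a prefetch distance in loop
// iterations, rounding up, given an estimated cycles-per-iteration.
inline unsigned prefetchDistance(const SoftwarePrefetchBuffer &B,
                                 unsigned CyclesPerIter) {
  return (B.HitLatencyCycles + CyclesPerIter - 1) / CyclesPerIter;
}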
-David