[llvm-dev] RFC: System (cache, etc.) model for LLVM
Finkel, Hal J. via llvm-dev
llvm-dev at lists.llvm.org
Thu Nov 8 09:09:35 PST 2018
On 11/08/2018 10:35 AM, David Greene via llvm-dev wrote:
> Michael Kruse <llvmdev at meinersbur.de> writes:
>
>> Thank you for the detailed explanation. We could use a notion of
>> "sustainable stream", i.e. the maximum number of (consecutive?)
>> read/write streams that a processor can support before a
>> disproportionate loss in performance occurs. This is agnostic to the
>> reason why that performance loss happens, be it write-combining
>> buffers or prefetch streams. If there are multiple such bottlenecks,
>> it would be the minimum over them. At the moment I cannot think of
>> an optimization where the difference matters (which doesn't mean there
>> isn't a case where it does).
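To make the "sustainable streams" notion concrete, a query on such a
model might look like the sketch below (the class and method names are
hypothetical, not an existing TTI hook):

    // Hypothetical sketch (not an existing TTI hook): the maximum
    // number of concurrent memory streams the target can sustain
    // before performance degrades disproportionately, whatever the
    // underlying bottleneck happens to be.
    class TargetSystemModel {
    public:
      unsigned getMaxSustainableStreams() const;
    };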
> What about load prefetching vs. non-temporal stores on X86? There's a
> limited number of write-combining buffers but prefetches "just" use the
> regular load paths. Yes, there's a limited number of load buffers but I
> would expect the number of independent prefetch streams one would
> want to differ substantially from the number of independent
> non-temporal store streams one would want, and you wouldn't want the
> minimum of one to apply to the other.
>
> I like the idea of abstracting the hardware resource for the compiler's
> needs, though I think we will in general want multiple such things.
> Maybe one for load and one for store to start? For more hardware-y
> things like llvm-mca, more detail may be desired.
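Sketched the same way, that would suggest one resource count per access
kind rather than a single minimum (again, purely hypothetical names,
adding to the sketch above):

    // Hypothetical refinement: query streams per access kind, so a
    // store-side limit (e.g. write-combining buffers) is not forced
    // onto loads, and vice versa.
    enum class StreamKind { Load, Store };
    unsigned getMaxSustainableStreams(StreamKind Kind) const;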
>
>>> That seems ok to me. As I understand it, L1P is a little awkward in
>>> that L2 data doesn't get moved to L1P, it gets moved to L1. L1P is
>>> really a prefetch buffer, right? One wouldn't do, say, cache blocking
>>> for L1P. In that sense maybe modeling it as a cache level isn't the
>>> right thing.
>> The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking
>> indeed makes no sense.
>>
>> But when optimizing for it, I could not just ignore it. However, maybe
>> we should leave it out of our API considerations. The Blue Gene/Q is
>> being phased out, and I know of no other architecture with such a
>> cache hierarchy.
> Ok. See more below.
>
>>> How does software make use of L1P? I understand compilers can insert
>>> data prefetches and the data resides in L1P, presumably until it is
>>> accessed and then it moves to L1. I suppose the size of L1P could
>>> determine how aggressively compilers prefetch. Is that the idea or are
>>> you thinking of something else?
>> I declared streams for the CPU to prefetch (which 'run' over the
>> memory at different speeds); at some point in time I can assume their
>> data to be in the L1P cache. Using the dcbt instruction, a cache line
>> can be lifted from L1P into the L1 cache a fixed number of cycles in
>> advance. If the cache line had to be prefetched from L2, the
>> prefetch/access latency would be longer (24 cycles vs 82 cycles).
> Ok, I understand better now, thanks. L1P really is a prefetch buffer
> but there's software control to move it to faster cache if desired.
> Should we model it as part of the prefetching API?
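As a source-level illustration of that software control, the pattern
amounts to prefetching a fixed distance ahead of the loop (shown with
the generic __builtin_prefetch, which lowers to dcbt on PowerPC; the
distance constant here is invented for the example):

    // Sketch: prefetch PF_DIST iterations ahead so each line has been
    // lifted into L1 by the time the loop reaches it. On BG/Q the
    // distance would be derived from the ~24-cycle L1P-to-L1 latency.
    enum { PF_DIST = 8 }; // made-up tuning constant
    double sum(const double *A, long N) {
      double S = 0.0;
      for (long I = 0; I < N; ++I) {
        __builtin_prefetch(&A[I + PF_DIST]); // dcbt on PowerPC
        S += A[I];
      }
      return S;
    }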
At this point, I'd not base any API-structuring decisions on the BG/Q
specifically. The generic feature that might be worth modeling is: Into
what level of cache does automated prefetching take place? I know of
several architectures that don't do automated prefetching into the L1,
but only into the L2 (or similar).
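In sketch form, that could be a single query alongside the cache-level
description (hypothetical name again):

    // Hypothetical: the innermost cache level that the hardware
    // prefetcher fills, e.g. 1 for targets that prefetch into L1,
    // 2 for targets that only prefetch into L2.
    unsigned getHardwarePrefetchTargetLevel() const;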
-Hal
>
> -David
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory