[llvm-dev] RFC: System (cache, etc.) model for LLVM

Michael Kruse via llvm-dev llvm-dev at lists.llvm.org
Wed Nov 7 15:26:39 PST 2018


On Mon, Nov 5, 2018 at 10:26 AM David Greene <dag at cray.com> wrote:
> Yes, I agree the terminology is confusing.  I used the term "stream" in
> the sense of stream processing (https://en.wikipedia.org/wiki/Stream_processing).
> The programming model is very different, of course, but the idea of a
> stream of data that is acted upon and then essentially discarded is
> similar.
>
> > In the latter interpretation, what does "number of streams" mean?
> > AFAIU the processor will just queue memory operations (e.g. for
> > writing to RAM). Is it the maximum number of operations in the queue?
>
> On X86 NT writes are to so-called "write-combined memory."
> Hardware-wise, that translates to some number of buffers where stores
> are collected and merged, resulting in a (hopefully) single cache
> line-sized write to memory that aggregates some number of individual
> stores.  The number of buffers limits the number of independent sets of
> stores that can be active.  For example, if the hardware has four such
> buffers, I can do this and be fine:
>
> for (...)
>   A[i] = ...
>   B[i] = ...
>   C[i] = ...
>   D[i] = ...
>
> The sequence of (non-temporal) stores to each array will map to separate
> hardware buffers and be write-combined in the way one would expect.  But
> this causes a problem:
>
> for (...)
>   A[i] = ...
>   B[i] = ...
>   C[i] = ...
>   D[i] = ...
>   E[i] = ...
>
> If all of the stores are non-temporal, then at least two of the array
> store sequences will interfere with each other in the write-combining
> buffers and will force early flushes of the buffer, effectively turning
> them into single-store writes to memory.  That's bad.
>
> Maybe the proper name for this concept is simply "WriteCombiningBuffer."
> I'm not sure if some other architecture might have a concept of store
> buffers that does something other than write-combining, so I was trying
> to use a fairly generic name to mean, "some compiler-controlled hardware
> buffer."
>
> There's a similar concept on the load side though I don't know if any
> existing processors actually implement things that way.  I know of
> (academic) architectures where prefetches fill independent prefetch
> buffers and one wouldn't want to prefetch too many different things
> because they would start filling each others' buffers.  That kind of
> behavior could be captured by this model.
>
> The key factor is contention.  There's a limited hardware memory buffer
> resource and compilers have to be careful not to oversubscribe it.  I
> don't know what the right name for it is.  Probably there will be more
> than one such type of resource for some architectures, so for now maybe
> we just model write-combining buffers and leave it at that.  If other
> such resources pop up we can model them with different names.
>
> I think that's what we should do.
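
(To make the write-combining pattern above concrete, here is an
illustrative sketch of the first loop using x86 SSE2 non-temporal
store intrinsics; the sketch and the names in it are mine, not part of
the proposal:)

  #include <emmintrin.h>

  void fill4(int *A, int *B, int *C, int *D, int n) {
    for (int i = 0; i < n; ++i) {
      // Each non-temporal store sequence occupies one write-combining
      // buffer; with four buffers, these four streams can coexist.
      _mm_stream_si32(&A[i], i);
      _mm_stream_si32(&B[i], i);
      _mm_stream_si32(&C[i], i);
      _mm_stream_si32(&D[i], i);
    }
    _mm_sfence(); // make the non-temporal stores globally visible
  }

Adding a fifth such stream would oversubscribe the four buffers and
cause the early flushes described above.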

Thank you for the detailed explanation. We could use a notion of
"sustainable streams", i.e. the maximum number of (consecutive?)
read/write streams that a processor can support before a
disproportionate loss in performance occurs. This is agnostic to the
reason for that performance loss, be it write-combining buffers or
prefetch streams. If there are multiple such bottlenecks, the limit
would be the minimum over them. At the moment I cannot think of an
optimization where the distinction matters (which doesn't mean there
isn't a case where it does).
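
As a sketch of how that could surface in the proposed model, something
along the following lines (the names are hypothetical, not what the
RFC currently defines):

  // Hypothetical sketch; the names are illustrative only.
  class TargetMemorySystemInfo {
  public:
    // Maximum number of concurrent read/write streams the target can
    // sustain before throughput drops disproportionately, whatever the
    // underlying bottleneck (write-combining buffers, prefetch
    // streams, ...).
    unsigned getMaxSustainableStreams() const {
      return MaxSustainableStreams;
    }

  private:
    unsigned MaxSustainableStreams = 4; // e.g. four WC buffers
  };

A transformation such as loop distribution could then compare the
number of distinct access streams in a loop against this limit and
split the loop when the limit is exceeded.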


> That seems ok to me.  As I understand it, L1P is a little awkward in
> that L2 data doesn't get moved to L1P; it gets moved to L1.  L1P is
> really a prefetch buffer, right?  One wouldn't do, say, cache blocking
> for L1P.  In that sense maybe modeling it as a cache level isn't the
> right thing.

The L1P (4 KiB) is smaller than the L1 cache (16 KiB), so blocking for
it indeed makes no sense.

But when optimizing for that machine, I could not simply ignore it.
However, maybe we should leave it out of the API for now. The Blue
Gene/Q is being phased out and I know of no other architecture with
such a cache hierarchy.


> How does software make use of L1P?  I understand compilers can insert
> data prefetches and the data resides in L1P, presumably until it is
> accessed and then it moves to L1.  I suppose the size of L1P could
> determine how aggressively compilers prefetch.  Is that the idea or are
> you thinking of something else?

I declared streams for the CPU to prefetch (which 'run' over memory at
different speeds), so that at some point in time I can assume the data
to be in the L1P cache. Using the dcbt instruction, a cache line can
then be lifted from the L1P into the L1 cache a fixed number of cycles
before it is accessed. If the cache line had to be fetched from L2
instead, the prefetch/access latency would be longer (24 cycles from
L1P vs. 82 cycles from L2).
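
To illustrate (again only a sketch of mine, not Blue Gene/Q-specific
code): with a generic prefetch builtin, the prefetch is issued a fixed
distance ahead of the use, and that distance would be derived from the
latency above and the cost of one loop iteration:

  // PFDIST is hypothetical; it would be computed from the prefetch
  // latency (24 cycles from L1P vs. 82 cycles from L2) divided by the
  // cycles per iteration.
  constexpr int PFDIST = 8;

  double sum(const double *A, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
      // On PowerPC this typically lowers to a dcbt instruction.
      __builtin_prefetch(&A[i + PFDIST], /*rw=*/0, /*locality=*/3);
      s += A[i];
    }
    return s;
  }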

Michael

