[llvm-dev] RFC: System (cache, etc.) model for LLVM

David Greene via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 5 08:25:48 PST 2018


Michael Kruse <llvmdev at meinersbur.de> writes:

>> > Again, from the Blue Gene/Q: what counts as a stream is configurable
>> > at runtime via a hardware register. It supports 3 settings:
>> > * Interpret every memory access as the start of a stream
>> > * Establish a stream after 2 consecutive cache misses
>> > * Only establish streams via dcbt instructions.
>>
>> I think we're interpreting "streaming" differently.  In this design, a
>> "stream" is a sequence of memory operations that should bypass the cache
>> because the data will never be reused (at least not in a timely manner).
>
> I understood "stream" as "prefetch stream", something that prefetches
> the data for an access A[i] in a for-loop.
>
> I'd call "bypassing the cache because the data will never be reused" a
> non-temporal memory access.

Yes, I agree the terminology is confusing.  I used the term "stream" in
the sense of stream processing (https://en.wikipedia.org/wiki/Stream_processing).
The programming model is very different, of course, but the idea of a
stream of data that is acted upon and then essentially discarded is
similar.

> In the latter interpretation, what does "number of streams" mean?
> AFAIU the processor will just queue memory operations (e.g. for
> writing to RAM). Is it the maximum number of operations in the queue?

On X86, NT writes go to so-called "write-combining memory."
Hardware-wise, that translates to some number of buffers in which
stores are collected and merged, so that (hopefully) a single cache
line-sized write to memory aggregates some number of individual
stores.  The number of buffers limits the number of independent sets of
stores that can be active.  For example, if the hardware has four such
buffers, I can do this and be fine:

for (...) {
  A[i] = ...
  B[i] = ...
  C[i] = ...
  D[i] = ...
}

The sequence of (non-temporal) stores to each array will map to a
separate hardware buffer and be write-combined in the way one would
expect.  But adding a fifth array causes a problem:

for (...) {
  A[i] = ...
  B[i] = ...
  C[i] = ...
  D[i] = ...
  E[i] = ...
}

If all of the stores are non-temporal, then at least two of the array
store sequences will contend for the same write-combining buffer,
forcing early flushes that effectively turn the combined writes into
individual single-store writes to memory.  That's bad.
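
To make that concrete, here is a minimal sketch using the x86 SSE2
non-temporal store intrinsic.  The function name, stored values and
bounds are invented purely for illustration:

#include <emmintrin.h>

// Four independent non-temporal store streams.  With four
// write-combining buffers, each stream gets its own buffer and the
// stores combine into full cache-line writes.
void nt_store4(int *A, int *B, int *C, int *D, int n) {
  for (int i = 0; i < n; ++i) {
    _mm_stream_si32(&A[i], i);
    _mm_stream_si32(&B[i], i);
    _mm_stream_si32(&C[i], i);
    _mm_stream_si32(&D[i], i);
  }
  _mm_sfence();  // order the NT stores before subsequent stores
}

Adding a fifth _mm_stream_si32 stream to the loop body would trigger
the early-flush behavior described above on a four-buffer machine.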

Maybe the proper name for this concept is simply "WriteCombiningBuffer."
I'm not sure whether some other architecture has store buffers that do
something other than write-combining, so I was trying to use a fairly
generic name to mean "some compiler-controlled hardware buffer."

There's a similar concept on the load side, though I don't know whether
any existing processors actually implement things that way.  I know of
(academic) architectures where prefetches fill independent prefetch
buffers and one wouldn't want to prefetch too many different things
because they would start filling each other's buffers.  That kind of
behavior could be captured by this model.

The key factor is contention.  There's a limited hardware memory buffer
resource and compilers have to be careful not to oversubscribe it.  I
don't know what the right name for it is.  Some architectures will
probably have more than one type of such resource, so for now maybe we
should just model write-combining buffers and leave it at that.  If
other such resources pop up, we can model them under different names.

I think that's what we should do.
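
For concreteness, the query for such a resource might look something
like the sketch below.  The class and method names are hypothetical,
invented here to illustrate the shape of the interface, not part of
any existing LLVM API:

// Hypothetical sketch only; names are illustrative.
class TargetSystemModel {
public:
  // Maximum number of independent non-temporal store sequences that
  // the hardware can write-combine simultaneously.  A loop transform
  // would cap the number of NT store streams it creates per loop at
  // this value to avoid the contention described above.
  unsigned getNumWriteCombiningBuffers() const;
};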

>> >> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
>> >> the L2 cache and so on.
>> >
>> > Can passes rely on it?
>>
>> Yes.
>
> Naively, I'd put Blue Gene/Q's L1P cache between the L1 and the L2,
> i.e. the L1P would be getCacheLevel(1) and getCacheLevel(2) would be
> L2. How would you model it instead?

That seems ok to me.  As I understand it, L1P is a little awkward in
that L2 data doesn't get moved to L1P; it gets moved to L1.  L1P is
really a prefetch buffer, right?  One wouldn't do, say, cache blocking
for L1P.  In that sense, maybe modeling it as a cache level isn't the
right thing.

How does software make use of L1P?  I understand compilers can insert
data prefetches, and the prefetched data resides in L1P, presumably
until it is accessed, at which point it moves to L1.  I suppose the
size of L1P could determine how aggressively compilers prefetch.  Is
that the idea, or are you thinking of something else?
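
If it is, a pass might use such a size query roughly as follows.  This
is purely illustrative; none of these names exist in the proposal:

// Hypothetical sketch: how many cache lines each prefetch stream can
// keep in flight before streams start evicting each other from a
// prefetch buffer of the given capacity.
unsigned linesPerStream(unsigned PrefetchBufferBytes,
                        unsigned LineBytes,
                        unsigned NumStreams) {
  return PrefetchBufferBytes / (LineBytes * NumStreams);
}

A software prefetcher could then run fewer streams, or run them less
far ahead, once that bound is reached.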

                            -David

