[llvm-dev] RFC: System (cache, etc.) model for LLVM

David Greene via llvm-dev llvm-dev at lists.llvm.org
Thu Nov 1 14:55:54 PDT 2018


Michael Kruse via llvm-dev <llvm-dev at lists.llvm.org> writes:

> On Thu, Nov 1, 2018 at 15:21, David Greene <dag at cray.com> wrote:
>> > thank you for sharing the system hierarchy model. IMHO it makes a lot
>> > of sense, although I don't know which of today's passes would make use
>> > of it. Here are my remarks.
>>
>> LoopDataPrefetch would use it via the existing TTI interfaces, but I
>> think that's about it for now.  It's a bit of a chicken-and-egg, in that
>> passes won't use it if it's not there and there's no push to get it in
>> because few things use it.  :)
>
> What kinds of passes are using it in the Cray compiler?

Not sure how much I can say about that, unfortunately.

>> > I am wondering how one could model the following features using this
>> > model, or whether they should be part of a performance model at all:
>> >
>> >  * ARM's big.LITTLE
>>
>> How is this modeled in the current AArch64 .td files?  The current
>> design doesn't capture heterogeneity at all, not because we're not
>> interested but simply because our compiler captures that at a higher
>> level outside of LLVM.
>
> AFAIK it is not handled at all. Any architecture that supports
> big.LITTLE will return 0 on getCacheLineSize(). See
> AArch64Subtarget::initializeProperties().

Ok.  I would like to start posting patches for review without
speculating too much on fancy/exotic things that may come later.  We
shouldn't do anything that precludes extensions but I don't want to get
bogged down in a lot of details on things related to a small number of
targets.  Let's get the really common stuff in first.  What do you
think?

>> >  * write-back / write-through write buffers
>>
>> Do you mean for caches, or something else?
>
> https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies
>
> Basically, with write-through, every store is a non-temporal store (or
> every temporal store is a write-through, depending on how you view it)

A write-through store isn't the same thing as a non-temporal store, at
least in my understanding of the term from X86 and AArch64.  A
non-temporal store bypasses the cache entirely.

I'm struggling a bit to understand how a compiler would make use of the
cache's write-back policy.
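To make the terminology concrete: a non-temporal store is still a coherent
store; it only hints that the line should bypass the cache hierarchy.  A
minimal sketch using the SSE2 intrinsic (with a plain-store fallback so it
stays portable; illustration only):

```cpp
#include <cassert>
#if defined(__SSE2__)
#include <emmintrin.h> // _mm_stream_si32, _mm_sfence
#endif

// A non-temporal (movnt) store bypasses the cache but remains a
// coherent store: later loads still observe the value written.
static void stream_store(int *Dst, int Value) {
#if defined(__SSE2__)
  _mm_stream_si32(Dst, Value); // non-temporal store (movnti)
  _mm_sfence();                // order NT stores with later accesses
#else
  *Dst = Value;                // plain (cacheable) store elsewhere
#endif
}
```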

>> >>   class TargetSoftwarePrefetcherInfo {
>> >>     /// Should we do software prefetching at all?
>> >>     ///
>> >>     bool isEnabled() const;
>> >
>> > isEnabled sounds like something configurable at runtime.
>>
>> Currently we use it to allow some subtargets to do software prefetching
>> and prevent it for others.  I see how the name could be confusing
>> though.  Maybe ShouldDoPrefetching?
>
> isPrefetchingProfitable()?

Sounds good.

> If it is a hardware property:
> isSupported()
> (ie. prefetch instruction would be a no-op on other implementations)

Oh, I hadn't even thought of that possibility.
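For concreteness, the two suggestions could coexist in the interface.  A
hypothetical sketch (names illustrative, not the actual RFC patch):

```cpp
#include <cassert>

// Sketch of the split discussed above: isSupported() reports a hardware
// property (prefetch instructions are not no-ops on this
// implementation), while isPrefetchingProfitable() is the tuning
// decision a pass would actually consult.
class TargetSoftwarePrefetcherInfo {
public:
  TargetSoftwarePrefetcherInfo(bool S, bool P)
      : Supported(S), Profitable(P) {}

  // True if the subtarget actually implements prefetch instructions.
  bool isSupported() const { return Supported; }

  // True if software prefetching is expected to pay off; only
  // meaningful when isSupported() is also true.
  bool isPrefetchingProfitable() const { return Supported && Profitable; }

private:
  bool Supported;
  bool Profitable;
};
```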

>> > Is there a way to express at which level the number of streams is
>> > instance, a core might be able to track 16 streams, but if 4 threads
>> > are running (SMT), each can only use 4.
>>
>> I suppose we could couple the streaming information to an execution
>> resource, similar to what is done with cache levels to express this kind
>> of sharing.  We haven't found a need for it but that doesn't mean it
>> wouldn't be useful for other/new targets.
>
> The example above is IBM's Blue Gene/Q processor, so yes, such targets do exist.

Ok.

>> > PowerPC's dcbt/dcbtst instruction allows explicitly specifying to the
>> > hardware which streams it should establish. Do the buffer counts
>> > include explicitly and automatically established streams? Do
>> > non-stream accesses (e.g. stack accesses) count towards them?
>>
>> It's up to the target maintainer to decide what the numbers mean.
>> Obviously passes have to have some notion of what things mean.  The
>> thing that establishes what a "stream" is in the user program lives
>> outside of the system model.  It may or may not consider random stack
>> accesses as part of a stream.
>>
>> This is definitely an area for exploration.  Since we only have machines
>> with two major targets, we didn't need to contend with more exotic
>> things.  :)
>
> IMHO it would be good if passes and targets agree on an interpretation
> of this number when designing the interface.

Of course.

> Again, from the Blue Gene/Q: What counts as stream is configurable at
> runtime via a hardware register. It supports 3 settings:
> * Interpret every memory access as start of a stream
> * Interpret a stream when there are 2 consecutive cache misses
> * Only establish streams via dcbt instructions.

I think we're interpreting "streaming" differently.  In this design, a
"stream" is a sequence of memory operations that should bypass the cache
because the data will never be reused (at least not in a timely manner).

On X86 processors, the compiler has complete software control over
streaming through the use of movnt instructions.  AArch64 has a similar,
though very restricted, capability until SVE.  dcbt is more like a
prefetch than a movnt, right?

It sounds like BG/Q has a hardware prefetcher configurable by software.
I think that would fit better under a completely different resource
type.  The compiler's use of dcbt would be guided by
TargetSoftwarePrefetcherInfo which could be extended to represent BG/Q's
configurable hardware prefetcher.
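As a strawman, such a resource type might look like the following (purely
illustrative, not part of the proposed patches; the three modes mirror the
BG/Q settings described above):

```cpp
#include <cassert>

// Hypothetical resource type for a software-configurable hardware
// prefetcher like the BG/Q's.  Illustration only.
enum class StreamEstablishMode {
  EveryAccess,          // every memory access starts a stream
  TwoConsecutiveMisses, // establish a stream after 2 consecutive misses
  ExplicitOnly          // only streams established via dcbt
};

class TargetHardwarePrefetcherInfo {
  StreamEstablishMode Mode;

public:
  explicit TargetHardwarePrefetcherInfo(StreamEstablishMode M) : Mode(M) {}
  StreamEstablishMode getStreamEstablishMode() const { return Mode; }
};
```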

>> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
>> the L2 cache and so on.
>
> Can passes rely on it?

Yes.
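So a pass can index levels outward from the L1.  A sketch against a
hypothetical shape of the interface (the real patches may differ):

```cpp
#include <cassert>
#include <vector>

// Hypothetical shape of the cache-model queries discussed here:
// getCacheLevel(0) is the L1, getCacheLevel(1) the L2, and so on.
struct TargetCacheLevelInfo {
  unsigned SizeInBytes;
  unsigned LineSizeInBytes;
};

class TargetMemorySystemInfo {
  std::vector<TargetCacheLevelInfo> Levels;

public:
  explicit TargetMemorySystemInfo(std::vector<TargetCacheLevelInfo> L)
      : Levels(std::move(L)) {}

  unsigned getNumCacheLevels() const { return Levels.size(); }

  const TargetCacheLevelInfo &getCacheLevel(unsigned I) const {
    return Levels[I]; // 0 = L1, 1 = L2, ...
  }
};
```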

>> Probably.  Most X86 implementations direct all data prefetches to the
>> same cache level so we didn't find a need to model this, but it makes
>> sense to allow for it.
>
> Again the Blue Gene/Q: Streams prefetch into the L1P cache (P for
> prefetch), but a dcbt instruction is necessary to establish the cache
> line into the L1 cache.

Yep, makes sense.

>> > Adding more specific subtargets with more refined execution models
>> > seem fine for me.  But is it reasonable to manage a database of all
>> > processors ever produced in the compiler?
>>
>> No it is not.  :)  That's why this is an open question.  We've found it
>> perfectly adequate to define a single model for each major processor
>> generation, but as I said we use a limited number of SKUs.  We will
>> need input from the community on this.
>
> Independently of whether subtargets for SKUs are added, could we
> (also) define these parameters via the command line, like xlc's
> -qcache option?

I think that would be very useful.
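In-tree this would presumably be a cl::opt; here is an LLVM-independent
sketch of parsing such an override, using an invented
level:size:linesize syntax (xlc's actual -qcache grammar differs):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parse a hypothetical cache-override spec such as
// "1:32768:64,2:1048576:64" (level:size:linesize, comma-separated).
struct CacheOverride {
  unsigned Level, Size, LineSize;
};

static std::vector<CacheOverride> parseCacheSpec(const std::string &Spec) {
  std::vector<CacheOverride> Out;
  std::istringstream SS(Spec);
  std::string Entry;
  while (std::getline(SS, Entry, ',')) {
    std::istringstream ES(Entry);
    CacheOverride C;
    char Sep1, Sep2;
    // Only accept well-formed "level:size:linesize" entries.
    if (ES >> C.Level >> Sep1 >> C.Size >> Sep2 >> C.LineSize &&
        Sep1 == ':' && Sep2 == ':')
      Out.push_back(C);
  }
  return Out;
}
```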

                            -David

