[llvm-dev] RFC: System (cache, etc.) model for LLVM
David Greene via llvm-dev
llvm-dev at lists.llvm.org
Thu Nov 1 14:55:54 PDT 2018
Michael Kruse via llvm-dev <llvm-dev at lists.llvm.org> writes:
> On Thu, Nov 1, 2018 at 15:21, David Greene <dag at cray.com> wrote:
>> > thank you for sharing the system hierarchy model. IMHO it makes a lot
>> > of sense, although I don't know which of today's passes would make use
>> > of it. Here are my remarks.
>>
>> LoopDataPrefetch would use it via the existing TTI interfaces, but I
>> think that's about it for now. It's a bit of a chicken-and-egg, in that
>> passes won't use it if it's not there and there's no push to get it in
>> because few things use it. :)
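For reference, here is roughly how a pass consults the existing TTI
prefetch hooks today; the system model would just supply the numbers
behind them. The real declarations live in TargetTransformInfo.h, so
treat this as a sketch only:

  #include "llvm/Analysis/TargetTransformInfo.h"

  // Sketch: the prefetch-related hooks LoopDataPrefetch already uses.
  void queryPrefetchParameters(const llvm::TargetTransformInfo &TTI) {
    unsigned LineSize   = TTI.getCacheLineSize();             // bytes per line
    unsigned Distance   = TTI.getPrefetchDistance();          // how far ahead to prefetch
    unsigned MinStride  = TTI.getMinPrefetchStride();         // smallest stride worth prefetching
    unsigned ItersAhead = TTI.getMaxPrefetchIterationsAhead();
    (void)LineSize; (void)Distance; (void)MinStride; (void)ItersAhead;
  }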
>
> What kinds of passes are using it in the Cray compiler?
Not sure how much I can say about that, unfortunately.
>> > I am wondering how one could model the following features using this
>> > model, or whether they should be part of a performance model at all:
>> >
>> > * ARM's big.LITTLE
>>
>> How is this modeled in the current AArch64 .td files? The current
>> design doesn't capture heterogeneity at all, not because we're not
>> interested but simply because our compiler captures that at a higher
>> level outside of LLVM.
>
> AFAIK it is not handled at all. Any architecture that supports
> big.LITTLE will return 0 on getCacheLineSize(). See
> AArch64Subtarget::initializeProperties().
Ok. I would like to start posting patches for review without
speculating too much on fancy/exotic things that may come later. We
shouldn't do anything that precludes extensions but I don't want to get
bogged down in a lot of details on things related to a small number of
targets. Let's get the really common stuff in first. What do you
think?
>> > * write-back / write-through write buffers
>>
>> Do you mean for caches, or something else?
>
> https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies
>
> Basically, with write-through, every store is a non-temporal store (or
> every temporal store becomes a write-through, depending on how you view
> it).
A write-through store isn't the same thing as a non-temporal store, at
least in my understanding of the term from X86 and AArch64. A
non-temporal store bypasses the cache entirely.
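To make the distinction concrete, here is a minimal x86 illustration (not
from the proposal, just an example): a write-through store still leaves
the line resident in the cache, it just also pushes the data outward,
whereas a movnt store skips the hierarchy altogether.

  #include <emmintrin.h>  // SSE2: _mm_stream_si32, _mm_sfence

  // Fill a buffer with non-temporal stores so the data never pollutes
  // the cache hierarchy.
  void fill_bypassing_cache(int *dst, int value, int n) {
    for (int i = 0; i < n; ++i)
      _mm_stream_si32(dst + i, value);  // movnti: bypasses the caches
    _mm_sfence();                       // order the streamed stores
  }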
I'm struggling a bit to understand how a compiler would make use of the
cache's write-back policy.
>> >> class TargetSoftwarePrefetcherInfo {
>> >> /// Should we do software prefetching at all?
>> >> ///
>> >> bool isEnabled() const;
>> >
>> > isEnabled sounds like something configurable at runtime.
>>
>> Currently we use it to allow some subtargets to do software prefetching
>> and prevent it for others. I see how the name could be confusing
>> though. Maybe ShouldDoPrefetching?
>
> isPrefetchingProfitable()?
Sounds good.
> If it is a hardware property:
> isSupported()
> (ie. prefetch instruction would be a no-op on other implementations)
Oh, I hadn't even thought of that possibility.
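Combining the two suggestions, the interface might end up looking
something like this (just a sketch to anchor the naming discussion,
nothing final):

  class TargetSoftwarePrefetcherInfo {
  public:
    /// Does this implementation actually honor prefetch instructions,
    /// or are they no-ops?
    bool isSupported() const;

    /// Is software prefetching expected to pay off on this subtarget?
    /// (Replaces the original isEnabled().)
    bool isPrefetchingProfitable() const;
  };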
>> > Is there a way to express at which level the number of streams is
>> > shared? For instance, a core might be able to track 16 streams, but
>> > if 4 threads are running (SMT), each can only use 4.
>>
>> I suppose we could couple the streaming information to an execution
>> resource, similar to what is done with cache levels to express this kind
>> of sharing. We haven't found a need for it but that doesn't mean it
>> wouldn't be useful for other/new targets.
>
> The example above is IBM's Blue Gene/Q processor, so yes, such targets do exist.
Ok.
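To sketch what that coupling could look like (the names below are
invented purely for illustration, they're not part of the proposal as
posted):

  // Hypothetical: hang the stream-buffer budget off the execution
  // resource that owns it, so SMT sharing falls out of the hierarchy.
  class TargetStreamBufferInfo {
  public:
    /// Total buffers provided by the owning resource, e.g. 16 per core.
    unsigned getNumBuffers() const;

    /// Buffers available to each of NumThreads sharing the resource,
    /// e.g. 16 / 4 = 4 under 4-way SMT.
    unsigned getNumBuffersPerThread(unsigned NumThreads) const {
      return NumThreads ? getNumBuffers() / NumThreads : getNumBuffers();
    }
  };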
>> > PowerPC's dcbt/dcbtst instructions allow explicitly specifying to the
>> > hardware which streams it should establish. Do the buffer counts
>> > include explicitly and automatically established streams? Do
>> > non-stream accesses (e.g. stack accesses) count towards the limit?
>>
>> It's up to the target maintainer to decide what the numbers mean.
>> Obviously passes have to have some notion of what things mean. The
>> thing that establishes what a "stream" is in the user program lives
>> outside of the system model. It may or may not consider random stack
>> accesses as part of a stream.
>>
>> This is definitely an area for exploration. Since we only have machines
>> with two major targets, we didn't need to contend with more exotic
>> things. :)
>
> IMHO it would be good if passes and targets agree on an interpretation
> of this number when designing the interface.
Of course.
> Again, from the Blue Gene/Q: What counts as a stream is configurable at
> runtime via a hardware register. It supports 3 settings:
> * Interpret every memory access as the start of a stream
> * Establish a stream when there are 2 consecutive cache misses
> * Only establish streams via dcbt instructions.
I think we're interpreting "streaming" differently. In this design, a
"stream" is a sequence of memory operations that should bypass the cache
because the data will never be reused (at least not in a timely manner).
On X86 processors the compiler has complete software control over
streaming through the use of movnt instructions. AArch64 has a similar,
though very restricted, capability until SVE. dcbt is more like a
prefetch than a movnt, right?
It sounds like BG/Q has a hardware prefetcher configurable by software.
I think that would fit better under a completely different resource
type. The compiler's use of dcbt would be guided by
TargetSoftwarePrefetcherInfo which could be extended to represent BG/Q's
configurable hardware prefetcher.
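For example, whether as an extension of TargetSoftwarePrefetcherInfo or
as a separate record, the three BG/Q settings you listed could become
modes along these lines (all of these names are hypothetical):

  // Hypothetical extension for software-configurable hardware
  // prefetchers like the BG/Q's; none of these names exist today.
  class TargetHardwarePrefetcherInfo {
  public:
    enum class StreamEstablishMode {
      EveryAccess,        // every memory access starts a stream
      OnConsecutiveMiss,  // a stream starts after 2 consecutive misses
      ExplicitOnly        // only dcbt-style hints establish streams
    };

    /// Which modes software can put the hardware prefetcher into.
    bool supportsMode(StreamEstablishMode M) const;
  };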
>> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
>> the L2 cache and so on.
>
> Can passes rely on it?
Yes.
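For example, a pass could walk the levels from L1 outward and rely on
that ordering. In the sketch below only getCacheLevel() comes from the
proposal; the surrounding names (TargetMemorySystemInfo,
getNumCacheLevels, getSizeInBytes) are placeholders for illustration:

  // Pick the innermost cache level large enough to hold a working set,
  // e.g. to choose a blocking factor.
  unsigned pickBlockingLevel(const TargetMemorySystemInfo &MSI,
                             unsigned WorkingSetBytes) {
    for (unsigned L = 0, E = MSI.getNumCacheLevels(); L != E; ++L)
      if (MSI.getCacheLevel(L).getSizeInBytes() >= WorkingSetBytes)
        return L;                 // L == 0 is the L1, L == 1 the L2, ...
    return MSI.getNumCacheLevels() - 1;  // fall back to the last level
  }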
>> Probably. Most X86 implementations direct all data prefetches to the
>> same cache level so we didn't find a need to model this, but it makes
>> sense to allow for it.
>
> Again the Blue Gene/Q: Streams prefetch into the L1P cache (P for
> prefetch), but a dcbt instruction is necessary to establish the cache
> line into the L1 cache.
Yep, makes sense.
>> > Adding more specific subtargets with more refined execution models
>> > seems fine to me. But is it reasonable to manage a database of all
>> > processors ever produced in the compiler?
>>
>> No it is not. :) That's why this is an open question. We've found it
>> perfectly adequate to define a single model for each major processor
>> generation, but as I said we use a limited number of SKUs. We will
>> need input from the community on this.
>
> Independently of whether subtargets for SKUs are added, could we
> (also) be able to define these parameters via the command line, like
> xlc's -qcache option?
I think that would be very useful.
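A sketch of what such an override could look like with the usual cl::opt
machinery (the option and variable names here are invented):

  #include "llvm/Support/CommandLine.h"
  using namespace llvm;

  // Hypothetical overrides in the spirit of xlc's -qcache; a target's
  // system model would consult these before its built-in defaults.
  static cl::opt<unsigned> OverrideCacheLineSize(
      "override-cache-line-size", cl::init(0),
      cl::desc("Override the modeled cache line size in bytes "
               "(0 = use the target default)"));

  static cl::opt<unsigned> OverrideL1DSize(
      "override-l1d-size", cl::init(0),
      cl::desc("Override the modeled L1 data cache size in bytes"));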
-David