[llvm-dev] RFC: System (cache, etc.) model for LLVM
Michael Kruse via llvm-dev
llvm-dev at lists.llvm.org
Thu Nov 1 14:36:24 PDT 2018
On Thu, Nov 1, 2018 at 15:21, David Greene <dag at cray.com> wrote:
> > thank you for sharing the system hierarchy model. IMHO it makes a lot
> > of sense, although I don't know which of today's passes would make use
> > of it. Here are my remarks.
>
> LoopDataPrefetch would use it via the existing TTI interfaces, but I
> think that's about it for now. It's a bit of a chicken-and-egg, in that
> passes won't use it if it's not there and there's no push to get it in
> because few things use it. :)
What kinds of passes use it in the Cray compiler?
> > I am wondering how one could model the following features using this
> > model, or whether they should be part of a performance model at all:
> >
> > * ARM's big.LITTLE
>
> How is this modeled in the current AArch64 .td files? The current
> design doesn't capture heterogeneity at all, not because we're not
> interested but simply because our compiler captures that at a higher
> level outside of LLVM.
AFAIK it is not handled at all. Any architecture that supports
big.LITTLE will return 0 on getCacheLineSize(). See
AArch64Subtarget::initializeProperties().
> > * write-back / write-through write buffers
>
> Do you mean for caches, or something else?
https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies
Basically, with write-through, every store is a non-temporal store (or,
depending on how you view it, every temporal store behaves like a
write-through).
> >> class TargetSoftwarePrefetcherInfo {
> >> /// Should we do software prefetching at all?
> >> ///
> >> bool isEnabled() const;
> >
> > isEnabled sounds like something configurable at runtime.
>
> Currently we use it to allow some subtargets to do software prefetching
> and prevent it for others. I see how the name could be confusing
> though. Maybe ShouldDoPrefetching?
isPrefetchingProfitable()?
If it is a hardware property:
isSupported()
(ie. prefetch instruction would be a no-op on other implementations)
> > Is there a way to express at which level the number of streams is
> > shared? For instance, a core might be able to track 16 streams, but
> > if 4 threads are running (SMT), each can only use 4.
>
> I suppose we could couple the streaming information to an execution
> resource, similar to what is done with cache levels to express this kind
> of sharing. We haven't found a need for it but that doesn't mean it
> wouldn't be useful for other/new targets.
The example above is IBM's Blue Gene/Q processor, so yes, such targets do exist.
> > PowerPC's dcbt/dcbtst instruction allows explicitly specifying to the
> > hardware which streams it should establish. Do the buffer counts
> > include explicitly and automatically established streams? Do
> > non-stream accesses (e.g. stack accesses) count towards these limits?
>
> It's up to the target maintainer to decide what the numbers mean.
> Obviously passes have to have some notion of what things mean. The
> thing that establishes what a "stream" is in the user program lives
> outside of the system model. It may or may not consider random stack
> accesses as part of a stream.
>
> This is definitely an area for exploration. Since we only have machines
> with two major targets, we didn't need to contend with more exotic
> things. :)
IMHO it would be good if passes and targets agree on an interpretation
of this number when designing the interface.
Again, from the Blue Gene/Q: what counts as a stream is configurable at
runtime via a hardware register. It supports 3 settings:
* Interpret every memory access as the start of a stream
* Establish a stream after 2 consecutive cache misses
* Only establish streams via dcbt instructions.
> >> class TargetMemorySystemInfo {
> >> const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
> >>
> >> /// getNumLevels - Return the number of cache levels this target has.
> >> ///
> >> unsigned getNumLevels() const;
> >>
> >> /// Cache level iterators
> >> ///
> >> cachelevel_iterator cachelevel_begin() const;
> >> cachelevel_iterator cachelevel_end() const;
> >
> > May users of this class assume that a level refers to a specific
> > cache. E.g. getCacheLevel(0) being the L1 cache. Or so they have to
> > search for a cache of a specific size?
>
> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> the L2 cache and so on.
Can passes rely on it?
> >> //===--------------------------------------------------------------------===//
> >> // Stream Buffer Information
> >> //
> >> const TargetStreamBufferInfo *getStreamBufferInfo() const;
> >>
> >> //===--------------------------------------------------------------------===//
> >> // Software Prefetcher Information
> >> //
> >> const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
> >
> > Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> > level? Some ISAs have multiple prefetchers/prefetch instructions
> > for different levels.
>
> Probably. Most X86 implementations direct all data prefetches to the
> same cache level so we didn't find a need to model this, but it makes
> sense to allow for it.
Again the Blue Gene/Q: streams prefetch into the L1P cache (P for
prefetch), but a dcbt instruction is necessary to bring the cache line
into the L1 cache.
> >> An open question is how to handle different SKUs within a subtarget
> >> family. We modeled the limited number of SKUs used in our products
> >> via multiple subtargets, so this wasn't a heavy burden for us, but a
> >> more robust implementation might allow for multiple ``MemorySystem``
> >> and/or ``ExecutionEngine`` models for a given subtarget. It's not yet
> >> clear whether that's a good/necessary thing and if it is, how to
> >> specify it with a compiler switch. ``-mcpu=shy-enigma
> >> -some-switch-to-specify-memory-and-execution-models``? It may very
> >> well be sufficient to have a general system model that applies
> >> relatively well over multiple SKUs.
> >
> > Adding more specific subtargets with more refined execution models
> > seem fine for me.
> > But is it reasonable to manage a database of all processors ever
> > produced in the compiler?
>
> No it is not. :) That's why this is an open question. We've found it
> perfectly adequate to define a single model for each major processor
> generation, but as I said we use a limited number of SKUs. We will
> need input from the community on this.
Independently of whether subtargets for SKUs are added, could we (also)
define these parameters via the command line, like xlc's -qcache
option?
Michael