[llvm-dev] RFC: System (cache, etc.) model for LLVM

Michael Kruse via llvm-dev llvm-dev at lists.llvm.org
Thu Nov 1 14:36:24 PDT 2018


On Thu, Nov 1, 2018 at 15:21, David Greene <dag at cray.com> wrote:
> > thank you for sharing the system hierarchy model. IMHO it makes a lot
> > of sense, although I don't know which of today's passes would make use
> > of it. Here are my remarks.
>
> LoopDataPrefetch would use it via the existing TTI interfaces, but I
> think that's about it for now.  It's a bit of a chicken-and-egg, in that
> passes won't use it if it's not there and there's no push to get it in
> because few things use it.  :)

What kinds of passes use it in the Cray compiler?


> > I am wondering how one could model the following features using this
> > model, or whether they should be part of a performance model at all:
> >
> >  * ARM's big.LITTLE
>
> How is this modeled in the current AArch64 .td files?  The current
> design doesn't capture heterogeneity at all, not because we're not
> interested but simply because our compiler captures that at a higher
> level outside of LLVM.

AFAIK it is not handled at all. Any architecture that supports
big.LITTLE will return 0 on getCacheLineSize(). See
AArch64Subtarget::initializeProperties().
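For context, a pass consuming that value today has to treat the zero return as "unknown" and pick a conservative fallback itself. A minimal sketch of that guard (the helper names and the 64-byte fallback are hypothetical, not LLVM API):

```cpp
#include <cassert>

// Hypothetical stand-in for TargetTransformInfo::getCacheLineSize(),
// which returns 0 on big.LITTLE AArch64 subtargets (see
// AArch64Subtarget::initializeProperties()).
unsigned getCacheLineSize(bool IsBigLittle) {
  return IsBigLittle ? 0 : 64;
}

// A pass must treat 0 as "cache line size unknown" and fall back to a
// conservative default; the 64-byte fallback here is an assumption.
unsigned effectiveCacheLineSize(bool IsBigLittle) {
  unsigned Size = getCacheLineSize(IsBigLittle);
  return Size ? Size : 64;
}
```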


> >  * write-back / write-through write buffers
>
> Do you mean for caches, or something else?

https://en.wikipedia.org/wiki/Cache_%28computing%29#Writing_policies

Basically, with write-through, every store is a non-temporal store (or,
depending on how you view it, temporal stores become write-through).


> >>   class TargetSoftwarePrefetcherInfo {
> >>     /// Should we do software prefetching at all?
> >>     ///
> >>     bool isEnabled() const;
> >
> > isEnabled sounds like something configurable at runtime.
>
> Currently we use it to allow some subtargets to do software prefetching
> and prevent it for others.  I see how the name could be confusing
> though.  Maybe ShouldDoPrefetching?

isPrefetchingProfitable()?

If it is a hardware property:
isSupported()
(i.e., the prefetch instruction would be a no-op on implementations
that do not support it)
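To make the distinction concrete, here is a sketch of how the two queries could coexist, splitting the single isEnabled() into a hardware-capability query and a tuning query (the names follow the suggestion above; this is not the actual RFC interface):

```cpp
// Hypothetical refinement of TargetSoftwarePrefetcherInfo: one query for
// the hardware property, one for the subtarget tuning decision.
class TargetSoftwarePrefetcherInfo {
  bool HasPrefetchInsn;     // prefetch instruction exists and is not a no-op
  bool ProfitableByDefault; // tuning decision for this subtarget

public:
  TargetSoftwarePrefetcherInfo(bool Supported, bool Profitable)
      : HasPrefetchInsn(Supported), ProfitableByDefault(Profitable) {}

  // Hardware property: does the prefetch instruction do anything at all?
  bool isSupported() const { return HasPrefetchInsn; }

  // Tuning property: should passes emit software prefetches by default?
  // Profitability implies support.
  bool isPrefetchingProfitable() const {
    return HasPrefetchInsn && ProfitableByDefault;
  }
};
```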



> > Is there a way to express at which level the number of streams is
> > shared? For instance, a core might be able to track 16 streams, but
> > if 4 threads are running (SMT), each can only use 4.
>
> I suppose we could couple the streaming information to an execution
> resource, similar to what is done with cache levels to express this kind
> of sharing.  We haven't found a need for it but that doesn't mean it
> wouldn't be useful for other/new targets.

The example above is IBM's Blue Gene/Q processor, so yes, such targets do exist.

> > PowerPC's dcbt/dcbtst instruction allows explicitly specifying to the
> > hardware which streams it should establish. Do the buffer counts
> > include explicitly and automatically established streams? Do
> > non-stream accesses (e.g. stack accesses) count towards these limits?
>
> It's up to the target maintainer to decide what the numbers mean.
> Obviously passes have to have some notion of what things mean.  The
> thing that establishes what a "stream" is in the user program lives
> outside of the system model.  It may or may not consider random stack
> accesses as part of a stream.
>
> This is definitely an area for exploration.  Since we only have machines
> with two major targets, we didn't need to contend with more exotic
> things.  :)

IMHO it would be good if passes and targets agreed on an interpretation
of this number when designing the interface.

Again, from the Blue Gene/Q: what counts as a stream is configurable at
runtime via a hardware register. It supports three settings:
* Interpret every memory access as the start of a stream
* Establish a stream after 2 consecutive cache misses
* Only establish streams via dcbt instructions.
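If targets and passes are to agree on an interpretation, the model would presumably also need to expose which policy is in effect. A hypothetical sketch of such an enumeration (not part of the proposed interface; the names are invented):

```cpp
// Hypothetical enumeration of the Blue Gene/Q stream-establishment
// policies described above.
enum class StreamPolicy {
  EveryAccess,  // every memory access starts a stream
  OnTwoMisses,  // a stream is established after 2 consecutive cache misses
  ExplicitOnly  // only dcbt instructions establish streams
};

// Example consumer: a pass deciding whether it must emit explicit
// dcbt-style prefetches to get any streams at all.
bool needsExplicitStreamSetup(StreamPolicy P) {
  return P == StreamPolicy::ExplicitOnly;
}
```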


> >>   class TargetMemorySystemInfo {
> >>     const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
> >>
> >>     /// getNumLevels - Return the number of cache levels this target has.
> >>     ///
> >>     unsigned getNumLevels() const;
> >>
> >>     /// Cache level iterators
> >>     ///
> >>     cachelevel_iterator cachelevel_begin() const;
> >>     cachelevel_iterator cachelevel_end() const;
> >
> > May users of this class assume that a level refers to a specific
> > cache, e.g. getCacheLevel(0) being the L1 cache? Or do they have to
> > search for a cache of a specific size?
>
> The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
> the L2 cache and so on.

Can passes rely on it?
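If the convention is guaranteed, a pass could index levels directly instead of searching by size. A minimal sketch of that usage (the class shape follows the quoted interface, but the implementation and the example sizes are made up):

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-ins for the proposed cache-level classes; the
// sizes below are invented example values, not any real target's.
struct TargetCacheLevelInfo {
  unsigned SizeKiB;
  unsigned LineBytes;
};

class TargetMemorySystemInfo {
  std::vector<TargetCacheLevelInfo> Levels; // Levels[0] is the L1 cache
public:
  explicit TargetMemorySystemInfo(std::vector<TargetCacheLevelInfo> L)
      : Levels(std::move(L)) {}

  unsigned getNumLevels() const {
    return static_cast<unsigned>(Levels.size());
  }

  // Relies on the stated convention: level 0 is L1, level 1 is L2, ...
  const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const {
    return Levels[Level];
  }
};
```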

> >>     //===--------------------------------------------------------------------===//
> >>     // Stream Buffer Information
> >>     //
> >>     const TargetStreamBufferInfo *getStreamBufferInfo() const;
> >>
> >>     //===--------------------------------------------------------------------===//
> >>     // Software Prefetcher Information
> >>     //
> >>     const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
> >
> > Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> > level? Some ISAs have multiple prefetchers/prefetch instructions
> > for different levels.
>
> Probably.  Most X86 implementations direct all data prefetches to the
> same cache level so we didn't find a need to model this, but it makes
> sense to allow for it.

Again the Blue Gene/Q: streams prefetch into the L1P cache (P for
prefetch), but a dcbt instruction is necessary to establish the cache
line in the L1 cache.


> >> An open question is how to handle different SKUs within a subtarget
> >> family.  We modeled the limited number of SKUs used in our products
> >> via multiple subtargets, so this wasn't a heavy burden for us, but a
> >> more robust implementation might allow for multiple ``MemorySystem``
> >> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
> >> clear whether that's a good/necessary thing and if it is, how to
> >> specify it with a compiler switch.  ``-mcpu=shy-enigma
> >> -some-switch-to-specify-memory-and-execution-models``?  It may very
> >> well be sufficient to have a general system model that applies
> >> relatively well over multiple SKUs.
> >
> > Adding more specific subtargets with more refined execution models
> > seem fine for me.
> > But is it reasonable to manage a database of all processors ever
> > produced in the compiler?
>
> No it is not.  :)  That's why this is an open question.  We've found it
> perfectly adequate to define a single model for each major processor
> generation, but as I said we use a limited number of SKUs.  We will
> need input from the community on this.

Independently of whether subtargets for SKUs are added, could we (also)
define these parameters via the command line, like xlc's -qcache
option?
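As a sketch of what such a switch could accept, here is a tiny parser for a colon-separated key=value string, loosely modeled on xlc's -qcache suboptions (the exact option format and the function are invented for illustration; a real implementation would presumably use LLVM's cl::opt machinery):

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse a string like "level=1:line=64:size=32" into key/value pairs.
// The syntax is only loosely modeled on xlc's -qcache; this is not a
// real LLVM command-line option.
std::map<std::string, unsigned> parseCacheOption(const std::string &Opt) {
  std::map<std::string, unsigned> Params;
  std::istringstream Stream(Opt);
  std::string Field;
  while (std::getline(Stream, Field, ':')) {
    auto Eq = Field.find('=');
    if (Eq == std::string::npos)
      continue; // ignore malformed fields in this sketch
    Params[Field.substr(0, Eq)] =
        static_cast<unsigned>(std::stoul(Field.substr(Eq + 1)));
  }
  return Params;
}
```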

Michael

