[llvm-dev] RFC: System (cache, etc.) model for LLVM

David Greene via llvm-dev llvm-dev at lists.llvm.org
Thu Nov 1 13:21:16 PDT 2018


Michael, thank you for commenting!  Responses inline.

Let's continue discussing and if this seems like a reasonable way to
proceed, I can start posting patches for review.

                              -David

Michael Kruse <llvmdev at meinersbur.de> writes:

> thank you for sharing the system hierarchy model. IMHO it makes a lot
> of sense, although I don't know which of today's passes would make use
> of it. Here are my remarks.

LoopDataPrefetch would use it via the existing TTI interfaces, but I
think that's about it for now.  It's a bit of a chicken-and-egg, in that
passes won't use it if it's not there and there's no push to get it in
because few things use it.  :)
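
To make that concrete, here is a rough sketch of how a subtarget's TTI
implementation could delegate the existing hook to the proposed model
(glossing over the cache-lines vs. instructions unit mismatch discussed
further down; getMemorySystemInfo() and the accessor names are assumed,
not existing API):

  // Sketch only: ST is the subtarget.  getMemorySystemInfo() and the
  // prefetcher accessors come from this proposal, nothing in-tree.
  unsigned MyTTIImpl::getPrefetchDistance() const {
    const TargetSoftwarePrefetcherInfo *SPI =
        ST->getMemorySystemInfo()->getSoftwarePrefetcherInfo();
    if (!SPI || !SPI->isEnabled())
      return 0; // LoopDataPrefetch skips prefetching on 0.
    return SPI->getMinDistance();
  }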

> I am wondering how one could model the following features using this
> model, or whether they should be part of a performance model at all:
>
>  * ARM's big.LITTLE

How is this modeled in the current AArch64 .td files?  The current
design doesn't capture heterogeneity at all, not because we're not
interested but simply because our compiler captures that at a higher
level outside of LLVM.

>  * NUMA hierarchies (are the NUMA domains 'caches'?)
>
>  * Total available RAM
>
>  * remote memory (e.g. RAM on an accelerator mapped into the address space)
>
>  * scratch pad

I expect we would expand TargetMemorySystemInfo to hold different kinds
of memory-related things.  Each of these could be a memory resource.  Or
maybe we would want something that lives "next to"
TargetMemorySystemInfo.
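
To sketch the shape of what I mean (everything below is hypothetical,
not part of the proposal as posted):

  // Hypothetical: one descriptor per kind of addressable memory the
  // target exposes.
  class TargetMemoryResourceInfo {
  public:
    enum ResourceKind { MainMemory, RemoteMemory, ScratchPad, NUMADomain };

    ResourceKind getKind() const;
    uint64_t getSize() const;   // Capacity in bytes.
    unsigned getDomain() const; // NUMA domain ID, if applicable.
  };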

>  * write-back / write-through write buffers

Do you mean for caches, or something else?

>  * page size
>
>  * TLB capacity

>  * constructive/destructive interference
> (https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size)
>    Some architectures have instructions to zero entire cache lines,
> e.g. dcbz on PowerPC, but they require the cache line size to be
> correct.
> Also see https://www.mono-project.com/news/2016/09/12/arm64-icache/
>
>  * Instruction cache

These could go into TargetMemorySystemInfo I think.
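
Something along these lines, say (hypothetical names again):

  // Hypothetical additions to TargetMemorySystemInfo.
  uint64_t getPageSize() const;                 // In bytes.
  unsigned getTLBEntries(unsigned Level) const; // TLB capacity.
  unsigned getNumICacheLevels() const;
  const TargetCacheLevelInfo &getICacheLevel(unsigned Level) const;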

> On Tue, Oct 30, 2018 at 15:27, David Greene via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>   class TargetCacheLevelInfo {
>>     /// getWays - Return the number of ways.
>>     ///
>>     unsigned getWays() const;
>
> That is, associativity?

Yes.  Naming is certainly flexible.

> Bandwidth might be a useful addition, e.g. if a performance analysis
> tools uses the roofline model.

Yes.
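
For reference, the roofline model bounds attainable performance by

  min(peak FLOP/s, arithmetic intensity [FLOP/byte] * bandwidth [byte/s])

so e.g. at 100 GB/s of bandwidth and 0.25 FLOP/byte, the memory-bound
ceiling is 25 GFLOP/s.  A per-level getBandwidth() accessor (a
hypothetical name, not in the proposal as posted) would feed directly
into that.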

>>   class TargetSoftwarePrefetcherInfo {
>>     /// Should we do software prefetching at all?
>>     ///
>>     bool isEnabled() const;
>
> isEnabled sounds like something configurable at runtime.

Currently we use it to allow some subtargets to do software prefetching
and prevent it for others.  I see how the name could be confusing
though.  Maybe ShouldDoPrefetching?

>> ``get*Distance`` APIs provide general hints to guide the software
>> prefetcher.  The software prefetcher may choose to ignore them.
>> getMinDistance and getMaxDistance act as clamps to ensure the software
>> prefetcher doesn't do something wholly inappropriate.
>>
>> Distances are specified in terms of cache lines.  The current
>> ``TargetTransformInfo`` interfaces speak in terms of instructions or
>> iterations ahead.  Both can be useful and so we may want to add
>> iteration and/or instruction distances to this interface.
>
> Would it make sense to specify a prefetch distance in bytes instead of
> cache lines? The cache line might not be known at compile-time (e.g.
> ARM big.LITTLE), but it might still make sense to do software
> prefetching.

Sure, I think that would make sense.
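
A byte-based hint is also easy for a pass to convert whenever it does
know a line size; something like (sketch; getMinDistanceBytes is a
hypothetical byte-based variant):

  // Sketch: convert a byte distance to whole cache lines, rounding up.
  unsigned DistBytes = SPI->getMinDistanceBytes(); // hypothetical
  unsigned DistLines = (DistBytes + LineSize - 1) / LineSize;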

>> Code uses the ``getMax*Buffers`` APIs to judge whether streaming
>> should be done at all.  For example, if the number of available
>> streams greatly outweighs the hardware available, it makes little
>> sense to do streaming.  Performance will be dominated by the streams
>> that don't make use of the hardware, and the streams that do make use
>> of the hardware may actually perform worse.
>
> What counts as a stream? Some processors may support streams with
> strides and/or backward streams.

Yes.  We may want some additional information here to describe the
hardware's capability.
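
As a strawman, TargetStreamBufferInfo could grow queries like these
(hypothetical, not in the proposal as posted):

  // Hypothetical capability queries for the hardware stream detector.
  int getMaxTrackedStride() const;     // Largest stride, in bytes.
  bool supportsNegativeStride() const; // Backward streams?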

> Is there a way to express at which level the streams are shared? For
> instance, a core might be able to track 16 streams, but if 4 threads
> are running (SMT), each can only use 4.

I suppose we could couple the streaming information to an execution
resource, similar to what is done with cache levels to express this kind
of sharing.  We haven't found a need for it but that doesn't mean it
wouldn't be useful for other/new targets.

> PowerPC's dcbt/dcbtst instructions allow explicitly specifying to the
> hardware which streams it should establish. Do the buffer counts
> include both explicitly and automatically established streams? Do
> non-stream accesses (e.g. stack accesses) count towards them?

It's up to the target maintainer to decide what the numbers mean.
Obviously passes have to have some notion of what things mean.  The
thing that establishes what a "stream" is in the user program lives
outside of the system model.  It may or may not consider random stack
accesses as part of a stream.

This is definitely an area for exploration.  Since we only have machines
with two major targets, we didn't need to contend with more exotic
things.  :)

>>   class TargetMemorySystemInfo {
>>     const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
>>
>>     /// getNumLevels - Return the number of cache levels this target has.
>>     ///
>>     unsigned getNumLevels() const;
>>
>>     /// Cache level iterators
>>     ///
>>     cachelevel_iterator cachelevel_begin() const;
>>     cachelevel_iterator cachelevel_end() const;
>
> May users of this class assume that a level refers to a specific
> cache? E.g. getCacheLevel(0) being the L1 cache. Or do they have to
> search for a cache of a specific size?

The intent is that getCacheLevel(0) is the L1 cache, getCacheLevel(1) is
the L2 cache and so on.
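
So a pass could do, for example (assuming a getSize() accessor on
TargetCacheLevelInfo, which the posted sketch doesn't show):

  // Example: query the L1 cache capacity.
  const TargetMemorySystemInfo &MSI = ...; // from the subtarget
  if (MSI.getNumLevels() > 0) {
    const TargetCacheLevelInfo &L1 = MSI.getCacheLevel(0);
    uint64_t L1Bytes = L1.getSize(); // assumed accessor
  }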

>>     //===--------------------------------------------------------------------===//
>>     // Stream Buffer Information
>>     //
>>     const TargetStreamBufferInfo *getStreamBufferInfo() const;
>>
>>     //===--------------------------------------------------------------------===//
>>     // Software Prefetcher Information
>>     //
>>     const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
>
> Would it make sense to have one PrefetcherInfo/StreamBuffer per cache
> level? Some ISAs have multiple prefetchers/prefetch instructions
> for different levels.

Probably.  Most X86 implementations direct all data prefetches to the
same cache level so we didn't find a need to model this, but it makes
sense to allow for it.
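
The natural generalization would be a query keyed by level, something
like (hypothetical):

  // Hypothetical per-level variant; the posted proposal has only the
  // single, level-agnostic form.
  const TargetSoftwarePrefetcherInfo *
  getSoftwarePrefetcherInfo(unsigned CacheLevel) const;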

>>   class TargetExecutionResourceInfo {
>>     /// getContained - Return information about the contained execution
>>     /// resource.
>>     ///
>>     TargetExecutionResourceInfo *getContained() const;
>>
>>     /// getNumContained - Return the number of contained execution
>>     /// resources.
>>     ///
>>     unsigned getNumContained() const;
>
> Shouldn't the level itself specify how many resources of its kind
> there are, instead of its parent?
> This would make TargetExecutionEngineInfo::getNumResources() redundant.
>
> E.g. assume that "Socket" is the outermost resource level. The number
> of sockets in the system could be returned by its
> TargetExecutionResourceInfo instead of
> TargetExecutionEngineInfo::getNumResources().

I think that could work, and it would probably be a bit easier to
understand.
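
I.e. roughly this shape (a sketch of your suggestion; getNumInstances
is a made-up name):

  // Sketch: each resource reports how many of it exist per containing
  // resource, making TargetExecutionEngineInfo::getNumResources()
  // redundant.
  class TargetExecutionResourceInfo {
  public:
    unsigned getNumInstances() const; // e.g. 8 cores per socket
    TargetExecutionResourceInfo *getContained() const;
  };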

>>   };
>>
>> Each execution resource may *contain* other execution resources.  For
>> example, a socket may contain multiple cores and a core may contain
>> multiple hardware threads (e.g. SMT contexts).  An execution resource
>> may have cache levels associated with it.  If so, that cache level is
>> private to the execution resource.  For example the first-level cache
>> may be private to a core and shared by the threads within the core,
>> the second-level cache may be private to a socket and the third-level
>> cache may be shared by all sockets.
>
> Should there be an indicator whether a resource is shared or separate?
> E.g. SMT threads (and AMD "Modules") share functional units, but
> cores/sockets do not.

Interesting idea.  I suppose we could model that with another resource
type similar to the way caches are handled.  Then the resources could be
coupled to execution resources to express the sharing.  We hadn't found
a need for this level of detail in the work we've done but it could be
useful for lots of things.

>>   /// TargetExecutionEngineInfo base class - We assume that the target
>>   /// defines a static array of TargetExecutionResourceInfo objects that
>>   /// represent all of the execution resources that the target has.  As
>>   /// such, we simply have to track a pointer to this array.
>>   ///
>>   class TargetExecutionEngineInfo {
>>   public:
>>     typedef ... resource_iterator;
>>
>>     //===--------------------------------------------------------------------===//
>>     // Resource Information
>>     //
>>
>>     /// getResource - Get an execution resource by resource ID.
>>     ///
>>     const TargetExecutionResourceInfo &getResource(unsigned Resource) const;
>>
>>     /// getNumResources - Return the number of resources this target has.
>>     ///
>>     unsigned getNumResources() const;
>>
>>     /// Resource iterators
>>     ///
>>     resource_iterator resource_begin() const;
>>     resource_iterator resource_end() const;
>>   };
>>
>> The target execution engine allows optimizers to make intelligent
>> choices for cache optimization in the presence of parallelism, where
>> multiple threads may be competing for cache resources.
>
> Do you have examples of what optimizations make use of this
> information? It sounds like this info is more relevant to the OS
> scheduler than to the compiler.

Sure.  Cache blocking is one.  Let's assume an L2 cache shared among
cores.  Let's also assume the program is going to use threads within a
core.  You wouldn't want the compiler to cache block assuming the whole
size of L2, you'd want to cache block for some partition of L2 given the
execution resources the code is going to use.
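
In code, the heuristic is roughly this (a sketch; getSize() and the
thread count are assumed inputs):

  // Sketch: block for this thread's share of a cache level that is
  // shared by ThreadsUsed threads.  getSize() is an assumed accessor.
  uint64_t getBlockingBudget(const TargetCacheLevelInfo &L2,
                             unsigned ThreadsUsed) {
    return L2.getSize() / std::max(ThreadsUsed, 1u);
  }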

>> Currently the resource iterators will walk over all resources (cores,
>> threads, etc.).  Alternatively, we could say that iterators walk over
>> "top level" resources and contained resources must be accessed via
>> their containing resources.
>
> Most of the time programs are not compiled for specific system
> configurations (number of sockets, how many cores your processor has,
> or how many threads the OS allows the program to run). Meaning this
> information will usually be unknown at compile-time.
> What is the intention? Pass the system configuration as a flag to the
> compiler? Is it only available while JITing?

On our machines it is very common for customers to compile for specific
system configurations and we provide pre-canned compiler configurations
to make it convenient to do so.  Every 1% speedup matters in HPC.  :)

This certainly could be used in a JIT but that wasn't the motivation for
the design.

>> Here we see one of the flaws in the model.  Because of the way
>> ``Socket``, ``Module`` and ``Thread`` are defined above, we're forced
>> to include a ``Module`` level even though it really doesn't make sense
>> for our ShyEnigma processor.  A ``Core`` has two ``Thread`` resources,
>> a ``Module`` has one ``Core`` resource and a ``Socket`` has eight
>> ``Module`` resources.  In reality, a ShyEnigma core has two threads
>> and a ShyEnigma socket has eight cores.  At least for this SKU (more
>> on that below).
>
> Is this a restriction of TableGen? If the "Module" level is not
> required, could the SubtargetInfo just return Socket->Thread. Or is
> there a global requirement that every architecture has to define the
> same number of levels?

No, the number of levels isn't fixed.  The issue is the way that Socket
is defined:

  class Module<int numcores> : ExecutionResource<"Module", "Core", numcores>;
  class Socket<int nummodules> : ExecutionResource<"Socket", "Module", nummodules>;

It refers to "Module" by name.  The TableGen backend picks up on this
and connects the resources appropriately.  This is definitely something
that will need work as patches are developed.  It's possible that your
idea of e.g. shared function units above could capture this.
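
For example, if the containment chain weren't baked into the class
definitions, ShyEnigma could be described directly (hypothetical
TableGen, same flavor as above):

  // Hypothetical: let Socket name "Core" directly, skipping Module.
  class Core<int numthreads> : ExecutionResource<"Core", "Thread", numthreads>;
  class Socket<int numcores> : ExecutionResource<"Socket", "Core", numcores>;

  def ShyEnigmaSocket : Socket<8>; // Eight two-thread cores.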

>> An open question is how to handle different SKUs within a subtarget
>> family.  We modeled the limited number of SKUs used in our products
>> via multiple subtargets, so this wasn't a heavy burden for us, but a
>> more robust implementation might allow for multiple ``MemorySystem``
>> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
>> clear whether that's a good/necessary thing and if it is, how to
>> specify it with a compiler switch.  ``-mcpu=shy-enigma
>> -some-switch-to-specify-memory-and-execution-models``?  It may very
>> well be sufficient to have a general system model that applies
>> relatively well over multiple SKUs.
>
> Adding more specific subtargets with more refined execution models
> seems fine to me.
> But is it reasonable to manage a database of all processors ever
> produced in the compiler?

No it is not.  :)  That's why this is an open question.  We've found it
perfectly adequate to define a single model for each major processor
generation, but as I said we use a limited number of SKUs.  We will
need input from the community on this.

