[llvm-dev] RFC: System (cache, etc.) model for LLVM

Michael Kruse via llvm-dev llvm-dev at lists.llvm.org
Thu Nov 1 10:30:14 PDT 2018


Hi,

thank you for sharing the system hierarchy model. IMHO it makes a lot
of sense, although I don't know which of today's passes would make use
of it. Here are my remarks.

I am wondering how one could model the following features using this
model, or whether they should be part of a performance model at all
(a hypothetical sketch for two of them follows the list):

 * ARM's big.LITTLE

 * NUMA hierarchies (are the NUMA domains 'caches'?)

 * Total available RAM

 * remote memory (e.g. RAM on an accelerator mapped into the address space)

 * scratch pad

 * write-back / write-through write buffers

 * page size

 * TLB capacity

 * constructive/destructive interference
(https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size)
   Some architectures have instructions to zero entire cache lines,
e.g. dcbz on PowerPC, but they require the assumed cache-line size to
be correct.
Also see https://www.mono-project.com/news/2016/09/12/arm64-icache/

 * Instruction cache
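
For some of these, the extension might just be additional queries on
the proposed classes. A hypothetical sketch of what I mean for page
size and TLB capacity (all names invented, none of this is in the
RFC):

  class TargetMemorySystemInfo {
    // ... as proposed ...

    /// getPageSize - Return the virtual memory page size in bytes,
    /// or 0 if unknown at compile time.
    unsigned getPageSize() const;

    /// getDTLBEntries - Return the number of entries in the
    /// first-level data TLB, or 0 if unknown.
    unsigned getDTLBEntries() const;
  };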



On Tue, Oct 30, 2018 at 15:27, David Greene via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>   class TargetCacheLevelInfo {
>     /// getWays - Return the number of ways.
>     ///
>     unsigned getWays() const;

That is, associativity?

Bandwidth might be a useful addition, e.g. if a performance analysis
tool uses the roofline model.
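
To illustrate: with a hypothetical getBandwidth() returning sustained
bytes per cycle out of this level, a tool could compute the roofline
cap directly. This is only a sketch of the use case, not part of the
proposal:

  #include <algorithm>

  // Roofline model: attainable performance is capped either by peak
  // compute or by memory bandwidth times arithmetic intensity.
  double attainableFlopsPerCycle(const TargetCacheLevelInfo &Level,
                                 double PeakFlopsPerCycle,
                                 double FlopsPerByte) {
    // getBandwidth() is assumed here; it does not exist in the RFC.
    return std::min(PeakFlopsPerCycle,
                    Level.getBandwidth() * FlopsPerByte);
  }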



>   class TargetSoftwarePrefetcherInfo {
>     /// Should we do software prefetching at all?
>     ///
>     bool isEnabled() const;

isEnabled sounds like something configurable at runtime.


>     /// Provide a general prefetch distance hint.
>     ///
>     unsigned getDistance() const;
>
>     /// Prefetch at least this far ahead.
>     ///
>     unsigned getMinDistance() const;
>
>     /// Prefetch at most this far ahead.
>     ///
>     unsigned getMaxDistance() const;
>   };
>
> ``get*Distance`` APIs provide general hints to guide the software
> prefetcher.  The software prefetcher may choose to ignore them.
> getMinDistance and getMaxDistance act as clamps to ensure the software
> prefetcher doesn't do something wholly inappropriate.
>
> Distances are specified in terms of cache lines.  The current
> ``TargetTransformInfo`` interfaces speak in terms of instructions or
> iterations ahead.  Both can be useful and so we may want to add
> iteration and/or instruction distances to this interface.

Would it make sense to specify the prefetch distance in bytes instead
of cache lines? The cache-line size might not be known at compile
time (e.g. ARM big.LITTLE), but it might still make sense to do
software prefetching.
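
A byte-based distance could still be mapped back to lines whenever
the line size happens to be known, with a conservative fallback
otherwise. Rough sketch; getDistanceInBytes() is an invented
alternative to getDistance():

  unsigned prefetchDistanceInLines(const TargetSoftwarePrefetcherInfo &P,
                                   unsigned LineSizeInBytes) {
    // Fall back to a conservative guess when the subtarget cannot
    // know the line size (e.g. ARM big.LITTLE).
    if (LineSizeInBytes == 0)
      LineSizeInBytes = 64;
    // Round up so we never prefetch short of the requested distance.
    return (P.getDistanceInBytes() + LineSizeInBytes - 1) / LineSizeInBytes;
  }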




>   class TargetStreamBufferInfo {
>     /// getNumLoadBuffers - Return the number of load buffers available.
>     /// This is the number of simultaneously active independent load
>     /// streams the processor can handle before degrading performance.
>     ///
>     int getNumLoadBuffers() const;
>
>     /// getMaxNumLoadBuffers - Return the maximum number of load
>     /// streams that may be active before shutting off streaming
>     /// entirely.  -1 => no limit.
>     ///
>     int getMaxNumLoadBuffers();
>
>     /// getNumStoreBuffers - Return the effective number of store
>     /// buffers available.  This is the number of simultaneously
>     /// active independent store streams the processor can handle
>     /// before degrading performance.
>     ///
>     int getNumStoreBuffers() const;
>
>     /// getMaxNumStoreBuffers - Return the maximum number of store
>     /// streams that may be active before shutting off streaming
>     /// entirely.  -1 => no limit.
>     ///
>     int getMaxNumStoreBuffers() const;
>
>     /// getNumLoadStoreBuffers - Return the effective number of
>     /// buffers available for streams that both load and store data.
>     /// This is the number of simultaneously active independent
>     /// load-store streams the processor can handle before degrading
>     /// performance.
>     ///
>     int getNumLoadStoreBuffers() const;
>
>     /// getMaxNumLoadStoreBuffers - Return the maximum number of
>     /// load-store streams that may be active before shutting off
>     /// streaming entirely.  -1 => no limit.
>     ///
>     int getMaxNumLoadStoreBuffers() const;
>   };
>
> Code uses the ``getMax*Buffers`` APIs to judge whether streaming
> should be done at all.  For example, if the number of available
> streams greatly outweighs the hardware available, it makes little
> sense to do streaming.  Performance will be dominated by the streams
> that don't make use of the hardware and the streams that do make use
> of the hardware may actually perform worse.

What counts as a stream? Some processors may support streams with
strides and/or backward streams.

Is there a way to express at which level the number of streams is
shared? For instance, a core might be able to track 16 streams, but
if 4 threads are running (SMT), each can only use 4.

PowerPC's dcbt/dcbtst instructions allow explicitly telling the
hardware which streams it should establish. Do the buffer counts
include both explicitly and automatically established streams? Do
non-stream accesses (e.g. stack accesses) count towards these limits?


>   class TargetMemorySystemInfo {
>     const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;
>
>     /// getNumLevels - Return the number of cache levels this target has.
>     ///
>     unsigned getNumLevels() const;
>
>     /// Cache level iterators
>     ///
>     cachelevel_iterator cachelevel_begin() const;
>     cachelevel_iterator cachelevel_end() const;

May users of this class assume that a level refers to a specific
cache? E.g. getCacheLevel(0) being the L1 cache. Or do they have to
search for a cache of a specific size?
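
In code, the difference would be between these two variants (MemInfo
is a TargetMemorySystemInfo; getSizeInBytes() is an accessor I am
assuming for the example):

  // Variant 1: indices are meaningful; 0 is the L1 cache.
  const TargetCacheLevelInfo &L1 = MemInfo.getCacheLevel(0);

  // Variant 2: no ordering guarantee; search for the smallest
  // (presumably innermost) level.
  const TargetCacheLevelInfo *Innermost = nullptr;
  for (auto I = MemInfo.cachelevel_begin(), E = MemInfo.cachelevel_end();
       I != E; ++I)
    if (!Innermost || I->getSizeInBytes() < Innermost->getSizeInBytes())
      Innermost = &*I;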


>     //===--------------------------------------------------------------------===//
>     // Stream Buffer Information
>     //
>     const TargetStreamBufferInfo *getStreamBufferInfo() const;
>
>     //===--------------------------------------------------------------------===//
>     // Software Prefetcher Information
>     //
>     const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;

Would it make sense to have one PrefetcherInfo/StreamBufferInfo per
cache level? Some ISAs have multiple prefetchers/prefetch
instructions for different levels.
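
For instance, x86 has prefetcht0/t1/t2/prefetchnta hints that target
different levels of the hierarchy. A hypothetical per-level accessor
could look like this:

  /// Invented signature: tie the prefetcher description to a
  /// particular cache level.
  const TargetSoftwarePrefetcherInfo *
  getSoftwarePrefetcherInfo(unsigned CacheLevel) const;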


>   class TargetExecutionResourceInfo {
>     /// getContained - Return information about the contained execution
>     /// resource.
>     ///
>     TargetExecutionResourceInfo *getContained() const;
>
>     /// getNumContained - Return the number of contained execution
>     /// resources.
>     ///
>     unsigned getNumContained() const;

Shouldn't the level itself specify how many resources of its kind
there are, instead of its parent?
This would make TargetExecutionEngineInfo::getNumResources() redundant.

E.g. assume that "Socket" is the outermost resource level. The number
of sockets in the system could then be returned by its own
TargetExecutionResourceInfo instead of by
TargetExecutionEngineInfo::getNumResources().
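
That is, something along these lines (getNumInstances is an invented
name):

  class TargetExecutionResourceInfo {
    /// Number of instances of *this* resource within its container
    /// (or within the whole system for the outermost level, e.g. the
    /// number of sockets).
    unsigned getNumInstances() const;

    /// getContained - Return information about the contained
    /// execution resource, as in the RFC.
    TargetExecutionResourceInfo *getContained() const;
  };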


>   };
>
> Each execution resource may *contain* other execution resources.  For
> example, a socket may contain multiple cores and a core may contain
> multiple hardware threads (e.g. SMT contexts).  An execution resource
> may have cache levels associated with it.  If so, that cache level is
> private to the execution resource.  For example the first-level cache
> may be private to a core and shared by the threads within the core,
> the second-level cache may be private to a socket and the third-level
> cache may be shared by all sockets.

Should there be an indicator whether a resource is shared or separate?
E.g. SMT threads (and AMD "Modules") share functional units, whereas
cores/sockets do not.
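
E.g. a flag like the following (invented name):

  /// Return true if the contained resources compete for this
  /// resource's functional units (SMT threads, AMD "Module"
  /// siblings); false if they are fully separate (cores, sockets).
  bool contendForFunctionalUnits() const;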


>   /// TargetExecutionEngineInfo base class - We assume that the target
>   /// defines a static array of TargetExecutionResourceInfo objects that
>   /// represent all of the execution resources that the target has.  As
>   /// such, we simply have to track a pointer to this array.
>   ///
>   class TargetExecutionEngineInfo {
>   public:
>     typedef ... resource_iterator;
>
>     //===--------------------------------------------------------------------===//
>     // Resource Information
>     //
>
>     /// getResource - Get an execution resource by resource ID.
>     ///
>     const TargetExecutionResourceInfo &getResource(unsigned Resource) const;
>
>     /// getNumResources - Return the number of resources this target has.
>     ///
>     unsigned getNumResources() const;
>
>     /// Resource iterators
>     ///
>     resource_iterator resource_begin() const;
>     resource_iterator resource_end() const;
>   };
>
> The target execution engine allows optimizers to make intelligent
> choices for cache optimization in the presence of parallelism, where
> multiple threads may be competing for cache resources.

Do you have examples of which optimizations would make use of this
information? It sounds like this info is more relevant to the OS
scheduler than to the compiler.

> Currently the resource iterators will walk over all resources (cores,
> threads, etc.).  Alternatively, we could say that iterators walk over
> "top level" resources and contained resources must be accessed via
> their containing resources.

Most of the time, programs are not compiled for a specific system
configuration (number of sockets, how many cores the processor has,
or how many threads the OS allows the program to use), meaning this
information will usually be unknown at compile time.
What is the intention? Pass the system configuration as a flag to the
compiler? Is it only available while JITing?



> Here we see one of the flaws in the model.  Because of the way
> ``Socket``, ``Module`` and ``Thread`` are defined above, we're forced
> to include a ``Module`` level even though it really doesn't make sense
> for our ShyEnigma processor.  A ``Core`` has two ``Thread`` resources,
> a ``Module`` has one ``Core`` resource and a ``Socket`` has eight
> ``Module`` resources.  In reality, a ShyEnigma core has two threads
> and a ShyEnigma socket has eight cores.  At least for this SKU (more
> on that below).

Is this a restriction of TableGen? If the "Module" level is not
required, could the SubtargetInfo just return a Socket->Core->Thread
hierarchy? Or is there a global requirement that every architecture
has to define the same number of levels?


> An open question is how to handle different SKUs within a subtarget
> family.  We modeled the limited number of SKUs used in our products
> via multiple subtargets, so this wasn't a heavy burden for us, but a
> more robust implementation might allow for multiple ``MemorySystem``
> and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
> clear whether that's a good/necessary thing and if it is, how to
> specify it with a compiler switch.  ``-mcpu=shy-enigma
> -some-switch-to-specify-memory-and-execution-models``?  It may very
> well be sufficient to have a general system model that applies
> relatively well over multiple SKUs.

Adding more specific subtargets with more refined execution models
seems fine to me.
But is it reasonable to maintain a database of all processors ever
produced in the compiler?



Michael

