[llvm-dev] RFC: System (cache, etc.) model for LLVM

David Greene via llvm-dev llvm-dev at lists.llvm.org
Tue Oct 30 13:26:45 PDT 2018


Hi everyone,

During the loop optimization roundtable at this year's Bay Area
conference, Hal noted that even the most basic cost modeling is not
possible due to the lack of cache models within LLVM.  This topic has
been raised many times in the past.  Cray is now in a position where we
can contribute the system model we've been using successfully in our
compiler for many years.  This RFC describes the current design with a
few changes to update it to existing ToT interfaces.

We anticipate lots of feedback and suggested changes.  This RFC is a
strawman proposal to get the conversation started.  We would not try
to submit this all at once but would rather do so via incremental
patches.  It seems most helpful to everyone to hash out a design ahead
of time rather than submitting patches without a clear understanding
of where things are going.

============================
Target system model for LLVM
============================

Abstract
========
LLVM currently has an ad-hoc method for specifying target system
parameters such as cache characteristics and software prefetching
configurations, with ``TargetTransformInfo`` interfaces to get at
them.  The following proposes to abstract such parameters into a
general target system model described by TableGen classes and easily
specified by target maintainers.

Design goals
============
- Describe target cache architectures sufficiently to allow loop-based
  cache optimization

- Describe target-specific prefetch heuristics

- Abstract target hardware memory buffers to drive streaming
  heuristics (use of non-temporal operations, etc.)

- Make the models relatively easy to implement for multiple targets

API design
==========
Six classes describe the system model, each of which is detailed
below.  The proposal elides implementation details, concentrating
instead on the developer API.  As we refine this and post patches
we'll of course have plenty of opportunity to discuss implementation.

- ``TargetCacheLevelInfo``
- ``TargetSoftwarePrefetcherInfo``
- ``TargetStreamBufferInfo``
- ``TargetExecutionResourceInfo``
- ``TargetExecutionEngineInfo``
- ``TargetMemorySystemInfo``

Existing interfaces in ``TargetTransformInfo`` (``getCacheLineSize``,
etc.) can be directed to the interfaces below.  Because the model is
expressed as TableGen classes (explained below), the existing
hard-coded values in e.g. ``AArch64Subtarget::initializeProperties``
can be "lifted" into a (hopefully) more flexible and
easier-to-maintain framework.

Memory system model
-------------------

``TargetCacheLevelInfo`` provides a view of a single level of cache.  It
has the following interface::

  /// TargetCacheLevelInfo - Provide information about a specific level
  /// in the cache (size, associativity, etc.).
  ///
  class TargetCacheLevelInfo {
    [...]

  public:
    /// getID() - Return the cache level ID number.
    ///
    unsigned getID() const;

    /// getName() - Return the cache level name for debugging.
    ///
    const char *getName() const;

    /// getSize - Return the size of the cache level in bytes.
    ///
    unsigned getSize() const;

    /// getLineSize - Return the size of the cache line in bytes.
    ///
    unsigned getLineSize() const;

    /// getWays - Return the number of ways.
    ///
    unsigned getWays() const;

    /// getLatency - Return the expected latency of a load that hits in
    /// this level of cache.
    ///
    unsigned getLatency() const;
  };
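
As a purely illustrative example of the kind of loop optimization this
interface enables, the sketch below derives a tile edge for cache
blocking from a level's capacity.  The ``CacheLevel`` struct and
``tileEdgeFor`` helper are hypothetical stand-ins for querying
``TargetCacheLevelInfo``; they are not part of the proposed API.

```cpp
#include <cassert>

// Hypothetical stand-in for the proposed TargetCacheLevelInfo view;
// a real client would read these values through the target APIs.
struct CacheLevel {
  unsigned Size;     // cache capacity in bytes
  unsigned LineSize; // line size in bytes
  unsigned Ways;     // associativity
};

// Pick a square tile edge (in elements) for a blocked loop nest such
// that two tiles of ElemSize-byte elements occupy at most half of the
// given cache level, leaving room for other working data.
unsigned tileEdgeFor(const CacheLevel &L, unsigned ElemSize) {
  unsigned Budget = L.Size / 2;             // bytes reserved for tiles
  unsigned Elems = Budget / (2 * ElemSize); // elements per tile
  unsigned Edge = 1;
  while ((Edge + 1) * (Edge + 1) <= Elems)
    ++Edge;
  return Edge;
}
```

For instance, a 32K level with 8-byte elements yields a 32-element
tile edge: two 32x32 tiles of doubles occupy exactly half the cache.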

``TargetSoftwarePrefetcherInfo`` holds target-specific prefetching
heuristics::

  /// TargetSoftwarePrefetcherInfo - Provide information about how to
  /// configure the software prefetcher.
  ///
  class TargetSoftwarePrefetcherInfo {
    [...]
  public:
    /// getID() - Return the prefetcher ID number.
    ///
    unsigned getID() const;

    /// getName() - Return the prefetcher name for debugging.
    ///
    const char *getName() const;

    /// Should we do software prefetching at all?
    ///
    bool isEnabled() const;

    /// Provide a general prefetch distance hint.
    ///
    unsigned getDistance() const;

    /// Prefetch at least this far ahead.
    ///
    unsigned getMinDistance() const;

    /// Prefetch at most this far ahead.
    ///
    unsigned getMaxDistance() const;
  };

The ``get*Distance`` APIs provide general hints to guide the software
prefetcher, which may choose to ignore them.  ``getMinDistance`` and
``getMaxDistance`` act as clamps to ensure the software prefetcher
doesn't do something wholly inappropriate.

Distances are specified in terms of cache lines.  The current
``TargetTransformInfo`` interfaces speak in terms of instructions or
iterations ahead.  Both can be useful and so we may want to add
iteration and/or instruction distances to this interface.
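
A hedged sketch of how a prefetching pass might combine these hints:
start from its own computed distance (in cache lines) and clamp it to
the advertised range.  The struct and function names here are
illustrative, not the proposed API.

```cpp
#include <cassert>

// Illustrative mirror of the TargetSoftwarePrefetcherInfo hints.
struct PrefetcherHints {
  bool Enabled;          // should we prefetch at all?
  unsigned Distance;     // general hint, in cache lines
  unsigned MinDistance;  // prefetch at least this far ahead
  unsigned MaxDistance;  // prefetch at most this far ahead (0 => no cap)
};

// Combine the pass's own computed distance with the target hints.
// A computed distance of 0 means "no opinion", so fall back to the
// general hint, then clamp to [MinDistance, MaxDistance].
unsigned choosePrefetchDistance(const PrefetcherHints &H, unsigned Computed) {
  if (!H.Enabled)
    return 0;
  unsigned D = Computed ? Computed : H.Distance;
  if (D < H.MinDistance)
    D = H.MinDistance;
  if (H.MaxDistance && D > H.MaxDistance)
    D = H.MaxDistance;
  return D;
}
```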

``TargetStreamBufferInfo`` describes target hardware for memory
streaming::

  class TargetStreamBufferInfo {
    [...]

  public:
    /// getID() - Return the stream buffer ID number.
    ///
    unsigned getID() const;

    /// getName() - Return the stream buffer name for debugging.
    ///
    const char *getName() const;

    /// getNumLoadBuffers - Return the number of load buffers available.
    /// This is the number of simultaneously active independent load
    /// streams the processor can handle before degrading performance.
    ///
    int getNumLoadBuffers() const;

    /// getMaxNumLoadBuffers - Return the maximum number of load
    /// streams that may be active before shutting off streaming
    /// entirely.  -1 => no limit.
    ///
    int getMaxNumLoadBuffers() const;

    /// getNumStoreBuffers - Return the effective number of store
    /// buffers available.  This is the number of simultaneously
    /// active independent store streams the processor can handle
    /// before degrading performance.
    ///
    int getNumStoreBuffers() const;

    /// getMaxNumStoreBuffers - Return the maximum number of store
    /// streams that may be active before shutting off streaming
    /// entirely.  -1 => no limit.
    ///
    int getMaxNumStoreBuffers() const;

    /// getNumLoadStoreBuffers - Return the effective number of
    /// buffers available for streams that both load and store data.
    /// This is the number of simultaneously active independent
    /// load-store streams the processor can handle before degrading
    /// performance.
    ///
    int getNumLoadStoreBuffers() const;

    /// getMaxNumLoadStoreBuffers - Return the maximum number of
    /// load-store streams that may be active before shutting off
    /// streaming entirely.  -1 => no limit.
    ///
    int getMaxNumLoadStoreBuffers() const;
  };

Code uses the ``getMax*Buffers`` APIs to judge whether streaming
should be done at all.  For example, if the number of streams a loop
generates greatly outweighs the available hardware buffers, it makes
little sense to stream: performance will be dominated by the streams
that don't map to hardware buffers, and the streams that do may
actually perform worse.
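
To make that judgment concrete, here is a hedged sketch of how a pass
might decide whether to emit non-temporal stores given the loop's
store-stream count.  The function and its conservative policy are
illustrative assumptions, not part of the proposal; -1 mirrors the
"no limit" convention above.

```cpp
#include <cassert>

// Decide whether to use streaming (non-temporal) stores for a loop
// that generates LoopStoreStreams independent store streams, given
// the target's effective and maximum store-buffer counts.
bool shouldStreamStores(int LoopStoreStreams, int NumStoreBuffers,
                        int MaxNumStoreBuffers) {
  // Beyond the hard cap, shut streaming off entirely (-1 => no cap).
  if (MaxNumStoreBuffers >= 0 && LoopStoreStreams > MaxNumStoreBuffers)
    return false;
  // Conservative policy: only stream when every stream can be backed
  // by a hardware buffer.  A real heuristic might tolerate some
  // oversubscription between the effective and maximum counts.
  return LoopStoreStreams <= NumStoreBuffers;
}
```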

Class ``TargetMemorySystemInfo`` aggregates all of the information
about the target memory system::

  /// TargetMemorySystemInfo - We assume that the target defines a
  /// static array of TargetCacheLevelInfo objects that represent all of
  /// the cache levels that the target has.  As such, we simply have
  /// to track a pointer to this array.
  ///
  class TargetMemorySystemInfo {
    [...]
  public:
    typedef ... cachelevel_iterator;

    /// getID() - Return the memory system ID number.
    ///
    unsigned getID() const;

    /// getName() - Return the memory system name for debugging.
    ///
    const char *getName() const;

    //===--------------------------------------------------------------------===//
    // Cache Level Information
    //

    /// Get the information for cache level Level (0-based).
    ///
    const TargetCacheLevelInfo &getCacheLevel(unsigned Level) const;

    /// getNumLevels - Return the number of cache levels this target has.
    ///
    unsigned getNumLevels() const;

    /// Cache level iterators
    ///
    cachelevel_iterator cachelevel_begin() const;
    cachelevel_iterator cachelevel_end() const;

    //===--------------------------------------------------------------------===//
    // Stream Buffer Information
    //
    const TargetStreamBufferInfo *getStreamBufferInfo() const;

    //===--------------------------------------------------------------------===//
    // Software Prefetcher Information
    //
    const TargetSoftwarePrefetcherInfo *getSoftwarePrefetcherInfo() const;
  };

Execution engine model
----------------------
The memory model exists in conjunction with an execution engine model.
The execution engine currently describes how cache levels map to
execution resources.  It could hold more information outside the
memory system model (e.g. scheduling models) but currently does not.
It could become the place where information behind existing
``TargetTransformInfo`` interfaces like ``getMaxInterleaveFactor``
lives.

Class ``TargetExecutionResourceInfo`` provides information about a
*unit of execution*::

  /// TargetExecutionResourceInfo - Provide information about a specific
  /// kind of execution resource (core, thread, etc.).
  ///
  class TargetExecutionResourceInfo {
    [...]
  public:
    typedef ... cachelevel_iterator;

    /// getID() - Return the resource ID number.
    ///
    unsigned getID() const;

    /// getName() - Return the resource name for debugging.
    ///
    const char *getName() const;

    /// getContained - Return information about the contained execution
    /// resource.
    ///
    TargetExecutionResourceInfo *getContained() const;

    /// getNumContained - Return the number of contained execution
    /// resources.
    ///
    unsigned getNumContained() const;

    /// Iterate over the levels of cache private to this resource.
    ///
    cachelevel_iterator cachelevel_begin() const;
    cachelevel_iterator cachelevel_end() const;
  };

Each execution resource may *contain* other execution resources.  For
example, a socket may contain multiple cores and a core may contain
multiple hardware threads (e.g. SMT contexts).  An execution resource
may also have cache levels associated with it.  If so, those cache
levels are private to the execution resource.  For example, the
first-level cache may be private to a core and shared by the threads
within the core, the second-level cache may be private to a socket and
the third-level cache may be shared by all sockets.
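
As a hypothetical illustration of why this containment structure
matters for cost modeling, the sketch below divides a resource's
private cache among the units it contains, giving the effective
capacity each competing thread sees.  The ``Resource`` struct is a
stand-in for the proposed ``TargetExecutionResourceInfo``, not its
actual interface.

```cpp
#include <cassert>

// Minimal stand-in for an execution resource with a private cache.
struct Resource {
  unsigned NumContained;      // e.g. threads contained in a core
  unsigned PrivateCacheBytes; // cache private to this resource (0 if none)
};

// Effective cache capacity seen by one contained unit when all of its
// siblings are active and competing for the shared private cache.
// Leaf resources (nothing contained) keep the whole capacity.
unsigned effectiveCapacityPerUnit(const Resource &R) {
  if (R.NumContained == 0)
    return R.PrivateCacheBytes;
  return R.PrivateCacheBytes / R.NumContained;
}
```

A cache-blocking pass could feed this reduced capacity into its tile
size computation when it knows sibling threads will run concurrently.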

Class ``TargetExecutionEngineInfo`` aggregates execution resources
into a unified view of the target::

  /// TargetExecutionEngineInfo base class - We assume that the target
  /// defines a static array of TargetExecutionResourceInfo objects that
  /// represent all of the execution resources that the target has.  As
  /// such, we simply have to track a pointer to this array.
  ///
  class TargetExecutionEngineInfo {
  public:
    typedef ... resource_iterator;

    //===--------------------------------------------------------------------===//
    // Resource Information
    //

    /// getResource - Get an execution resource by resource ID.
    ///
    const TargetExecutionResourceInfo &getResource(unsigned Resource) const;

    /// getNumResources - Return the number of resources this target has.
    ///
    unsigned getNumResources() const;

    /// Resource iterators
    ///
    resource_iterator resource_begin() const;
    resource_iterator resource_end() const;
  };

The target execution engine allows optimizers to make intelligent
choices for cache optimization in the presence of parallelism, where
multiple threads may be competing for cache resources.

Currently the resource iterators will walk over all resources (cores,
threads, etc.).  Alternatively, we could say that iterators walk over
"top level" resources and contained resources must be accessed via
their containing resources.

Subtarget APIs
--------------
Memory system and execution engine information are accessed through
new ``TargetSubtarget`` methods::

  /// getMemorySystemInfo - If memory system information is available,
  /// return it.  If not, return null.  Memory information can include
  /// cache hierarchy organization as well as more exotic things.
  ///
  const TargetMemorySystemInfo *getMemorySystemInfo() const {
    return MemoryInfo;
  }

  /// getExecutionEngineInfo - If execution engine information is
  /// available, return it.  If not, return null.
  ///
  const TargetExecutionEngineInfo *getExecutionEngineInfo() const {
    return ExecutionEngineInfo;
  }

Target model implementation
===========================
In order to make describing a target memory model (relatively)
straightforward, we modified TableGen's ``SubtargetEmitter`` to allow
high-level memory model and execution resource descriptions.

Each target model inherits from a set of base TableGen classes.

Memory model base classes
-------------------------
A number of base TableGen classes exist which may be composed into
memory system descriptions::

  //===----------------------------------------------------------------------===//
  // Int - Give a name to an integer value for clarity.
  //===----------------------------------------------------------------------===//
  class Int<int value> {
    int Value = value;
  }

  //===----------------------------------------------------------------------===//
  // Cache Level - This represents a specific level within the cache hierarchy.
  //===----------------------------------------------------------------------===//
  class CacheLevel<string name, int size, int linesize, int ways, int latency> {
    string Name = name;
    int Size = size;
    int LineSize = linesize;
    int Ways = ways;
    int Latency = latency;
  }

  //===----------------------------------------------------------------------===//
  // Cache hierarchy - This models a specific cache hierarchy.
  //===----------------------------------------------------------------------===//
  class CacheHierarchy<list<CacheLevel> levels> {
    list<CacheLevel> Levels = levels;
  }

  def NoCaches : CacheHierarchy<[]>;

  //===----------------------------------------------------------------------===//
  // Common cache sizes
  //===----------------------------------------------------------------------===//
  def _1K   : Int<1024>;
  def _16K  : Int<16384>;
  def _32K  : Int<32768>;
  def _64K  : Int<65536>;
  def _128K : Int<131072>;
  def _256K : Int<262144>;
  def _512K : Int<524288>;
  def _1M   : Int<1048576>;
  def _2M   : Int<2097152>;
  def _4M   : Int<4194304>;
  def _6M   : Int<6291456>;
  def _8M   : Int<8388608>;
  def _12M  : Int<12582912>;
  def _16M  : Int<16777216>;
  def _20M  : Int<20971520>;
  def _25M  : Int<26214400>;
  def _32M  : Int<33554432>;
  def _40M  : Int<41943040>;

  class L1Cache<int size, int linesize, int ways, int latency> :
    CacheLevel<"L1", size, linesize, ways, latency>;
  class L2Cache<int size, int linesize, int ways, int latency> :
    CacheLevel<"L2", size, linesize, ways, latency>;
  class L3Cache<int size, int linesize, int ways, int latency> :
    CacheLevel<"L3", size, linesize, ways, latency>;

  //===----------------------------------------------------------------------===//
  // StreamBuffer - This models streaming (non-temporal) resources
  //
  // Load              - The number of load buffers available.
  // MaxLoad           - The maximum number of active load streams before
  //                     streaming is shut off entirely (-1 => no limit).
  // Store             - The number of store buffers available.
  // MaxStore          - The maximum number of active store streams before
  //                     streaming is shut off entirely (-1 => no limit).
  // LoadStore         - The number of load-store buffers available.
  // MaxLoadStore      - The maximum number of active load-store streams before
  //                     streaming is shut off entirely (-1 => no limit).
  //===----------------------------------------------------------------------===//
  def Infinite : Int<-1>;

  class StreamBuffer<int l, int ml,
                     int s, int ms,
                     int ls, int mls> {
    int Load = l;
    int MaxLoad = ml;
    int Store = s;
    int MaxStore = ms;
    int LoadStore = ls;
    int MaxLoadStore = mls;
  }

  def DefaultStreamBuffers : StreamBuffer<0, Infinite.Value,  // Load
                                          0, Infinite.Value,  // Store
                                          0, Infinite.Value>; // Load-Store

  //===----------------------------------------------------------------------===//
  // SoftwarePrefetcher - This provides parameters to software prefetching.
  //===----------------------------------------------------------------------===//
  class SoftwarePrefetcher<int e, int d, int nd, int xd> {
    int Enabled = e;
    int Distance = d;
    int MinDistance = nd;
    int MaxDistance = xd;
  }

  def DefaultSoftwarePrefetcher : SoftwarePrefetcher<0, 0, 0, 0>;

  //===----------------------------------------------------------------------===//
  // MemorySystem - This models a memory subsystem.
  //===----------------------------------------------------------------------===//
  class MemorySystem<CacheHierarchy c, StreamBuffer s, SoftwarePrefetcher p> {
    CacheHierarchy Caches = c;
    StreamBuffer StreamBuffers = s;
    SoftwarePrefetcher Prefetcher = p;
  }

  def DefaultMemorySystem : MemorySystem<NoCaches, DefaultStreamBuffers,
                                         DefaultSoftwarePrefetcher>;

Execution engine base classes
-----------------------------
Similarly, classes exist for describing and composing execution
resources.

The ``Module`` execution resource below exists for AMD's Interlagos,
which is a bit of an oddball in that independent integer units drive a
shared floating-point unit.  A ``Module`` in that case captures the two
integer units + shared FP unit while a ``Core`` models the independent
integer units.  The community may not care about this distinction or
it may be a useful abstraction for various targets.

::

  //===----------------------------------------------------------------------===//
  // Execution resource - This models a particular group of execution resources
  // (threads, cores, etc.).
  //===----------------------------------------------------------------------===//
  class ExecutionResource<string name, string contained, int num> {
    string Name = name;
    string Contained = contained;
    int NumContained = num;
  }

  //===----------------------------------------------------------------------===//
  // Common execution resources
  //===----------------------------------------------------------------------===//
  class Thread : ExecutionResource<"Thread", "", 0>;
  class Core<int numthreads> : ExecutionResource<"Core", "Thread", numthreads>;
  class Module<int numcores> : ExecutionResource<"Module", "Core", numcores>;
  class Socket<int nummodules> : ExecutionResource<"Socket", "Module", nummodules>;

  //===----------------------------------------------------------------------===//
  // Execution module - This couples an ExecutionResource with CacheLevels.
  // The given cache levels are all private to the ExecutionResource and
  // shared by any contained ExecutionResources.
  //===----------------------------------------------------------------------===//
  class ExecutionModule<ExecutionResource resource, list<CacheLevel> cache> {
    ExecutionResource Resource = resource;
    list<CacheLevel> Cache = cache;
  }

  //===----------------------------------------------------------------------===//
  // Execution engine - This models a collection of execution modules.
  //===----------------------------------------------------------------------===//
  class ExecutionEngine<list<ExecutionModule> resources> {
    list<ExecutionModule> Resources = resources;
  }

  def NoExecutionEngine : ExecutionEngine<[]>;

Example target model
--------------------
As an example, here is a definition of a memory and execution resource
model for a fictitious processor, modeled after how the X86 targets
are organized::

  //===-------------------------
  // ShyEnigma execution engine

  // Give an unlimited number of load buffers since the hardware
  // simply sets hint bits in the cache.  There are 4 hardware store
  // streaming buffers.  Set load-store buffer counts to store
  // buffer counts since that's the limiting resource.
  //
  def ShyEnigmaStreamBuffers : StreamBuffer<Infinite.Value, Infinite.Value,  // Load
                                            4,              Infinite.Value,  // Store
                                            4,              Infinite.Value>; // Load-Store

  // Enable data prefetching.  Generally prefetch 8 lines ahead but never
  // more than 32 lines ahead.
  //
  def ShyEnigmaSoftwarePrefetcher : SoftwarePrefetcher<1, 8, 0, 32>;

  // Cache lines are 64 bytes with latencies of 1, 9 and 11 cycles.
  //
  def ShyEnigmaCacheHierarchy : CacheHierarchy<[
    L1Cache<_16K.Value, 64, 4, 1>,
    L2Cache<_32K.Value, 64, 8, 9>,
    L3Cache<_12M.Value, 64, 10, 11>
  ]>;

  def ShyEnigmaMemorySystem : MemorySystem<
    ShyEnigmaCacheHierarchy, ShyEnigmaStreamBuffers, ShyEnigmaSoftwarePrefetcher
  >;

  // Cache levels 1 and 2 are private to a core and shared by threads in the core.
  // Cache level 3 is shared by all cores.
  //
  // Each ShyEnigma socket has 8 cores with 2 threads per core.
  //
  def ShyEnigmaExecutionEngine : ExecutionEngine<[
    ExecutionModule<Core<2>,   [ShyEnigmaCacheHierarchy.Levels[0],
                                ShyEnigmaCacheHierarchy.Levels[1]]>,
    ExecutionModule<Module<1>, []>, // FIXME: Module and Core definitions above
                                    // force this though it's not needed.
    ExecutionModule<Socket<8>, [ShyEnigmaCacheHierarchy.Levels[2]]>
  ]>;

Here we see one of the flaws in the model.  Because of the way
``Socket``, ``Module`` and ``Thread`` are defined above, we're forced
to include a ``Module`` level even though it really doesn't make sense
for our ShyEnigma processor.  A ``Core`` has two ``Thread`` resources,
a ``Module`` has one ``Core`` resource and a ``Socket`` has eight
``Module`` resources.  In reality, a ShyEnigma core has two threads
and a ShyEnigma socket has eight cores.  At least for this SKU (more
on that below).

Modified ``Processor``, ``ProcessorModel`` and ``ProcModel`` classes
tie this into the rest of the target description information (again,
following X86 target conventions)::

  class Processor<string n,
                  ProcessorItineraries pi,
                  list<SubtargetFeature> f,
                  MemorySystem m = DefaultMemorySystem,
                  ExecutionEngine e = NoExecutionEngine> {
    // Name - Chip set name.  Used by command line (-mcpu=) to determine the
    // appropriate target chip.
    //
    string Name = n;

    // SchedModel - The machine model for scheduling and instruction cost.
    //
    SchedMachineModel SchedModel = NoSchedModel;

    // ProcItin - The scheduling information for the target processor.
    //
    ProcessorItineraries ProcItin = pi;

    // Features - List of subtarget features.
    //
    list<SubtargetFeature> Features = f;

    // Memory - Information about the memory system.
    //
    MemorySystem Memory = m;

    // Engine - A list of execution resources describing various execution
    // contexts (threads, cores, etc.) and associated caches.
    //
    ExecutionEngine Engine = e;
  }

  class ProcessorModel<string n,
                       SchedMachineModel m,
                       list<SubtargetFeature> f,
                       MemorySystem mem = DefaultMemorySystem,
                       ExecutionEngine e = NoExecutionEngine>
      : Processor<n, NoItineraries, f, mem, e> {
    let SchedModel = m;
  }

  class ProcModel<string Name, SchedMachineModel Model,
                  list<SubtargetFeature> ProcFeatures,
                  list<SubtargetFeature> OtherFeatures,
                  MemorySystem MemoryModel,
                  ExecutionEngine EngineModel> :
    ProcessorModel<Name,
                   Model,
                   !listconcat(ProcFeatures, OtherFeatures),
                   MemoryModel,
                   EngineModel>;

  class ShyEnigmaProc<string Name> : ProcModel<Name, ShyEnigmaSchedModel,
                                               ShyEnigmaFeatures.Value, [
    ProcShyEnigma,
    FeatureHasWarpSpeed,
    FeatureHasNoStalls
  ],
                                                ShyEnigmaMemorySystem,
                                                ShyEnigmaExecutionEngine>;

  def : ShyEnigmaProc<"shy-enigma">;

An open question is how to handle different SKUs within a subtarget
family.  We modeled the limited number of SKUs used in our products
via multiple subtargets, so this wasn't a heavy burden for us, but a
more robust implementation might allow for multiple ``MemorySystem``
and/or ``ExecutionEngine`` models for a given subtarget.  It's not yet
clear whether that's a good/necessary thing and if it is, how to
specify it with a compiler switch.  ``-mcpu=shy-enigma
-some-switch-to-specify-memory-and-execution-models``?  It may very
well be sufficient to have a general system model that applies
relatively well over multiple SKUs.

TableGen changes
================
A modified TableGen subtarget emitter understands the execution engine
base classes and knows how to emit lookup tables for target resources.
There is an ``Emit`` method for each type of resource table
(``EmitCacheHierarchies``, ``EmitStreamBuffers``, ``EmitPrefetchers``,
``EmitMemorySystems``, ``EmitExecutionEngines``).

We also enhanced ``EmitProcessorLookup`` to handle the mapping of
``Processor`` descriptions to resource models.

Client usage
============
A pass can get at the memory model via ``TargetSubtarget``, via
existing ``TargetTransformInfo`` interfaces, via new
``TargetTransformInfo`` interfaces that forward to the
``TargetSubtarget`` APIs, and via new interfaces the system model may
provide in the future.

