[llvm-dev] RFC: System (cache, etc.) model for LLVM

Renato Golin via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 5 09:08:03 PST 2018


On Mon, 5 Nov 2018 at 15:56, David Greene <dag at cray.com> wrote:
> The cache interfaces are flexible enough to allow passes to answer
> questions like, "how much effective cache is available for this core
> (thread, etc.)?"  That's a critical question to reason about the
> thrashing behavior you mentioned above.
>
> Knowing the cache line size is important for prefetching and various
> other memory operations such as streaming.
>
> Knowing the number of ways can allow one to guesstimate which memory
> accesses are likely to collide in the cache.
>
> It also happens that all of these parameters are useful for simulation
> purposes, which may help projects like llvm-mca.

I see.

So, IIGIR, initially this would consolidate the prefetching
infrastructure, which is a worthy goal in itself and would require
only a minimal implementation for now.
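For what it's worth, TargetTransformInfo already has hooks along these
lines (getCacheLineSize, getCacheSize, getCacheAssociativity,
getPrefetchDistance), so just to make the questions quoted above
concrete, here's a rough sketch of the kind of query a pass would make
(an illustration only, not the interface being proposed):

  #include "llvm/Analysis/TargetTransformInfo.h"

  using namespace llvm;

  // Illustration only: the kind of cache questions described above.
  // Targets that don't describe their caches just return 0 / nothing.
  static void queryCacheParams(const TargetTransformInfo &TTI) {
    unsigned LineSize = TTI.getCacheLineSize();    // bytes, e.g. 64
    unsigned PrefDist = TTI.getPrefetchDistance(); // how far ahead to prefetch

    // L1 data cache geometry, if the target provides it.
    auto Size  = TTI.getCacheSize(TargetTransformInfo::CacheLevel::L1D);
    auto Assoc = TTI.getCacheAssociativity(TargetTransformInfo::CacheLevel::L1D);
    if (Size && Assoc && *Assoc && LineSize) {
      // With S sets and W ways, more than W hot lines mapping to the
      // same set evict each other: the "which accesses are likely to
      // collide" question.
      unsigned NumSets = *Size / (*Assoc * LineSize);
      (void)NumSets;
    }
    (void)PrefDist;
  }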

But later, vectorisers could use that info, for example, to decide how
far it would be beneficial to unroll vectorised loops (where the total
access size should be a multiple of the cache line size), etc.
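As a toy example of that calculation (made-up numbers, assuming a
64-byte line and 128-bit vector registers, not taken from any real
target description):

  #include <cstdio>

  int main() {
    unsigned CacheLineBytes = 64; // would come from the cache model
    unsigned VectorRegBytes = 16; // e.g. 128-bit NEON/SSE registers

    // Unrolling by 64/16 = 4 makes one iteration's contiguous accesses
    // for each stream cover exactly one cache line, so prefetched
    // lines are fully used.
    unsigned Interleave = CacheLineBytes / VectorRegBytes;
    printf("interleave factor = %u\n", Interleave);
    return 0;
  }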

Ultimately, simulations would be an interesting use of it, but
shouldn't be a driving force for additional features bundled into the
initial design.


> I'm not quite grasping this.  Are you saying that a particular subtarget
> may have multiple "clusters" of big.LITTLE cores and that each cluster
> may look different from the others?

Yeah, "big.LITTLE" [1] is a marketing name and can mean a bunch of
different scenarios.

For example:
 - Paired big+little cores, where each big+little pair is seen by the
kernel as a single core but is actually two separate cores, with the
kernel switching between them via frequency scaling.
 - Two entirely separate clusters, with execution flipped between all
big or all little.
 - A heterogeneous mix, which can have different numbers of big and
little cores with no need of cache coherence between them. Junos have
two big and four little, Tegras have one little and four big. There
are also other designs with dozens of huge cores plus a tiny core for
management purposes.

But it gets worse: different releases of the same family can have
different core counts or change the model (clustered/bundled/heterogeneous),
and there is currently no way to represent that in TableGen.

Given that the kernel has such a high influence on how those cores get
scheduled and preempted, I don't think there's any hope that the
compiler can do a good job of predicting usage or having any real
impact amidst higher-level latencies, such as context switches and
system calls.

-- 
cheers,
--renato

[1] https://en.wikipedia.org/wiki/ARM_big.LITTLE

