[PATCH] D57504: RFC: Prototype & Roadmap for vector predication in LLVM

Wed Feb 12 08:23:52 PST 2020

lkcl added a comment.

In D57504#1872310 <https://reviews.llvm.org/D57504#1872310>, @simoll wrote:

> I think i wasn't clear: what i meant to say is that we will not decide how MVL is defined/queried/set in the scope of this RFC... potentially leading to the situation that every target comes with its own set of target intrinsics to do so.

ah yes got you.

>>> This would allow you (and ARM MVE/SVE , RISC-V V) to have their own mechanism for setting/querying `MVL`.
>> 
>> and x86-for-hinting-the-SIMD-length.
> 
> For x86 with scalable types, yes. For "classic" SIMD types `MVL == W` of `<W x type>`

mmm... i don't believe that's a wise choice / decision / assumption.  i am partly-guessing-and-making-architectural-assumptions here: imagine that the (very-well-informed) programmer knows how the pipelines of a particular processor work (and i do mean very well), they know that there are a couple of separate pipelines, one which handles e.g. NxFP32, one which handles MxFP64, but that if you issue SIMD instructions of width N=Mx2, it will result in a "blockage" (stall) and under-utilisation.

*however*... if you issue *half* the workload (i.e. MVL == W/2) for the FP32 instructions interleaved with "full" workload (MVL==W for the FP64 ops), *then*, because of the way that the architecture works the two suites of instructions *will* go to the separate pipelines, *will* get done in parallel, because you're not overloading the exact same 64-bit-wide pipeline entrypoint if you'd done... you get what i'm trying to say?

i think what i'm trying to say works better for MMX (the instructions which shared the FP regfile with SIMD instructions, is that right? or is it SSE?) - there you definitely want control over how much of the regfile is allocated to SIMD and how much remains available for scalar-FP usage, and if MVL == W as a hard-coded assumption, with no "hint", you could end up taking up far more of the FP regfile for SIMD MMX than is efficient / effective.

however... if the compiler could be *explicitly* told, "hey i want you to use only W/2 or W/4 worth of the FP regfile for SIMD operations please, and to automatically create a 2x or 4x loop that makes up for it *as if* you had done a full MVL==W single SIMD instruction", then it becomes possible to create a balance there which will not hammer the L1/L2 cache with LD/ST operations, consuming far more power than necessary, because the SIMD instructions completely dominate the entirety of the FP regfile.
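to make the "2x or 4x loop" idea concrete, here is a rough scalar sketch (not anything from this RFC or from LLVM itself). the function name `fma_strip_mined` and the `mvl` parameter are hypothetical, and the inner loop merely stands in for one SIMD instruction of `mvl` lanes:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical illustration: compute out = a * b + c over all W lanes,
// but touch only `mvl` lanes per "instruction". The inner loop is a
// stand-in for a single SIMD op of width `mvl`.
// Returns how many narrow ops were issued.
std::size_t fma_strip_mined(const std::vector<float>& a,
                            const std::vector<float>& b,
                            const std::vector<float>& c,
                            std::vector<float>& out,
                            std::size_t mvl) {
    const std::size_t w = a.size();
    if (mvl == 0) return 0;  // nothing sensible to do with a zero-width op
    std::size_t ops = 0;
    for (std::size_t base = 0; base < w; base += mvl) {
        const std::size_t end = std::min(base + mvl, w);
        for (std::size_t i = base; i < end; ++i) {
            out[i] = a[i] * b[i] + c[i];
        }
        ++ops;  // one narrow "SIMD instruction" issued
    }
    return ops;
}
```

with W = 8, passing `mvl = 8` issues one op, `mvl = 4` issues two and `mvl = 2` issues four, and the results are identical in all three cases. the only thing that changes is how many lanes (and so how much of the regfile) are live at once.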

we quickly learned from 3D workloads that they are very computationally-intensive and fit a "LD, massive-amounts-of-SIMD-processing, ST" pattern with *very* little in the way of overlaps.  consequently, if the compiler generates:

- LD
- half-the-processing-because-there's-not-enough-registers
- ST-some-temps
- do-some-more-processing
- LD-out-of-temps, do-a-bit-more-processing
- ST

this is horribly, horribly power-inefficient.

so being able to balance the workload, keep things entirely in the regfile even if it means using half-wide (or quarter-wide) SIMD ops and the loops taking twice or 4 times longer in order to avoid the spill into temporary LD/STs, this is far more important than trying to make "individual" SIMD operations (ones that consume far too much of the regfile and result in LD/ST "spill") as wide as possible.
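as a toy illustration of that trade-off (a deliberately over-simplified model, not how any real register allocator works): assume every live vector value of `mvl` lanes occupies `mvl` scalar-sized slots in the regfile, then pick the widest of W, W/2, W/4, ... at which all the live values still fit. `pick_mvl`, `live_values` and `regfile_slots` are all made-up names for this sketch:

```cpp
#include <cstddef>

// Toy model (assumption): a live vector value of `mvl` lanes costs `mvl`
// scalar-sized regfile slots. Returns the widest power-of-two fraction of
// `w` at which `live_values` vectors fit in `regfile_slots` without spilling,
// or 0 if even one lane per value does not fit.
std::size_t pick_mvl(std::size_t w,
                     std::size_t live_values,
                     std::size_t regfile_slots) {
    for (std::size_t mvl = w; mvl >= 1; mvl /= 2) {
        if (live_values * mvl <= regfile_slots) {
            return mvl;  // widest width that stays entirely in registers
        }
    }
    return 0;  // spills are unavoidable in this model
}
```

for example, with W = 8, a 32-slot regfile and 6 live values, full width would need 48 slots, so the model falls back to `mvl = 4` (24 slots): the loop runs twice as long but nothing goes out to temporary LD/STs.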

again, however: i'm raising this not to suggest that it be part of *this* RFC, i'm just documenting it to make sure it's not forgotten, for later.

>>> Besides, i think that defining `MVL` is out of the scope of this RFC given the diversity of scalable vector ISAs right now..
>> 
>> this is cool and exciting.
> 
> Yep, and we wouldn't get near the level of support for this RFC otherwise.

yehyeh.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D57504/new/

https://reviews.llvm.org/D57504