[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions

Mon May 18 06:58:01 PDT 2020

Hello everyone,

On the last couple of days, I have been experimenting with teaching LLVM how to combine a
set of affine instructions into an instruction that uses the GFNI [1] AVX512 extension,
especially GF2P8AFFINEQB [2]. While the general idea seems to work, I have some questions
about my current implementation (see below). FTR, I have named this transformation
AffineCombineExpr (ACE).

Let's first introduce the general idea, which is to transform code like this:

static uint8_t rol2(uint8_t v) {
  return (v<<2)^(v>>6);
}
uint8_t func(uint8_t v) {
  v = rol2(v);
  v ^= 0xAA;
  return v;
}

into this:

define zeroext i8 @func(i8 zeroext %v) local_unnamed_addr #0 {
  %0 = insertelement <16 x i8> undef, i8 %v, i64 0
  %1 = call <16 x i8> @llvm.x86.vgf2p8affineqb.128(<16 x i8> %0, <16 x i8> bitcast (<2 x
i64> <i64 4647715923615551520, i64 undef> to <16 x i8>), i8 -86)
  %2 = extractelement <16 x i8> %1, i64 0
  ret i8 %2
}

(if that's profitable, which might not be the case here, see below)

Another more interesting example where we could see potential benefits is this one:
https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-21dd247f3b8aa49860ae8122fe3ea698R22

This gets even more interesting with vectorized code, with an example here:

* original C code: https://pastebin.com/4JjF7DPu
* LLVM IR after clang -O2 -mgfni -mavx2: https://pastebin.com/Ti0Vm2gj [3]
* LLVM IR after ACE (using opt -aggressive-instcombine -S): https://pastebin.com/2zFU7J6g
(interesting things happened at line 67)

If, like me, you don't have a GFNI-enabled CPU, you can use Intel SDE [4] to run the
compiled code.

The code of the pass is available here:

https://github.com/aguinet/llvm-project/blob/feature/gfni_combine/llvm/lib/Transforms/AggressiveInstCombine/AffineExprCombine.cpp

And there are test cases here:
https://github.com/aguinet/llvm-project/tree/feature/gfni_combine/llvm/test/Transforms/AggressiveInstCombine
(aec_*.ll)

Questions
=========

The high-level view of the algorithm is the following:

a) gather, from a basic block, suites of instructions that process an 8-bit integer using
affine instructions, and generate another 8-bit integer.
b) compute the linear matrix and affine constant related to that set of instructions
c) emit the GFNI instructions

Even thought the current code showcases the idea, there are quite a few things I'm unhappy
with and would like some advice:

1) about a): what we want is an analysis that can gather, from a given basic block, the
largest DAG*s* of instructions based on a given predicate (in our case: is this an affine
transformation?), that process an 8-bit value and output another 8-bit value.

After looking at {Aggressive,}InstCombine, I hadn't find exactly what I wanted, so I've
rewritten something from scratch to validate the overall idea. But is there some facility
within LLVM I could reuse for this purpose? This feels like for instance the same kind of
analyses that might be already done in ScheduleDAG (?).

2) profitability: according to [6], the latency of the GFNI instructions is 3 cycles. and
in the general case insertelement and extractelement also maps to 3-cycle latency
instructions [8] [9]. This makes the whole replacement a latency of 9 cycles (in the
scalar case). Is there any example on how can I compare this 9 cycles latency against the
set of instructions I am replacing against? What I do for now is really simple and I think
far from reality (see
https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R115)

3) loop vectorization: related to 2), it seems that we could generate more efficient code
if this instruction combining process would happen directly within the loop vectorization
algorithm. Indeed, we could benefit from the cost model analysis that already exists
there, and also tweak the loop unrolling factor to better hide this latency of 3 cycles.

>From the documentation [7], it looks like this need to happen in the VPlan representation,
but I've had a hard time figuring out where I should plug myself in. Is there any example
that showcases instruction combining within this representation?

4) I inserted this transformation into AggressiveInstCombine, because it is indeed on
paper a combination of instructions, and the analysis to make it work can be quite costly.
That being said, it seems to be ran too early in the pipeline, and just running clang -03
-mgfni does not combine anything at all (I still got to investigate this). It might be
linked to 1) and the fact that my analysis is very naïve, and assume some "clean" and
optimized LLVM IR.
Question is: where does that kind of transformation should happen in the current
optimization pipeline?

Current limitations/issues
==========================

1) compile-time performances: I haven't run benchmarks yet to see the compile-cost of
this. Some efforts have been made on this preliminary version to drop as early as possible
basic blocks that does not seem interesting [5], but that deserves more work

2) run-time performances: if someone has a GFNI-enabled CPU and can ran some benchmarks
(for instance on the aec_vec.ll one), that could be very interesting :)

3) if we run the "vectorization" test above with avx512 (and thus 32-bytes vectors), we
generate the LLVM IR here: https://pastebin.com/Rwn43N4x, but llc crashes with this
message: https://pastebin.com/bbSZPFe5 . Am I doing something wrong or is there an actual
issue in the X86 backend?

4) for now we limit ourselves to 8x8 functions, but there are chances we could extend this
to bigger inputs/outputs (eg. 32x32 for CRC32-like functions would be nice)

Thanks for any help!

Regards,

[1] https://en.wikipedia.org/wiki/AVX-512#GFNI
[2] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2901
[3] if you wonder why not -mavx512f , see section about current issues below
[4]
https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html
[5]
https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R308
[6]
https://rizmediateknologi.blogspot.com/2019/08/the-ice-lake-benchmark-preview-inside_1.html
[7] https://llvm.org/docs/Proposals/VectorizationPlan.html
[8] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=insert_epi8&expand=3145
[9]
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=extract_epi8&expand=3145,2432