<div><div dir="auto">Your avx512 issue is that the v64i8 GFNI instructions also require avx512bw to be enabled, because avx512bw is what makes v64i8 a supported type. The C intrinsics handling in the front end knows this rule, but since you generated the intrinsic calls yourself, you bypassed that check.</div></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, May 18, 2020 at 6:58 AM Adrien Guinet via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)">Hello everyone,<br>

<br>

Over the last couple of days, I have been experimenting with teaching LLVM how to combine a<br>

set of affine instructions into an instruction that uses the GFNI [1] AVX512 extension,<br>

especially GF2P8AFFINEQB [2]. While the general idea seems to work, I have some questions<br>

about my current implementation (see below). FTR, I have named this transformation<br>

AffineCombineExpr (ACE).<br>

<br>

Let's first introduce the general idea, which is to transform code like this:<br>

<br>

static uint8_t rol2(uint8_t v) {<br>

  return (v<<2)^(v>>6);<br>

}<br>

uint8_t func(uint8_t v) {<br>

  v = rol2(v);<br>

  v ^= 0xAA;<br>

  return v;<br>

}<br>

<br>

into this:<br>

<br>

define zeroext i8 @func(i8 zeroext %v) local_unnamed_addr #0 {<br>

  %0 = insertelement <16 x i8> undef, i8 %v, i64 0<br>

  %1 = call <16 x i8> @llvm.x86.vgf2p8affineqb.128(<16 x i8> %0, <16 x i8> bitcast (<2 x<br>

i64> <i64 4647715923615551520, i64 undef> to <16 x i8>), i8 -86)<br>

  %2 = extractelement <16 x i8> %1, i64 0<br>

  ret i8 %2<br>

}<br>

<br>

(if that's profitable, which might not be the case here, see below)<br>
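<br>
To see why the constants in the IR above are the right ones, here is a minimal sketch (not taken from the pass) that models GF2P8AFFINEQB on a single byte in plain C and checks it against func() for all 256 inputs. The helper names gf2p8affine_byte and check_func_matches_gfni are hypothetical; the bit ordering follows the instruction's documented pseudocode, where result bit i is the parity of (matrix byte 7-i AND input) XORed with bit i of the immediate.<br>

```c
#include <assert.h>
#include <stdint.h>

/* Scalar function from the example above. */
static uint8_t rol2(uint8_t v) { return (uint8_t)((v << 2) ^ (v >> 6)); }

uint8_t func(uint8_t v) {
  v = rol2(v);
  v ^= 0xAA;
  return v;
}

/* Parity (XOR of all bits) of a byte. */
static unsigned parity8(uint8_t b) {
  b ^= b >> 4;
  b ^= b >> 2;
  b ^= b >> 1;
  return b & 1u;
}

/* Software model of GF2P8AFFINEQB applied to one byte (hypothetical helper):
   result bit i = parity(matrix.byte[7 - i] & x) ^ imm.bit[i]. */
uint8_t gf2p8affine_byte(uint64_t matrix, uint8_t x, uint8_t imm) {
  uint8_t r = 0;
  for (int i = 0; i < 8; ++i) {
    uint8_t row = (uint8_t)(matrix >> (8 * (7 - i)));
    unsigned bit = parity8((uint8_t)(row & x)) ^ ((imm >> i) & 1u);
    r |= (uint8_t)(bit << i);
  }
  return r;
}

/* Checks that the constants from the IR reproduce func() on every input. */
void check_func_matches_gfni(void) {
  const uint64_t matrix = 4647715923615551520ULL; /* 0x4080010204081020 */
  const uint8_t imm = (uint8_t)-86;               /* 0xAA */
  for (int v = 0; v < 256; ++v)
    assert(gf2p8affine_byte(matrix, (uint8_t)v, imm) == func((uint8_t)v));
}
```

Calling check_func_matches_gfni() from a test harness exercises the whole input space, which is cheap for 8-bit functions and avoids needing GFNI hardware or an emulator for this particular check.<br>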

<br>

Another more interesting example where we could see potential benefits is this one:<br>

<a href="https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-21dd247f3b8aa49860ae8122fe3ea698R22" rel="noreferrer" target="_blank">https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-21dd247f3b8aa49860ae8122fe3ea698R22</a><br>

<br>

This gets even more interesting with vectorized code, with an example here:<br>

<br>

* original C code: <a href="https://pastebin.com/4JjF7DPu" rel="noreferrer" target="_blank">https://pastebin.com/4JjF7DPu</a><br>

* LLVM IR after clang -O2 -mgfni -mavx2: <a href="https://pastebin.com/Ti0Vm2gj" rel="noreferrer" target="_blank">https://pastebin.com/Ti0Vm2gj</a> [3]<br>

* LLVM IR after ACE (using opt -aggressive-instcombine -S): <a href="https://pastebin.com/2zFU7J6g" rel="noreferrer" target="_blank">https://pastebin.com/2zFU7J6g</a><br>

(the interesting things happen at line 67)<br>

<br>

If, like me, you don't have a GFNI-enabled CPU, you can use Intel SDE [4] to run the<br>

compiled code.<br>

<br>

The code of the pass is available here:<br>

<br>

<a href="https://github.com/aguinet/llvm-project/blob/feature/gfni_combine/llvm/lib/Transforms/AggressiveInstCombine/AffineExprCombine.cpp" rel="noreferrer" target="_blank">https://github.com/aguinet/llvm-project/blob/feature/gfni_combine/llvm/lib/Transforms/AggressiveInstCombine/AffineExprCombine.cpp</a><br>

<br>

And there are test cases here:<br>

<a href="https://github.com/aguinet/llvm-project/tree/feature/gfni_combine/llvm/test/Transforms/AggressiveInstCombine" rel="noreferrer" target="_blank">https://github.com/aguinet/llvm-project/tree/feature/gfni_combine/llvm/test/Transforms/AggressiveInstCombine</a><br>

(aec_*.ll)<br>

<br>

Questions<br>

=========<br>

<br>

The high-level view of the algorithm is the following:<br>

<br>

a) gather, from a basic block, sequences of instructions that process an 8-bit integer using<br>

affine instructions, and generate another 8-bit integer.<br>

b) compute the linear matrix and affine constant related to that set of instructions<br>

c) emit the GFNI instructions<br>
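<br>
Step b) can be illustrated on a black-box 8-bit function, as a sketch only: the actual pass derives the matrix from the LLVM instructions rather than by evaluating the function. The affine constant is f(0), column j of the linear part is f(1&lt;&lt;j) XOR f(0), and the function is affine over GF(2) exactly when these reproduce f on all 256 inputs. The names extract_affine, rol2_xor_aa and times3 are hypothetical, and the packing assumes the GF2P8AFFINEQB layout in which matrix byte 7-i holds the row that produces output bit i.<br>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t (*byte_fn)(uint8_t);

/* Hypothetical helper: recover the GF(2) matrix (packed for GF2P8AFFINEQB)
   and the affine constant of f. Returns false if f is not affine. */
bool extract_affine(byte_fn f, uint64_t *matrix, uint8_t *imm) {
  const uint8_t c = f(0); /* affine constant */
  uint8_t col[8];         /* col[j] = image of input bit j under the linear part */
  for (int j = 0; j < 8; ++j)
    col[j] = (uint8_t)(f((uint8_t)(1u << j)) ^ c);

  /* f is affine iff f(x) == c ^ (XOR of col[j] over the set bits j of x). */
  for (int x = 0; x < 256; ++x) {
    uint8_t y = c;
    for (int j = 0; j < 8; ++j)
      if ((x >> j) & 1)
        y ^= col[j];
    if (y != f((uint8_t)x))
      return false;
  }

  /* Pack rows: bit j of matrix byte (7 - i) is bit i of col[j]. */
  uint64_t m = 0;
  for (int i = 0; i < 8; ++i) {
    uint8_t row = 0;
    for (int j = 0; j < 8; ++j)
      row |= (uint8_t)(((col[j] >> i) & 1u) << j);
    m |= (uint64_t)row << (8 * (7 - i));
  }
  *matrix = m;
  *imm = c;
  return true;
}

/* Same computation as func() in the example above. */
uint8_t rol2_xor_aa(uint8_t v) {
  return (uint8_t)(((v << 2) ^ (v >> 6)) ^ 0xAA);
}

/* Integer multiplication has carries, so it is not affine over GF(2). */
uint8_t times3(uint8_t v) { return (uint8_t)(v * 3); }

void check_extract_affine(void) {
  uint64_t m = 0;
  uint8_t c = 0;
  assert(extract_affine(rol2_xor_aa, &m, &c));
  assert(m == 4647715923615551520ULL && c == 0xAA); /* constants from the IR */
  assert(!extract_affine(times3, &m, &c));
}
```

The exhaustive check is only practical because the domain has 256 values; for wider inputs such as 32x32 the matrix would have to be derived symbolically.<br>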

<br>

Even though the current code showcases the idea, there are quite a few things I'm unhappy<br>

with and would like some advice:<br>

<br>

1) about a): what we want is an analysis that can gather, from a given basic block, the<br>

largest DAG*s* of instructions based on a given predicate (in our case: is this an affine<br>

transformation?), that process an 8-bit value and output another 8-bit value.<br>

<br>

After looking at {Aggressive,}InstCombine, I haven't found exactly what I wanted, so I've<br>

rewritten something from scratch to validate the overall idea. But is there some facility<br>

within LLVM I could reuse for this purpose? This feels like, for instance, the same kind of<br>

analysis that might already be done in ScheduleDAG (?).<br>

<br>

2) profitability: according to [6], the latency of the GFNI instructions is 3 cycles, and<br>

in the general case insertelement and extractelement also map to 3-cycle latency<br>

instructions [8] [9]. This gives the whole replacement a latency of 9 cycles (in the<br>

scalar case). Is there any example of how I can compare this 9-cycle latency against that of the<br>

set of instructions I am replacing? What I do for now is really simple and, I think,<br>

far from reality (see<br>

<a href="https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R115" rel="noreferrer" target="_blank">https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R115</a>)<br>

<br>

3) loop vectorization: related to 2), it seems that we could generate more efficient code<br>

if this instruction combining process happened directly within the loop vectorization<br>

algorithm. Indeed, we could benefit from the cost model analysis that already exists<br>

there, and also tweak the loop unrolling factor to better hide this latency of 3 cycles.<br>

<br>

From the documentation [7], it looks like this needs to happen in the VPlan representation,<br>

but I've had a hard time figuring out where I should plug myself in. Is there any example<br>

that showcases instruction combining within this representation?<br>

<br>

4) I inserted this transformation into AggressiveInstCombine, because it is indeed on<br>

paper a combination of instructions, and the analysis to make it work can be quite costly.<br>

That being said, it seems to be run too early in the pipeline, and just running clang -O3<br>

-mgfni does not combine anything at all (I still have to investigate this). It might be<br>

linked to 1) and the fact that my analysis is very naïve and assumes some "clean" and<br>

optimized LLVM IR.<br>

Question is: where should that kind of transformation happen in the current<br>

optimization pipeline?<br>

<br>

Current limitations/issues<br>

==========================<br>

<br>

1) compile-time performance: I haven't run benchmarks yet to see the compile-time cost of<br>

this. Some efforts have been made on this preliminary version to drop as early as possible<br>

basic blocks that do not seem interesting [5], but that deserves more work<br>

<br>

2) run-time performance: if someone has a GFNI-enabled CPU and can run some benchmarks<br>

(for instance on the aec_vec.ll one), that would be very interesting :)<br>

<br>

3) if we run the "vectorization" test above with avx512 (and thus 64-byte vectors), we<br>

generate the LLVM IR here: <a href="https://pastebin.com/Rwn43N4x" rel="noreferrer" target="_blank">https://pastebin.com/Rwn43N4x</a>, but llc crashes with this<br>

message: <a href="https://pastebin.com/bbSZPFe5" rel="noreferrer" target="_blank">https://pastebin.com/bbSZPFe5</a> . Am I doing something wrong or is there an actual<br>

issue in the X86 backend?<br>

<br>

4) for now we limit ourselves to 8x8 functions, but we might be able to extend this<br>

to bigger inputs/outputs (e.g. 32x32 for CRC32-like functions would be nice)<br>

<br>

Thanks for any help!<br>

<br>

Regards,<br>

<br>

[1] <a href="https://en.wikipedia.org/wiki/AVX-512#GFNI" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/AVX-512#GFNI</a><br>

[2] <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2901" rel="noreferrer" target="_blank">https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2901</a><br>

[3] if you wonder why not -mavx512f, see the section about current limitations/issues<br>

[4]<br>

<a href="https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html" rel="noreferrer" target="_blank">https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html</a><br>

[5]<br>

<a href="https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R308" rel="noreferrer" target="_blank">https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R308</a><br>

[6]<br>

<a href="https://rizmediateknologi.blogspot.com/2019/08/the-ice-lake-benchmark-preview-inside_1.html" rel="noreferrer" target="_blank">https://rizmediateknologi.blogspot.com/2019/08/the-ice-lake-benchmark-preview-inside_1.html</a><br>

[7] <a href="https://llvm.org/docs/Proposals/VectorizationPlan.html" rel="noreferrer" target="_blank">https://llvm.org/docs/Proposals/VectorizationPlan.html</a><br>

[8] <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=insert_epi8&expand=3145" rel="noreferrer" target="_blank">https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=insert_epi8&expand=3145</a><br>

[9]<br>

<a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=extract_epi8&expand=3145,2432" rel="noreferrer" target="_blank">https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=extract_epi8&expand=3145,2432</a><br>


</blockquote></div></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">~Craig</div>