[llvm-dev] Use Galois Field New Instructions (GFNI) to combine affine instructions

Craig Topper via llvm-dev llvm-dev at lists.llvm.org
Mon May 18 11:24:43 PDT 2020


I can tell you that your avx512 issue is that the v64i8 gfni instructions
also require avx512bw to be enabled to make v64i8 a supported type. The C
intrinsics handling in the front end knows this rule, but since you
generated the intrinsic calls yourself, you bypassed that.
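
For illustration (a rough, untested sketch; the exact target-features
spelling below is from memory): going through the C intrinsics forces the
right features, e.g. the snippet below only compiles when both -mgfni and
-mavx512bw are enabled, whereas a hand-emitted @llvm.x86.vgf2p8affineqb.512
call presumably needs the equivalent "+gfni,+avx512bw" in the enclosing
function's "target-features" attribute for v64i8 to be legal.

#include <immintrin.h>

// Applies the 8x8 bit matrix held in each 64-bit lane of `matrix` to the
// bytes of the corresponding lane of `x`, then xors 0xAA into each byte.
__m512i apply_affine(__m512i x, __m512i matrix) {
  return _mm512_gf2p8affine_epi64_epi8(x, matrix, 0xAA);
}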

On Mon, May 18, 2020 at 6:58 AM Adrien Guinet via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hello everyone,
>
> Over the last couple of days, I have been experimenting with teaching LLVM
> how to combine a set of affine instructions into an instruction that uses
> the GFNI [1] AVX512 extension, in particular GF2P8AFFINEQB [2]. While the
> general idea seems to work, I have some questions about my current
> implementation (see below). FTR, I have named this transformation
> AffineCombineExpr (ACE).
>
> Let's first introduce the general idea, which is to transform code like
> this:
>
> static uint8_t rol2(uint8_t v) {
>   return (v<<2)^(v>>6);
> }
> uint8_t func(uint8_t v) {
>   v = rol2(v);
>   v ^= 0xAA;
>   return v;
> }
>
> into this:
>
> define zeroext i8 @func(i8 zeroext %v) local_unnamed_addr #0 {
>   %0 = insertelement <16 x i8> undef, i8 %v, i64 0
>   %1 = call <16 x i8> @llvm.x86.vgf2p8affineqb.128(<16 x i8> %0,
>     <16 x i8> bitcast (<2 x i64> <i64 4647715923615551520, i64 undef> to <16 x i8>),
>     i8 -86)
>   %2 = extractelement <16 x i8> %1, i64 0
>   ret i8 %2
> }
>
> (if that's profitable, which might not be the case here, see below)
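>
> For reference, here is what GF2P8AFFINEQB computes per byte, written as a
> small scalar model (my own sketch following the pseudocode of the
> intrinsics guide [2], not code from the pass): the second operand holds
> the 8x8 bit matrix packed one row per byte, and the i8 immediate (-86,
> i.e. 0xAA) is the affine constant.
>
> #include <stdint.h>
>
> static uint8_t parity8(uint8_t v) {
>   v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
>   return v & 1;
> }
>
> // dst.bit[i] = parity(A.byte[7 - i] & x) ^ b.bit[i]
> static uint8_t gf2p8affine_byte(uint64_t A, uint8_t x, uint8_t b) {
>   uint8_t r = 0;
>   for (int i = 0; i < 8; ++i) {
>     uint8_t row = (uint8_t)(A >> (8 * (7 - i)));
>     r |= (uint8_t)((parity8(row & x) ^ ((b >> i) & 1)) << i);
>   }
>   return r;
> }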
>
> Another more interesting example where we could see potential benefits is
> this one:
>
> https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-21dd247f3b8aa49860ae8122fe3ea698R22
>
> This gets even more interesting with vectorized code, with an example here:
>
> * original C code: https://pastebin.com/4JjF7DPu
> * LLVM IR after clang -O2 -mgfni -mavx2: https://pastebin.com/Ti0Vm2gj [3]
> * LLVM IR after ACE (using opt -aggressive-instcombine -S):
> https://pastebin.com/2zFU7J6g
> (the interesting things happen at line 67)
>
> If, like me, you don't have a GFNI-enabled CPU, you can use Intel SDE [4]
> to run the
> compiled code.
>
> The code of the pass is available here:
>
>
> https://github.com/aguinet/llvm-project/blob/feature/gfni_combine/llvm/lib/Transforms/AggressiveInstCombine/AffineExprCombine.cpp
>
> And there are test cases here:
>
> https://github.com/aguinet/llvm-project/tree/feature/gfni_combine/llvm/test/Transforms/AggressiveInstCombine
> (aec_*.ll)
>
> Questions
> =========
>
> The high-level view of the algorithm is the following:
>
> a) gather, from a basic block, chains of instructions that process an
> 8-bit integer using affine operations and produce another 8-bit integer.
> b) compute the linear matrix and affine constant corresponding to that set
> of instructions (see the sketch right after this list)
> c) emit the GFNI instructions
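>
> For b), a sketch of one simple way to do it (illustrative code only, not
> what the pass does; the name and layout are made up): evaluate the affine
> function f computed by the matched DAG, take the constant as f(0) and
> column i of the matrix as f(1 << i) ^ f(0), and double-check affinity by
> re-evaluating all 256 inputs. Packing the columns into the row layout
> expected by GF2P8AFFINEQB is then a separate, purely mechanical step.
>
> #include <stdint.h>
> #include <stdbool.h>
>
> typedef uint8_t (*byte_fn)(uint8_t);
>
> // On success, cols[i] is column i of the linear matrix and *cst the
> // affine constant, i.e. f(x) == M*x ^ *cst over GF(2) for all x.
> static bool derive_affine(byte_fn f, uint8_t cols[8], uint8_t *cst) {
>   *cst = f(0);
>   for (int i = 0; i < 8; ++i)
>     cols[i] = f((uint8_t)(1 << i)) ^ *cst;
>   // Verify that f really is affine by rebuilding it from the matrix.
>   for (int x = 0; x < 256; ++x) {
>     uint8_t y = *cst;
>     for (int i = 0; i < 8; ++i)
>       if ((x >> i) & 1)
>         y ^= cols[i];
>     if (y != f((uint8_t)x))
>       return false;
>   }
>   return true;
> }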
>
> Even though the current code showcases the idea, there are quite a few
> things I'm unhappy with, and I would like some advice:
>
> 1) about a): what we want is an analysis that can gather, from a given
> basic block, the
> largest DAG*s* of instructions based on a given predicate (in our case: is
> this an affine
> transformation?), that process an 8-bit value and output another 8-bit
> value.
>
> After looking at {Aggressive,}InstCombine, I didn't find exactly what I
> wanted, so I wrote something from scratch to validate the overall idea.
> But is there some facility within LLVM I could reuse for this purpose?
> This feels like, for instance, the same kind of analysis that might
> already be done in ScheduleDAG (?).
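>
> To make the kind of gathering I mean concrete, here is a rough sketch
> (illustrative only, not the actual code of the pass; the predicate is
> deliberately incomplete):
>
> #include "llvm/ADT/SetVector.h"
> #include "llvm/ADT/SmallVector.h"
> #include "llvm/IR/BasicBlock.h"
> #include "llvm/IR/Constants.h"
> #include "llvm/IR/Instructions.h"
>
> using namespace llvm;
>
> // xor, and shifts by a constant amount, of i8 values are affine over
> // GF(2); a real predicate has to handle more opcodes (and with constant
> // masks, zext/trunc around them, ...).
> static bool isAffineByteOp(const Instruction &I) {
>   if (!I.getType()->isIntegerTy(8))
>     return false;
>   switch (I.getOpcode()) {
>   case Instruction::Xor:
>     return true;
>   case Instruction::Shl:
>   case Instruction::LShr:
>     return isa<ConstantInt>(I.getOperand(1));
>   default:
>     return false;
>   }
> }
>
> // Grow, from Root, the connected set of affine instructions of the same
> // basic block feeding it; operands left outside the set are the DAG's
> // inputs.
> static SetVector<Instruction *> gatherAffineDAG(Instruction *Root) {
>   SetVector<Instruction *> DAG;
>   SmallVector<Instruction *, 8> Worklist{Root};
>   while (!Worklist.empty()) {
>     Instruction *I = Worklist.pop_back_val();
>     if (!isAffineByteOp(*I) || !DAG.insert(I))
>       continue;
>     for (Value *Op : I->operands())
>       if (auto *OpI = dyn_cast<Instruction>(Op))
>         if (OpI->getParent() == Root->getParent())
>           Worklist.push_back(OpI);
>   }
>   return DAG;
> }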
>
> 2) profitability: according to [6], the latency of the GFNI instructions
> is 3 cycles, and in the general case insertelement and extractelement also
> map to 3-cycle-latency instructions [8] [9]. This makes the whole
> replacement a latency of 9 cycles (in the scalar case). Is there any
> example of how I can compare this 9-cycle latency against that of the set
> of instructions I am replacing? What I do for now is really simple and, I
> think, far from reality (see
>
> https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R115
> )
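>
> What I have in mind is something like the sketch below (not the current
> code; I am not sure TCK_Latency is the right cost kind to query, and the
> hard-coded 9 just mirrors the 3 + 3 + 3 cycles above):
>
> #include "llvm/ADT/ArrayRef.h"
> #include "llvm/Analysis/TargetTransformInfo.h"
> #include "llvm/IR/Instruction.h"
>
> using namespace llvm;
>
> // Compare the target's latency estimate for the instructions the rewrite
> // would remove against a fixed estimate for the insertelement +
> // gf2p8affineqb + extractelement sequence.
> static bool isAffineCombineProfitable(ArrayRef<Instruction *> DAG,
>                                       const TargetTransformInfo &TTI) {
>   int OldLatency = 0;
>   for (const Instruction *I : DAG)
>     OldLatency +=
>         TTI.getInstructionCost(I, TargetTransformInfo::TCK_Latency);
>   const int NewLatency = 9; // insertelement + gf2p8affineqb + extract
>   return OldLatency > NewLatency;
> }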
>
> 3) loop vectorization: related to 2), it seems that we could generate
> more efficient code if this instruction-combining process happened
> directly within the loop vectorization algorithm. Indeed, we could benefit
> from the cost model analysis that already exists there, and also tweak the
> loop unrolling factor to better hide the 3-cycle latency.
>
> From the documentation [7], it looks like this needs to happen in the
> VPlan representation, but I've had a hard time figuring out where I should
> plug in. Is there any example that showcases instruction combining within
> this representation?
>
> 4) I inserted this transformation into AggressiveInstCombine, because on
> paper it is indeed a combination of instructions, and the analysis to make
> it work can be quite costly. That being said, it seems to be run too early
> in the pipeline, and just running clang -O3 -mgfni does not combine
> anything at all (I still have to investigate this). It might be linked to
> 1) and the fact that my analysis is very naïve and assumes some "clean",
> optimized LLVM IR.
> The question is: where should that kind of transformation happen in the
> current optimization pipeline?
>
> Current limitations/issues
> ==========================
>
> 1) compile-time performance: I haven't run benchmarks yet to measure the
> compile-time cost of this. Some effort has been made in this preliminary
> version to drop, as early as possible, basic blocks that do not seem
> interesting [5], but that deserves more work.
>
> 2) run-time performance: if someone has a GFNI-enabled CPU and can run
> some benchmarks (for instance on the aec_vec.ll one), that would be very
> interesting :)
>
> 3) if we run the "vectorization" test above with avx512 (and thus 64-byte
> vectors), we generate the LLVM IR here: https://pastebin.com/Rwn43N4x, but
> llc crashes with this message: https://pastebin.com/bbSZPFe5. Am I doing
> something wrong, or is there an actual issue in the X86 backend?
>
> 4) for now we limit ourselves to 8x8 functions, but chances are we could
> extend this to bigger inputs/outputs (e.g. 32x32 for CRC32-like functions
> would be nice).
>
> Thanks for any help!
>
> Regards,
>
> [1] https://en.wikipedia.org/wiki/AVX-512#GFNI
> [2]
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2901
> [3] if you wonder why not -mavx512f, see the section about current issues
> below
> [4]
>
> https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html
> [5]
>
> https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R308
> [6]
>
> https://rizmediateknologi.blogspot.com/2019/08/the-ice-lake-benchmark-preview-inside_1.html
> [7] https://llvm.org/docs/Proposals/VectorizationPlan.html
> [8]
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=insert_epi8&expand=3145
> [9]
>
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=extract_epi8&expand=3145,2432
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-- 
~Craig

