[LLVMdev] RFC: [AArtch64] Removal of spurious mask instructions introduced during legalization

Tue Jul 1 15:33:23 PDT 2014

Hi Louis,

This is interesting. Moreover, it reminds me of the problem that I addressed in PPCTargetLowering::DAGCombineExtBoolTrunc and PPCTargetLowering::DAGCombineTruncBoolExt for i1 (and i32) <-> i64 extensions in the PPC backend. I think you could apply a similar methodology to finding larger clusters of instructions that can benefit from this kind of transformation.

 -Hal

----- Original Message -----
> From: "Louis Gerbarg" <lgg at apple.com>
> To: llvmdev at cs.uiuc.edu
> Sent: Tuesday, July 1, 2014 5:11:11 PM
> Subject: [LLVMdev] RFC: [AArtch64] Removal of spurious mask instructions introduced during legalization
> 
> 
> 
> We have seen a series of performance issues where we are emitting
> unnecessary masks on ARM64 that all have a similar form. Here is an
> example:
> 
> define zeroext i1 @test(i8 zeroext %theChar) align 2 {
> entry:
> %theChar.off = add i8 %theChar, -97
> %0 = icmp ult i8 %theChar.off, 26
> br i1 %0, label %return, label %lor.lhs.false
> 
> lor.lhs.false: ; preds = %entry
> ret i1 false
> 
> return: ; preds = %entry
> ret i1 true
> }
> 
> compiles to:
> 
> .globl _test
> .align 2
> _test: ; @test
> .cfi_startproc
> ; BB#0: ; %entry
> sub w8, w0, #97 ; =97
> and w8, w8, #0xff
> cmp w8, #26 ; =26
> cset w0, lo
> ret
> .cfi_endproc
> 
> but the “and w8, w8, 0xff” is superfluous and could be removed,
> yielding:
> 
> .globl _test
> .align 2
> _test: ; @test
> .cfi_startproc
> ; BB#0: ; %entry
> sub w8, w0, #97 ; =97
> cmp w8, #26 ; =26
> cset w0, lo
> ret
> .cfi_endproc
> 
> The AND is just a single extra instruction, but it is an additional
> data dependency and it sometimes shows up in the middle of things
> like string processing loops and several hot loops in SPEC, so we
> would really like to fix it.
> 
> After looking into it I have determined what is going on is that when
> an i8 or i16 is legalized up to an i32 the extra masks are
> introduced at intermediary steps of the calculation to preserve the
> same overflow behavior that would have occurred if there were native
> 8 or 16 bit data types on the processor. In some cases that is
> necessary, but in many cases (such as the one above) based on the
> type of compares involved and the provenance of the inputs it is
> possible to know that behavior is the same regardless of whether or
> not the AND instruction is there, and thus it is safe to remove it.
> 
> While at first glance it seems like it would be reasonable to handle
> this with a few DAG patterns in the TableGen files it turns out that
> is a less than satisfactory approach. Whether the AND is removable
> is constrained by the values of 2-3 of the inputs and the compare,
> which ends up with a combinatorial explosion of patterns, especially
> once one considers the inputs values may come from slightly
> different nodes such AssertZexts or extending loads (which we need
> to capture since we need to know the intended extension semantics).
> Attached is a file (AndDAG.pdf) that shows what I believe to be a
> canonical form of the sub-DAG we are looking at, the following
> details may vary:
> 
> 1) The orange subtree code be a different form of extension, such a
> load<LD1[%arrayidx], anyext from i8> instead of an AssertZext
> 2) The blue and purple elements could be different constants
> 3) The green element (which is the condition code) could be different
> 4) It could all be feeding a different instruction (in particular an
> AArch64ISD::BRCOND) at the end.
> 5) The inputs code be i16 instead i8
> 
> Rather than try respond reactively to individual cases we see in bugs
> I would like to solve the entire problem. To that end I propose
> introducing an architecture specific DAG combine that recognizes
> this basic tree, and performs the and elimination if legal.
> Recognizing the DAG and recording the value is fairly straight
> forward, but determining when the and is legal is somewhat more
> complicated. Each case has taken me a few minutes, and that is just
> looking at constant values.
> 
> In order to speed of the process I wrote a small command line tool
> that runs over all inputs for both assembly patterns and generates 2
> dimensional charts of the cases where the code is the same with and
> without the mask. Using these charts it is possible to write a
> series of equations to cover all the legal cases, not unlike a
> Carnaugh map. Since these equations need to work for inputs that are
> both i8 and i16 they have to be written symbolically in terms of the
> input types (max representable, min representable, positive,
> negative, 0, etc) in order to be generally applicable. Using that
> fact it is possible to reduce the test case down to i4’s using 0x0f,
> making the tables much more tractable. The tables run from -8 to +15
> in both dimensions expressing all the 32 bit representations of zero
> and sign extended i4 values. I am attaching an example chart
> (ZeroExtendedLO.png), but I actually generate 24 of them (the al,
> nv, vc and vs condition codes are not relevant leaving us with 12
> codes, and then we need 2 charts for each, one when the input is
> sign extended and when when it is zero extended).
> 
> Fox example, a set of covering equations for a zero extended variable
> using for the LO cond code is:
> 
> (AddConstant >= 0 && SubsConstant <= 0)
> || (AddConstant <= 0 && SubsConstant >= 0 && SubsConstant <= int_max)
> || (Subs2 >= Subs2 + uint_max))
> 
> Which would result in an equivalent chart to the one attached when
> using 4 bit integers, but representing it symbolically it also
> expands to 8 and 16 bit integers.
> 
> Attached is a WIP patch that I have made. It is incomplete in several
> ways, but I believe it should be enough to discuss whether or not
> this is reasonable path to pursue. In particular it:
> 
> 1) It is not quite LLVM coding style
> 2) It only works against a AssertZext input
> 3) It only has equations for LO
> 4) It only matches on CSEL, not BRCOND et al
> 6) It pass the CC around as an unsigned
> 7) It makes also several simplifying assumptions that I believe are
> true but need to confirm:
> 1) By the time the DAG gets to me it has been canonicalized such that
> it is of the form:
> (AArch64ISD::SUBS (and (+ variable constant1) mask) constant2)
> 2) The mask is a 2^8-1 or 2^16-1 (though it is possible to extend
> this to arbitrary powers of 2 it doesn’t
> seem like any fronted would generate those)
> 
> If there is consensus this is a reasonable path to pursue then my
> intent is to clean up the patch fill in the other sets of equations,
> modifying the script I used to generate the tables to output JSON or
> something similar, and write a validator to ensure the equations
> 100% accurately match the generated output, then choose a suitable
> subset of test cases to include with the patch.
> 
> Louis
> 
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory