[llvm-dev] [cfe-dev] CFG simplification question, and preservation of branching in the original code

Craig Topper via llvm-dev llvm-dev at lists.llvm.org
Mon Sep 30 14:31:30 PDT 2019


For the MSP430 example, I'm guessing it's InstCombiner::transformSExtICmp
or InstCombiner::transformZExtICmp.
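
If it is one of those, the rewrite being applied is essentially the sign-bit
trick below. A minimal C illustration of the equivalence (assuming a 32-bit
long, as on MSP430; the function name is just for illustration):

  // zext(icmp sge %a, 0) gets rewritten into shift/xor form, i.e. the
  // lshr/trunc/xor sequence in the IR you posted. In C terms:
  int cmpge32_0_folded(long a) {
    return (int)(((unsigned long)a >> 31) ^ 1);  // same value as (a >= 0)
  }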

~Craig


On Mon, Sep 30, 2019 at 2:21 PM Support IMAP <support at sweetwilliamsl.com>
wrote:

> Hi all,
>
> Ok, I just found a much simpler example of the same issue.
>
> Consider the following code
>
> int cmpge32_0(long a) {
>   return a>=0;
> }
>
> Compiled for the MSP430 with -O1 or -Os results in the following:
>
> ; Function Attrs: norecurse nounwind readnone
> define dso_local i16 @cmpge32_0(i32 %a) local_unnamed_addr #0 {
> entry:
>   %a.lobit = lshr i32 %a, 31
>   %0 = trunc i32 %a.lobit to i16
>   %.not = xor i16 %0, 1
>   ret i16 %.not
> }
>
> The backend then turns this into the following totally suboptimal code:
>
> cmpge32_0:
> mov r13, r12
> inv r12
> swpb r12
> mov.b r12, r12
> clrc
> rrc r12
> rra r12
> rra r12
> rra r12
> rra r12
> rra r12
> rra r12
> ret
> .Lfunc_end0:
> .size cmpge32_0, .Lfunc_end0-cmpge32_0
>
>
> The cause of this anomaly is again the presence of the shift instruction
> (%a.lobit = lshr i32 %a, 31) at the IR level, which is hard for the backend
> to handle.
>
> The same C code compiled with -O0 creates the following IR code excerpt
> instead of the lshr-trunc code sequence
>
>   %cmp = icmp sge i32 %0, 0
>   %conv = zext i1 %cmp to i16
>
> This compiles into MUCH better code for the MSP430 architecture (and
> virtually any 16-bit architecture without multi-bit shift instructions).
> It would be desirable for LLVM to just leave the comparison as is for
> -O1 and above as well.
>
>
> So please, can somebody point me to the LLVM class or function that
> performs the transformation of the comparison above into the undesired
> shift, so I can investigate what’s going on and whether there’s something I
> can do?
>
> That would be really appreciated.
>
> John
>
>
> Hi Roman
>
> Not exactly, this is IR after optimizations, this is not what clang
> produces.
> To see what clang produces you want to pass -O0.
> All optimizations beyond that are done by LLVM.
>
>
>
> Ok, I understand that naming convention, but this is still something that
> happens during IR generation and optimisation, and therefore the backend
> can do little about it.
>
> So, what hooks can I actually implement or investigate in order to change
> the undesired behaviour?
>
> John
>
>
> On 30 Sep 2019, at 13:35, Roman Lebedev <lebedev.ri at gmail.com> wrote:
>
> On Mon, Sep 30, 2019 at 11:52 AM Joan Lluch <joan.lluch at icloud.com> wrote:
>
>
> Hi Roman,
>
> Is "test" actually an implementation of a 64-bit-wide multiplication
> compiler-rt builtin?
> Then i'd think the main problem is that it is being optimized in the
> first place, you could end up with endless recursion…
>
>
> No, this is not a compiler-rt builtin. My example is incidentally taken
> from the implementation of a signed multiply, but as said, it has nothing
> to do with rt-builtins; I'm just using that code to show the issue. This
> function can’t create a recursion because it’s named ’test’, unlike any
> rt-builtin. You can replace the multiply in the source code with an
> addition if you want to avoid calling rt-functions, but this does not
> change what I am trying to show. Also, it’s not meant to be 64 bits wide
> but 32, because the targets I’m testing are 16-bit, so ints are 16 bits
> and longs are 32 bits. This is again the function I am testing:
>
>
> long test (long a, long b)
> {
>   int neg = 0;
>   long res;
>
>   if (a < 0)
>   {
>     a = -a;
>     neg = 1;
>   }
>
>   if (b < 0)
>   {
>     b = -b;
>     neg = !neg;
>   }
>
>   res = a*b;
>
>   if (neg)
>     res = -res;
>
>   return res;
> }
>
>
>
> LLVM, not clang.
>
>
> I’m not sure what you mean by that. The LLVM IR code shown is produced by
> running the ‘clang’ command line, so that’s what I am trying to show.
>
> Not exactly, this is IR after optimizations, this is not what clang
> produces.
> To see what clang produces you want to pass -O0.
> All optimizations beyond that are done by LLVM.
>
> So it’s actually the front-end that sometimes performs such undesired
> optimisations, not only the LLVM back-end. This is in part why I am saying
> this is not right. Copied again below is the IR code that gets generated
> for the C code I posted before. This IR code, including the expensive shift
> (%a.lobit = lshr i32 %a, 31), is generated when -mllvm
> -phi-node-folding-threshold=1 is specified on the command line, or when the
> target implements getOperationCost(unsigned Opcode, Type *Ty, Type *OpTy)
> to return TCC_Expensive for operation types that are wider than the default
> target register size.
>
>
>
> ; ModuleID = 'main.c'
> source_filename = "main.c"
> target datalayout =
> "e-m:e-p:16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S16"
> target triple = "msp430"
>
> ; Function Attrs: norecurse nounwind optsize readnone
> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 {
> entry:
> %cmp = icmp slt i32 %a, 0
> %sub = sub nsw i32 0, %a
> %spec.select = select i1 %cmp, i32 %sub, i32 %a
> %a.lobit = lshr i32 %a, 31
> %0 = trunc i32 %a.lobit to i16
> %cmp1 = icmp slt i32 %b, 0
> br i1 %cmp1, label %if.then2, label %if.end4
>
> if.then2:                                         ; preds = %entry
> %sub3 = sub nsw i32 0, %b
> %1 = xor i16 %0, 1
> br label %if.end4
>
> if.end4:                                          ; preds = %if.then2,
> %entry
> %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ]
> %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ]
> %mul = mul nsw i32 %b.addr.0, %spec.select
> %tobool5 = icmp eq i16 %neg.1, 0
> %sub7 = sub nsw i32 0, %mul
> %spec.select18 = select i1 %tobool5, i32 %mul, i32 %sub7
> ret i32 %spec.select18
> }
>
> attributes #0 = { norecurse nounwind optsize readnone
> "correctly-rounded-divide-sqrt-fp-math"="false"
> "disable-tail-calls"="false" "less-precise-fpmad"="false"
> "min-legal-vector-width"="0" "no-frame-pointer-elim"="false"
> "no-infs-fp-math"="false" "no-jump-tables"="false"
> "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false"
> "no-trapping-math"="false" "stack-protector-buffer-size"="8"
> "unsafe-fp-math"="false" "use-soft-float"="false" }
>
> !llvm.module.flags = !{!0}
> !llvm.ident = !{!1}
>
> !0 = !{i32 1, !"wchar_size", i32 2}
> !1 = !{!"clang version 9.0.0 (https://github.com/llvm/llvm-project.git
> 6f7deba43dd25fb7b3eca70f9c388ec9174f455a)"}
>
>
>
> As you can see, a logical shift right by 31 bits (%a.lobit = lshr i32 %a, 31)
> is produced; that’s the fourth instruction in the IR code above. So a shift
> is produced instead of a jump to a new block, as should be the case per the
> C source code.
>
> Just as a matter of information, this is the implementation of the
> getOperationCost function that causes ‘clang’ to correctly replace selects
> with branches (desirable), but also to generate shifts to fold expensive
> selects (undesirable):
>
>
> unsigned CPU74TTIImpl::getOperationCost(unsigned Opcode, Type *Ty,
>                                         Type *OpTy)
> {
>   // Big types are expensive
>   unsigned OpSize = Ty->getScalarSizeInBits();
>   if (OpSize > 16)
>     return TTI::TCC_Expensive;
>
>   return BaseT::getOperationCost(Opcode, Ty, OpTy);
> }
>
> If the getOperationCost function above were not implemented, then clang
> would generate the usual series of ‘selects’. But that is even worse,
> because selects imply speculative execution of expensive instructions, or
> duplicate branching created by the backend, which can’t easily be avoided.
>
> Ideally, the IR code above should just contain an ‘if.then’ block for the
> if (a < 0) statement in the C source code, instead of attempting to replace
> a select with a shift (!)
>
> If you want to play with these two scenarios, (1) IR code generated with
> branches and (2) IR code generated with selects, they can easily be
> reproduced for the MSP430 target by compiling with the following options:
> (1)  -mllvm -phi-node-folding-threshold=1 -c -S -Os
> (2)  -mllvm -phi-node-folding-threshold=2 -c -S -Os
>
> For 16-bit targets without selects, or with expensive selects, the overall
> code is better with (1), because it prevents the creation of a separate jump
> for every ‘select’ that (2) would cause. However, the presence of the
> ‘shift’ instruction in (1) spoils it all.
>
> Again, ideally, the use of shifts as a replacement for selects should be
> avoided, and an ‘if.then’ block should be used, as in the original C code.
>
> I hope this is clear now.
>
> John.
>
>
> On 29 Sep 2019, at 15:57, Roman Lebedev <lebedev.ri at gmail.com> wrote:
>
> On Sun, Sep 29, 2019 at 3:35 PM Joan Lluch via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
> Hi Sanjay,
>
> Actually, CodeGenPrepare::optimizeSelectInst is not doing the best it
> could in some circumstances: the “OptSize” case for targets without select
> support was already mentioned to be detrimental.
>
> For targets that do have selects, but where branches are cheap and
> generally profitable, particularly for expensive operations, the
> optimizeSelectInst function does not do well either. The function tries to
> identify consecutive selects with the same condition in order to avoid
> duplicate branches, which is fine, but then this effort is discarded in
> isFormingBranchFromSelectProfitable because the identified condition is
> used more than once (by the said two consecutive selects, of course), which
> defeats the whole purpose of checking for them, resulting in poor codegen.
>
> Yet another issue is that Clang attempts to replace ‘selects’ arising from
> the source code with supposedly optimised code that is not ok for all
> targets. One example is this:
>
> LLVM, not clang.
>
> long test (long a, long b)
> {
>   int neg = 0;
>   long res;
>
>   if (a < 0)
>   {
>     a = -a;
>     neg = 1;
>   }
>
>   if (b < 0)
>   {
>     b = -b;
>     neg = !neg;
>   }
>
>   res = a*b; //(unsigned long)a / (unsigned long)b;  // will call __udivsi3
>
>   if (neg)
>     res = -res;
>
>   return res;
> }
>
>
> This gets compiled into
>
> ; Function Attrs: norecurse nounwind readnone
> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 {
> entry:
> %cmp = icmp slt i32 %a, 0
> %sub = sub nsw i32 0, %a
> %a.addr.0 = select i1 %cmp, i32 %sub, i32 %a
> %a.lobit = lshr i32 %a, 31
> %0 = trunc i32 %a.lobit to i16
> %cmp1 = icmp slt i32 %b, 0
> br i1 %cmp1, label %if.then2, label %if.end4
>
> if.then2:                                         ; preds = %entry
> %sub3 = sub nsw i32 0, %b
> %1 = xor i16 %0, 1
> br label %if.end4
>
> if.end4:                                          ; preds = %if.then2,
> %entry
> %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ]
> %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ]
> %mul = mul nsw i32 %b.addr.0, %a.addr.0
> %tobool5 = icmp eq i16 %neg.1, 0
> %sub7 = sub nsw i32 0, %mul
> %res.0 = select i1 %tobool5, i32 %mul, i32 %sub7
> ret i32 %res.0
> }
>
> The offending part here is this: %a.lobit = lshr i32 %a, 31. Instead of
> just creating a “select” instruction, as the original code suggested with
> the if (a < 0) { neg = 1; } statements, the front-end produces an lshr,
> which is very expensive for small architectures and makes it very difficult
> for the backend to fold it back into an actual select (or branch). In my
> opinion, the original C code should have produced a “select”, giving the
> backend the opportunity to optimise it if required. I think that the
> frontend should perform only target-independent optimisations.
>
>
> You didn't specify how you compile that code.
> We could also get: https://godbolt.org/z/B-5lj1
> Which can actually be folded further to just
> long test(long a, long b) {
>  return a * b;
> }
> Is "test" actually an implementation of a 64-bit-wide multiplication
> compiler-rt builtin?
> Then i'd think the main problem is that it is being optimized in the
> first place, you could end up with endless recursion...
>
> I posted before my view that LLVM is clearly designed to satisfy big
> targets such as x86 and ARM. This means that, unfortunately, it makes too
> many general assumptions about what’s cheap, without providing enough
> hooks to cancel arbitrary optimisations. As I am implementing backends for
> 8- and 16-bit targets, I find myself doing a lot of work just to reverse
> optimisations that should not have been applied in the first place. My
> example above is an instance of an undesirable code mutation performed by
> the frontend. Existing 8- and 16-bit trunk targets (particularly the MSP430
> and the AVR) are also negatively affected by LLVM’s excessively liberal use
> of shifts.
>
> The CodeGenPrepare::optimizeSelectInst function needs some changes to
> respect targets with no selects, as well as targets that may want to avoid
> expensive speculative execution.
>
> John
>
> Roman
>
> On 25 Sep 2019, at 16:00, Sanjay Patel <spatel at rotateright.com> wrote:
>
> Changing the order of the checks in CodeGenPrepare::optimizeSelectInst()
> sounds good to me.
>
> But you may need to go further for optimum performance. For example, we
> may be canonicalizing math/logic IR patterns into 'select' such as in the
> recent:
> https://reviews.llvm.org/D67799
>
> So if you want those to become ALU ops again rather than branches, then
> you need to do the transform later in the backend. That is, you want to let
> DAGCombiner run its set of transforms on 'select' nodes.
>
> On Wed, Sep 25, 2019 at 4:03 AM Joan Lluch via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>
> Hi Craig,
>
> Thank you for your reply. I have started looking at “CodeGenPrepare” and I
> assume you refer to CodeGenPrepare::optimizeSelectInst. I will try to play
> a bit with that, possibly later today. At first glance, it looks to me that
> for targets that do not support ’select’ at all, the fact that the function
> exits early for ‘OptSize’ can be detrimental, because this will just leave
> ALL existing selects in the code anyway. As said, I will try to play with
> that later, but right now it looks to me that maybe we should check for
> TLI->isSelectSupported earlier in the function, to give more opportunities
> to targets without explicit ’select’ support?
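>
> In case it helps to make that concrete, here is a minimal sketch of the
> kind of override I have in mind (the class name is hypothetical, and I
> have not yet verified that this alone is enough to change codegen):
>
> // Sketch only: report that the ISA has no native select at all, so passes
> // that query the target through TLI have a reason to prefer branches over
> // selects. Declared with 'override' in the corresponding header.
> bool MyTargetTargetLowering::isSelectSupported(SelectSupportKind Kind) const {
>   return false;  // no scalar or vector select instructions
> }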
>
> Thanks
>
> John
>
>
> On 25 Sep 2019, at 08:59, Craig Topper <craig.topper at gmail.com> wrote:
>
> There is code in CodeGenPrepare.cpp that can turn selects into branches
> that tries to account for multiple selects sharing the same condition. It
> doesn't look like either AVR or MSP430 enable that code though.
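>
> For reference, targets that do enable it typically mark predictable selects
> as expensive in their TargetLowering constructor. A rough sketch (class
> name hypothetical; whether this alone gives good MSP430/AVR code is
> something that would need to be measured):
>
> MyTargetTargetLowering::MyTargetTargetLowering(const TargetMachine &TM)
>     : TargetLowering(TM) {
>   // ... existing setup ...
>   // Tell CodeGenPrepare that even well-predicted selects are costly here,
>   // so its select-to-branch heuristics actually consider converting them.
>   PredictableSelectIsExpensive = true;
> }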
>
> ~Craig
>
>
> On Tue, Sep 24, 2019 at 11:27 PM Joan Lluch via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>
> Hi Roman,
>
> Thank you for your reply. I understand your point. I just want to add
> something to clarify my original post in relation to your reply.
>
> There are already-implemented 8-bit and 16-bit backends, namely the AVR and
> the MSP430, which “aggressively convert selects into branches” and which
> already benefit, as they are, from setting “phi-node-folding-threshold” to
> 1 or zero. This is because otherwise Clang will generate several selects
> depending on the same “icmp”. These backends are unable to optimise that,
> and they just create a comparison and a conditional branch for every
> “select” in the IR code, even though the original C code was already
> written in a much better way. So the resulting effect is the presence of
> redundant comparisons and branches in the final code, to the detriment of
> generated code quality.
>
> The above gets improved by setting “phi-node-folding-threshold” to 1,
> because some of these extra ‘selects’ are no longer there, so the backend
> stops generating redundant code.
>
> John.
>
>
>
>
> On 21 Sep 2019, at 14:48, Roman Lebedev <lebedev.ri at gmail.com> wrote:
>
> On Sat, Sep 21, 2019 at 3:18 PM Joan Lluch via cfe-dev
> <cfe-dev at lists.llvm.org> wrote:
>
>
> Hi all,
>
> For my custom architecture, I want to relax the CFG simplification pass
> and any other passes that replace conditional branches.
>
> I found that the replacement of conditional branches by “select” and other
> instructions is often too aggressive, and this causes inefficient code for
> my target, as in most cases branches would be cheaper.
>
> For example, consider the following C code:
>
> long test (long a, long b)
> {
>   int neg = 0;
>   long res;
>
>   if (a < 0)
>   {
>     a = -a;
>     neg = 1;
>   }
>
>   res = a*b;
>
>   if (neg)
>     res = -res;
>
>   return res;
> }
>
>
> This code could be written more simply in C, but it’s just an example to
> show the point.
>
> The code above gets compiled like this (-Oz flag):
>
> ; Function Attrs: minsize norecurse nounwind optsize readnone
> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 {
> entry:
> %cmp = icmp slt i32 %a, 0
> %sub = sub nsw i32 0, %a
> %a.addr.0 = select i1 %cmp, i32 %sub, i32 %a
> %mul = mul nsw i32 %a.addr.0, %b
> %sub2 = sub nsw i32 0, %mul
> %res.0 = select i1 %cmp, i32 %sub2, i32 %mul
> ret i32 %res.0
> }
>
>
> All branching was removed and replaced by ‘select’ instructions. For my
> architecture, it would be desirable to keep the original branches in most
> cases, because even simple 32-bit operations are too expensive to execute
> speculatively, and branches are cheap.
>
> Setting ‘phi-node-folding-threshold’ to 1 or even 0 (instead of the default
> 2) definitely improves the situation in many cases, but Clang still creates
> many instances of ‘select’ instructions, which are detrimental to my
> target. I am unsure about where they are created, as I believe that the
> simplifycfg pass no longer creates them.
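>
> (If it helps anyone reproduce this: one way I know of to find out which
> pass introduces them is to dump the IR after every pass, e.g. by adding
> -mllvm -print-after-all to the same clang command line, and look for the
> first dump where the ‘select’ appears.)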
>
> You definitely can't ban LLVM passes/clang from creating selects.
>
> So the question is: are there any other hooks in clang, or custom code that
> I can implement, to relax the creation of ’select’ instructions and
> preserve the branches in the original C code?
>
> I think this is backwards.
> Sure, you could maybe disable most of the folds that produce selects.
> That may be good for final codegen, but it will also affect other passes,
> since not everything deals with a two-entry PHI as well as with selects.
>
> But, what happens if you still get the select-y IR?
> Doesn't matter how, could be hand-written.
>
> I think you might want to instead aggressively convert selects into
> branches in the backend.
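>
> A sketch of what I mean (constructor lines only; this is roughly the shape
> MSP430/AVR already use for their select lowering, details elided, so treat
> the exact actions as an assumption to verify):
>
> // In the target's TargetLowering constructor:
> setOperationAction(ISD::SELECT,    MVT::i8,  Expand);  // becomes SELECT_CC
> setOperationAction(ISD::SELECT,    MVT::i16, Expand);
> setOperationAction(ISD::SELECT_CC, MVT::i8,  Custom);  // lowered below
> setOperationAction(ISD::SELECT_CC, MVT::i16, Custom);
> // LowerSELECT_CC then produces a target-specific select pseudo marked
> // usesCustomInserter, and EmitInstrWithCustomInserter replaces it with a
> // compare, a conditional branch and a phi, i.e. an explicit branch diamond.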
>
> Thanks,
>
> John
>
> Roman
>

