[llvm-dev] [cfe-dev] CFG simplification question, and preservation of branching in the original code

Sanjay Patel via llvm-dev llvm-dev at lists.llvm.org
Wed Oct 2 05:34:24 PDT 2019


On Tue, Oct 1, 2019 at 5:51 PM Joan Lluch <joan.lluch at icloud.com> wrote:

> In order to better support current and future implementations of small
> processor targets, I wonder whether, instead of attempting a general
> solution, we could implement a set of compiler flags (or code hooks) that
> targets could use to optionally disable undesired IR optimiser actions,
> without affecting any of the current behaviour as long as these flags are
> not used. If we had that, nothing would need to change for the existing
> targets, particularly the major ones, and yet new backend developments and
> small processor targets (AVR, MSP430) could potentially benefit. In my
> opinion, the availability of such options would not defeat the “almost
> entirely target-independent” nature of the IR optimiser, as the
> transformations would still be target-agnostic; targets would simply be
> able to decide which ones are more convenient for them, or disable the
> undesirable ones. Does this sound OK or feasible?
>

Providing target options/overrides to code that is supposed to be
target-independent sounds self-defeating to me. I doubt that proposal would
gain much support.
Of course, if you're customizing LLVM for your own out-of-trunk backend,
you can do anything you'd like if you're willing to maintain the custom
patches.
I suggest that you file a bug report showing an exact problem caused by one
of the icmp/select transforms.
If it can be done using an in-trunk backend like AVR or MSP430, then we'll
have a real example showing the harm and hopefully better ideas about how
to solve the bug.


On Tue, Oct 1, 2019 at 5:51 PM Joan Lluch <joan.lluch at icloud.com> wrote:

> Hi Sanjay,
>
> Thanks for your reply.
>
> So yes, the IR optimizer (instcombine is the specific pass) sometimes
> turns icmp (and select) sequences into ALU ops. Instcombine is almost
> entirely *target-independent* and should remain that way. The (sometimes
> unfortunate) decision to create shifts was made based on popular targets
> of the time (PowerPC and/or x86), and other targets may have suffered
> because of that.
>
>
> Yes, that’s exactly what I found, and what I didn’t anticipate.
>
> We've been trying to reverse those canonicalizations in IR over the past
> few years when the choice is clearly not always optimal, but it's not easy.
> To avoid perf regressions, you first have to make the backend/codegen aware
> of the alternate pattern that includes icmp/select and transform that to
> math/logic (translate instcombine code to DAGCombiner). Then, you have to
> remove the transform from instcombine and replace it with the reverse
> transform. This can uncover infinite loops and other problems within
> instcombine.
>
>
> I understand that. In any case, I am glad that this is at least
> acknowledged as something of a flaw in the LLVM system, particularly for
> the optimal implementation of small processor backends. Since these
> targets generally have cheap branches and usually lack selects and
> multi-bit shifts, they hardly benefit from transformations involving
> shifts or from aggressive attempts to replace jumps.
>
> So to finally answer the question: If you can transform the shift into an
> alternate sequence with a "setcc" DAG node in your target's "ISelLowering"
> code, that's the easiest way forward. Otherwise, you have to weigh the
> impact of each target-independent transform on every target.
>
>
> Yes, that’s what I have been doing all along. My backend contains a lot
> of code just to reverse undesired LLVM transformations, and yet the
> resulting final assembly is not as good as it could be, because it is
> often hard or impossible to identify all sources of improvement. It’s
> ironic that some older, less sophisticated compilers (and GCC) produce
> /much/ better code than LLVM for simple architectures.
>
> In order to better support current and future implementations of small
> processor targets, I wonder whether, instead of attempting a general
> solution, we could implement a set of compiler flags (or code hooks) that
> targets could use to optionally disable undesired IR optimiser actions,
> without affecting any of the current behaviour as long as these flags are
> not used. If we had that, nothing would need to change for the existing
> targets, particularly the major ones, and yet new backend developments and
> small processor targets (AVR, MSP430) could potentially benefit. In my
> opinion, the availability of such options would not defeat the “almost
> entirely target-independent” nature of the IR optimiser, as the
> transformations would still be target-agnostic; targets would simply be
> able to decide which ones are more convenient for them, or disable the
> undesirable ones. Does this sound OK or feasible?
>
> John
>
>
>
> On 1 Oct 2019, at 17:20, Sanjay Patel <spatel at rotateright.com> wrote:
>
> First, let's agree on terminology:
> 1. We're in LLVM. Clang has little or nothing to do with these questions
> from the perspective of LLVM developers.
> 2. The IR optimizer (also known as the middle-end and invoked via 'opt')
> is what takes LLVM IR from a front-end (clang is just 1 example) and
> transforms it to different LLVM IR for easier target-specific optimization.
> 3. The back-end (invoked using 'llc') is what takes LLVM IR and turns it
> into assembly for your target. Codegen is 1 part of the backend.
>
> So yes, the IR optimizer (instcombine is the specific pass) sometimes
> turns icmp (and select) sequences into ALU ops. Instcombine is almost
> entirely *target-independent* and should remain that way. The (sometimes
> unfortunate) decision to create shifts was made based on popular targets
> of the time (PowerPC and/or x86), and other targets may have suffered
> because of that.
>
> We've been trying to reverse those canonicalizations in IR over the past
> few years when the choice is clearly not always optimal, but it's not easy.
> To avoid perf regressions, you first have to make the backend/codegen aware
> of the alternate pattern that includes icmp/select and transform that to
> math/logic (translate instcombine code to DAGCombiner). Then, you have to
> remove the transform from instcombine and replace it with the reverse
> transform. This can uncover infinite loops and other problems within
> instcombine.
>
> It's often not clear which form of IR will lead to better optimizations
> within the IR optimizer itself. We favor the shortest IR sequence in most
> cases. But if there's a tie, we have to make a judgement about what is
> easier to analyze/transform when viewed within a longer sequence of IR.
>
> So to finally answer the question: If you can transform the shift into an
> alternate sequence with a "setcc" DAG node in your target's "ISelLowering"
> code, that's the easiest way forward. Otherwise, you have to weigh the
> impact of each target-independent transform on every target.
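>
> A minimal sketch of that kind of target-side reversal, assuming a
> hypothetical backend named "Foo" (the hook itself is the standard
> PerformDAGCombine override; the target would also call
> setTargetDAGCombine(ISD::SRL) in its TargetLowering constructor so the
> combine fires):
>
> // Sketch only: re-express (srl X, bitwidth-1) as zext(setcc X, 0, setlt),
> // so a cheap compare (and later a branch) can be selected instead of a
> // long multi-bit shift.
> SDValue FooTargetLowering::PerformDAGCombine(SDNode *N,
>                                              DAGCombinerInfo &DCI) const {
>   SelectionDAG &DAG = DCI.DAG;
>   SDLoc DL(N);
>   if (N->getOpcode() != ISD::SRL)
>     return SDValue();
>
>   SDValue X = N->getOperand(0);
>   auto *ShAmt = dyn_cast<ConstantSDNode>(N->getOperand(1));
>   EVT VT = N->getValueType(0);
>   // Only the "extract the sign bit" form is interesting here.
>   if (!ShAmt || ShAmt->getZExtValue() != VT.getScalarSizeInBits() - 1)
>     return SDValue();
>
>   EVT CCVT = getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
>   SDValue IsNeg = DAG.getSetCC(DL, CCVT, X, DAG.getConstant(0, DL, VT),
>                                ISD::SETLT);
>   return DAG.getZExtOrTrunc(IsNeg, DL, VT);
> }
>
> (Whether this is better done in PerformDAGCombine or in a custom lowering
> of the shift is a judgement call; the point is only that the reversal can
> live entirely in the target.)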
>
>
> On Mon, Sep 30, 2019 at 5:31 PM Craig Topper <craig.topper at gmail.com>
> wrote:
>
>> For the MSP430 example, I'm guessing it's InstCombiner::transformSExtICmp
>> or InstCombiner::transformZExtICmp
>>
>> ~Craig
>>
>>
>> On Mon, Sep 30, 2019 at 2:21 PM Support IMAP <support at sweetwilliamsl.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> Ok, I just found a much simpler example of the same issue.
>>>
>>> Consider the following code
>>>
>>> int cmpge32_0(long a) {
>>>   return a>=0;
>>> }
>>>
>>> Compiled for the MSP430 with -O1 or -Os, this results in the following:
>>>
>>> ; Function Attrs: norecurse nounwind readnone
>>> define dso_local i16 @cmpge32_0(i32 %a) local_unnamed_addr #0 {
>>> entry:
>>>   %a.lobit = lshr i32 %a, 31
>>>   %0 = trunc i32 %a.lobit to i16
>>>   %.not = xor i16 %0, 1
>>>   ret i16 %.not
>>> }
>>>
>>> The backend then turns this into the following totally suboptimal code:
>>>
>>> cmpge32_0:
>>> mov r13, r12
>>> inv r12
>>> swpb r12
>>> mov.b r12, r12
>>> clrc
>>> rrc r12
>>> rra r12
>>> rra r12
>>> rra r12
>>> rra r12
>>> rra r12
>>> rra r12
>>> ret
>>> .Lfunc_end0:
>>> .size cmpge32_0, .Lfunc_end0-cmpge32_0
>>>
>>>
>>> The cause of this anomaly is again the presence of the shift instruction
>>> (%a.lobit = lshr i32 %a, 31) at the IR level, which is hard for the
>>> backend to handle.
>>>
>>> The same C code compiled with -O0 produces the following IR excerpt
>>> instead of the lshr-trunc sequence:
>>>
>>>   %cmp = icmp sge i32 %0, 0
>>>   %conv = zext i1 %cmp to i16
>>>
>>> This compiles into MUCH better code for the MSP430 architecture (and
>>> virtually any 16-bit architecture without multi-bit shifts).
>>> It would be desirable for LLVM to just leave the comparison as is,
>>> also at -O1 and above.
>>>
>>>
>>> So please, can somebody point me to the LLVM class or function that
>>> performs the transformation of the comparison above into the undesired
>>> shift, so I can investigate what’s going on and whether there’s something
>>> I can do?
>>>
>>> That would be really appreciated.
>>>
>>> John
>>>
>>>
>>> Hi Roman
>>>
>>> Not exactly, this is IR after optimizations, this is not what clang
>>> produces.
>>> To see what clang produces you want to pass -O0.
>>> All optimizations beyond that are done by LLVM.
>>>
>>>
>>>
>>> OK, I understand that naming convention, but this is still something
>>> that happens at the IR level, before the backend runs, and therefore the
>>> backend can do little about it.
>>>
>>> So, what exactly are the hooks that I can implement or investigate to
>>> change the undesired behaviour?
>>>
>>> John
>>>
>>>
>>> On 30 Sep 2019, at 13:35, Roman Lebedev <lebedev.ri at gmail.com> wrote:
>>>
>>> On Mon, Sep 30, 2019 at 11:52 AM Joan Lluch <joan.lluch at icloud.com>
>>> wrote:
>>>
>>>
>>> Hi Roman,
>>>
>>> Is "test" actually an implementation of a 64-bit-wide multiplication
>>> compiler-rt builtin?
>>> Then I'd think the main problem is that it is being optimized in the
>>> first place, you could end up with endless recursion…
>>>
>>>
>>> No, this is not a compiler-rt builtin. My example is incidentally taken
>>> from the implementation of a signed multiply, but as said, it has nothing
>>> to do with rt-builtins; I'm just using that code to show the issue. This
>>> function can’t create a recursion because it’s named ’test’, unlike any
>>> rt-builtin. You can replace the multiply in the source code with an
>>> addition if you want to avoid calling rt-functions, but this does not
>>> change what I am trying to show. Also, it’s not meant to be 64 bits wide
>>> but 32, because the targets I’m testing are 16-bit, so ints are 16 bits
>>> and longs are 32 bits. This is again the function I am testing:
>>>
>>>
>>> long test (long a, long b)
>>> {
>>>   int neg = 0;
>>>   long res;
>>>
>>>   if (a < 0)
>>>   {
>>>     a = -a;
>>>     neg = 1;
>>>   }
>>>
>>>   if (b < 0)
>>>   {
>>>     b = -b;
>>>     neg = !neg;
>>>   }
>>>
>>>   res = a*b;
>>>
>>>   if (neg)
>>>     res = -res;
>>>
>>>   return res;
>>> }
>>>
>>>
>>>
>>> LLVM, not clang.
>>>
>>>
>>> I’m not sure what you mean by that. The LLVM IR code shown is produced
>>> by running the “clang” command line, so that’s what I am trying to show.
>>>
>>> Not exactly, this is IR after optimizations, this is not what clang
>>> produces.
>>> To see what clang produces you want to pass -O0.
>>> All optimizations beyond that are done by LLVM.
>>>
>>> So it’s actually the front-end that sometimes does such undesired
>>> optimisations, not only the LLVM back-end. This is partly why I am saying
>>> this is not right. Copied again below is the IR code that gets generated
>>> for the C code I posted before. This IR code, including the expensive
>>> shift (%a.lobit = lshr i32 %a, 31), is generated when -mllvm
>>> -phi-node-folding-threshold=1 is specified on the command line, or when
>>> the target implements getOperationCost(unsigned Opcode, Type *Ty,
>>> Type *OpTy) to return TCC_Expensive for operand types bigger than the
>>> default target register size.
>>>
>>>
>>>
>>> ; ModuleID = 'main.c'
>>> source_filename = "main.c"
>>> target datalayout =
>>> "e-m:e-p:16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S16"
>>> target triple = "msp430"
>>>
>>> ; Function Attrs: norecurse nounwind optsize readnone
>>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 {
>>> entry:
>>>   %cmp = icmp slt i32 %a, 0
>>>   %sub = sub nsw i32 0, %a
>>>   %spec.select = select i1 %cmp, i32 %sub, i32 %a
>>>   %a.lobit = lshr i32 %a, 31
>>>   %0 = trunc i32 %a.lobit to i16
>>>   %cmp1 = icmp slt i32 %b, 0
>>>   br i1 %cmp1, label %if.then2, label %if.end4
>>>
>>> if.then2:                                         ; preds = %entry
>>>   %sub3 = sub nsw i32 0, %b
>>>   %1 = xor i16 %0, 1
>>>   br label %if.end4
>>>
>>> if.end4:                                          ; preds = %if.then2, %entry
>>>   %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ]
>>>   %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ]
>>>   %mul = mul nsw i32 %b.addr.0, %spec.select
>>>   %tobool5 = icmp eq i16 %neg.1, 0
>>>   %sub7 = sub nsw i32 0, %mul
>>>   %spec.select18 = select i1 %tobool5, i32 %mul, i32 %sub7
>>>   ret i32 %spec.select18
>>> }
>>>
>>> attributes #0 = { norecurse nounwind optsize readnone
>>> "correctly-rounded-divide-sqrt-fp-math"="false"
>>> "disable-tail-calls"="false" "less-precise-fpmad"="false"
>>> "min-legal-vector-width"="0" "no-frame-pointer-elim"="false"
>>> "no-infs-fp-math"="false" "no-jump-tables"="false"
>>> "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false"
>>> "no-trapping-math"="false" "stack-protector-buffer-size"="8"
>>> "unsafe-fp-math"="false" "use-soft-float"="false" }
>>>
>>> !llvm.module.flags = !{!0}
>>> !llvm.ident = !{!1}
>>>
>>> !0 = !{i32 1, !"wchar_size", i32 2}
>>> !1 = !{!"clang version 9.0.0 (https://github.com/llvm/llvm-project.git
>>> 6f7deba43dd25fb7b3eca70f9c388ec9174f455a)"}
>>>
>>>
>>>
>>> As you can see, Clang produces a logical shift right by 31 bits
>>> (%a.lobit = lshr i32 %a, 31). That’s the fourth instruction in the IR
>>> code above. So a shift is produced instead of a jump to a new block,
>>> as would be the case per the C source code.
>>>
>>> Just for information, this is the implementation of the
>>> getOperationCost function that causes ‘clang’ to correctly replace selects
>>> with branches (desirable), but also to generate shifts to fold expensive
>>> selects (undesirable):
>>>
>>>
>>> unsigned CPU74TTIImpl::getOperationCost(unsigned Opcode, Type *Ty,
>>>                                         Type *OpTy)
>>> {
>>>   // Big types are expensive
>>>   unsigned OpSize = Ty->getScalarSizeInBits();
>>>   if (OpSize > 16)
>>>     return TTI::TCC_Expensive;
>>>
>>>   return BaseT::getOperationCost(Opcode, Ty, OpTy);
>>> }
>>>
>>> If the getOperationCost function above were not implemented, then clang
>>> would generate the usual series of ‘selects’. But that is even worse,
>>> because selects imply either speculative execution of expensive
>>> instructions or duplicate branching created by the backend, neither of
>>> which can easily be avoided.
>>>
>>> Ideally, the IR code above should just place an ‘if.then' block for the
>>> if (a < 0) statement in the C source code, instead of attempting to replace
>>> a select by a shift (!)
>>>
>>> If you want to play with the two scenarios, (1) IR code generated with
>>> branches and (2) IR code generated with selects, they can easily be
>>> reproduced for the MSP430 target by compiling with the following options:
>>> (1)  -mllvm -phi-node-folding-threshold=1 -c -S -Os
>>> (2)  -mllvm -phi-node-folding-threshold=2 -c -S -Os
>>>
>>> For 16-bit targets without selects, or with expensive selects, the
>>> overall code is better with (1), because it prevents the creation of a
>>> separate jump for every ‘select’ that (2) would cause. However, the
>>> presence of the ‘shift’ instruction in (1) spoils it all.
>>>
>>> Again, ideally, the use of shifts as a replacement for selects should be
>>> avoided, and an “if.then” block should be used, as per the original C code.
>>>
>>> I hope this is clear now.
>>>
>>> John.
>>>
>>>
>>> On 29 Sep 2019, at 15:57, Roman Lebedev <lebedev.ri at gmail.com> wrote:
>>>
>>> On Sun, Sep 29, 2019 at 3:35 PM Joan Lluch via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>>
>>> Hi Sanjay,
>>>
>>> Actually, CodeGenPrepare::optimizeSelectInst is not doing the best it
>>> could in some circumstances: the “OptSize” case for targets that do not
>>> support select was already mentioned to be detrimental.
>>>
>>> For targets that do have selects, but where branches are cheap and
>>> generally profitable, particularly for expensive operators, the
>>> optimizeSelectInst function does not do well either. The function tries
>>> to identify consecutive selects with the same condition in order to avoid
>>> duplicate branches, which is fine, but then this effort is discarded in
>>> isFormingBranchFromSelectProfitable because the identified condition is
>>> used more than once (on those same consecutive selects, of course), which
>>> defeats the whole purpose of checking for them and results in poor codegen.
>>>
>>> Yet another issue is that Clang attempts to replace ‘selects’ in the
>>> source code with supposedly optimised code that is not OK for all targets.
>>> One example is this:
>>>
>>> LLVM, not clang.
>>>
>>> long test (long a, long b)
>>> {
>>>   int neg = 0;
>>>   long res;
>>>
>>>   if (a < 0)
>>>   {
>>>     a = -a;
>>>     neg = 1;
>>>   }
>>>
>>>   if (b < 0)
>>>   {
>>>     b = -b;
>>>     neg = !neg;
>>>   }
>>>
>>>   res = a*b; //(unsigned long)a / (unsigned long)b;  // will call __udivsi3
>>>
>>>   if (neg)
>>>     res = -res;
>>>
>>>   return res;
>>> }
>>>
>>>
>>> This gets compiled into
>>>
>>> ; Function Attrs: norecurse nounwind readnone
>>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 {
>>> entry:
>>>   %cmp = icmp slt i32 %a, 0
>>>   %sub = sub nsw i32 0, %a
>>>   %a.addr.0 = select i1 %cmp, i32 %sub, i32 %a
>>>   %a.lobit = lshr i32 %a, 31
>>>   %0 = trunc i32 %a.lobit to i16
>>>   %cmp1 = icmp slt i32 %b, 0
>>>   br i1 %cmp1, label %if.then2, label %if.end4
>>>
>>> if.then2:                                         ; preds = %entry
>>>   %sub3 = sub nsw i32 0, %b
>>>   %1 = xor i16 %0, 1
>>>   br label %if.end4
>>>
>>> if.end4:                                          ; preds = %if.then2, %entry
>>>   %b.addr.0 = phi i32 [ %sub3, %if.then2 ], [ %b, %entry ]
>>>   %neg.1 = phi i16 [ %1, %if.then2 ], [ %0, %entry ]
>>>   %mul = mul nsw i32 %b.addr.0, %a.addr.0
>>>   %tobool5 = icmp eq i16 %neg.1, 0
>>>   %sub7 = sub nsw i32 0, %mul
>>>   %res.0 = select i1 %tobool5, i32 %mul, i32 %sub7
>>>   ret i32 %res.0
>>> }
>>>
>>> The offending part here is this: %a.lobit = lshr i32 %a, 31. Instead
>>> of just creating a “select” instruction, as the original code suggested
>>> with the if (a < 0) { neg = 1; } statements, the front-end produces an
>>> lshr, which is very expensive for small architectures and makes it very
>>> difficult for the backend to fold it back into an actual select (or
>>> branch). In my opinion, the original C code should have produced a
>>> “select”, giving the backend the opportunity to optimise it if required.
>>> I think the frontend should perform only target-independent optimisations.
>>>
>>>
>>> You didn't specify how you compile that code.
>>> We could also get: https://godbolt.org/z/B-5lj1
>>> Which can actually be folded further to just
>>> long test(long a, long b) {
>>>  return a * b;
>>> }
>>> Is "test" actually an implementation of a 64-bit-wide multiplication
>>> compiler-rt builtin?
>>> Then I'd think the main problem is that it is being optimized in the
>>> first place, you could end up with endless recursion...
>>>
>>> I posted before my view that LLVM is clearly designed to satisfy the big
>>> players, such as the x86 and ARM targets. This means that, unfortunately,
>>> it makes too many general assumptions about what’s cheap, without providing
>>> enough hooks to cancel arbitrary optimisations. As I am implementing
>>> backends for 8- and 16-bit targets, I find myself doing a lot of work just
>>> to reverse optimisations that should not have been applied in the first
>>> place. My example above is an instance of an undesirable code mutation
>>> performed by the frontend. Existing 8- and 16-bit trunk targets
>>> (particularly the MSP430 and the AVR) are also negatively affected by the
>>> excessively liberal use of shifts by LLVM.
>>>
>>> The CodeGenPrepare::optimizeSelectInst function needs some changes to
>>> respect targets with no selects, and targets that may want to avoid
>>> expensive speculative execution.
>>>
>>> John
>>>
>>> Roman
>>>
>>> On 25 Sep 2019, at 16:00, Sanjay Patel <spatel at rotateright.com> wrote:
>>>
>>> Changing the order of the checks in CodeGenPrepare::optimizeSelectInst()
>>> sounds good to me.
>>>
>>> But you may need to go further for optimum performance. For example, we
>>> may be canonicalizing math/logic IR patterns into 'select' such as in the
>>> recent:
>>> https://reviews.llvm.org/D67799
>>>
>>> So if you want those to become ALU ops again rather than branches, then
>>> you need to do the transform later in the backend. That is, you want to let
>>> DAGCombiner run its set of transforms on 'select' nodes.
>>>
>>> On Wed, Sep 25, 2019 at 4:03 AM Joan Lluch via cfe-dev <
>>> cfe-dev at lists.llvm.org> wrote:
>>>
>>>
>>> Hi Craig,
>>>
>>> Thank you for your reply. I have started looking at “CodeGenPrepare” and
>>> I assume you refer to CodeGenPrepare::optimizeSelectInst. I will try to
>>> play a bit with that, possibly later today. At first glance, it looks to
>>> me that for targets that do not support ’select’ at all, the fact that the
>>> function exits early for ‘OptSize’ can be detrimental, because this will
>>> just leave ALL existing selects in the code anyway. As said, I will try to
>>> play with that later, but right now it looks to me that maybe we should
>>> check for TLI->isSelectSupported earlier in the function, to give more
>>> opportunities to targets without explicit ’select’ support?
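>>>
>>> Roughly, I mean something like this (an illustrative sketch only, not
>>> actual trunk code; the helper name is made up), checking select support
>>> before honouring OptSize:
>>>
>>> // Decide whether optimizeSelectInst() should leave a select alone.
>>> static bool shouldSkipSelectToBranch(const SelectInst *SI,
>>>                                      const TargetLowering *TLI,
>>>                                      bool OptSize) {
>>>   // Simplified kind selection; the real code also distinguishes
>>>   // vector conditions (VectorMaskSelect).
>>>   TargetLowering::SelectSupportKind Kind =
>>>       SI->getType()->isVectorTy() ? TargetLowering::ScalarCondVecVal
>>>                                   : TargetLowering::ScalarValSelect;
>>>
>>>   // A target with no usable select has to branch anyway, so never skip
>>>   // the conversion for it, not even under -Oz.
>>>   if (!TLI->isSelectSupported(Kind))
>>>     return false;
>>>
>>>   // Otherwise keep the current behaviour: under OptSize, leave the
>>>   // select alone.
>>>   return OptSize;
>>> }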
>>>
>>> Thanks
>>>
>>> John
>>>
>>>
>>> On 25 Sep 2019, at 08:59, Craig Topper <craig.topper at gmail.com> wrote:
>>>
>>> There is code in CodeGenPrepare.cpp that can turn selects into branches,
>>> and it tries to account for multiple selects sharing the same condition.
>>> It doesn't look like either AVR or MSP430 enables that code, though.
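>>>
>>> If I'm reading it right, the gate is the isPredictableSelectExpensive()
>>> flag that isFormingBranchFromSelectProfitable() checks first, so a target
>>> can opt in from its TargetLowering constructor with something like this
>>> sketch ("Foo" is a made-up target name):
>>>
>>> FooTargetLowering::FooTargetLowering(const TargetMachine &TM)
>>>     : TargetLowering(TM) {
>>>   // Tell CodeGenPrepare that even a well-predicted select is costly here,
>>>   // so optimizeSelectInst() will consider re-forming branches.
>>>   PredictableSelectIsExpensive = true;
>>> }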
>>>
>>> ~Craig
>>>
>>>
>>> On Tue, Sep 24, 2019 at 11:27 PM Joan Lluch via cfe-dev <
>>> cfe-dev at lists.llvm.org> wrote:
>>>
>>>
>>> Hi Roman,
>>>
>>> Thank you for your reply. I understand your point. I just want to add
>>> something to clarify my original post in relation to your reply.
>>>
>>> There are already in-tree 8-bit and 16-bit backends, namely the AVR and
>>> the MSP430, which already “aggressively convert selects into branches”,
>>> and which already benefit (as they are) from setting
>>> ‘phi-node-folding-threshold’ to 1 or 0. This is because otherwise Clang
>>> will generate several selects depending on the same “icmp”. These backends
>>> are unable to optimise that, and they just create a comparison and a
>>> conditional branch for every “select” in the IR code, even though the
>>> original C code was already written in a much better way. The resulting
>>> effect is redundant comparisons and branches in the final code, to the
>>> detriment of generated code quality.
>>>
>>> The above is improved by setting ‘phi-node-folding-threshold’ to 1,
>>> because some of these extra ‘selects’ are no longer there, so the backend
>>> stops generating redundant code.
>>>
>>> John.
>>>
>>>
>>>
>>>
>>> On 21 Sep 2019, at 14:48, Roman Lebedev <lebedev.ri at gmail.com> wrote:
>>>
>>> On Sat, Sep 21, 2019 at 3:18 PM Joan Lluch via cfe-dev
>>> <cfe-dev at lists.llvm.org> wrote:
>>>
>>>
>>> Hi all,
>>>
>>> For my custom architecture, I want to relax the CFG simplification pass
>>> and any other passes that replace conditional branches.
>>>
>>> I found that the replacement of conditional branches with “select” and
>>> other instructions is often too aggressive, and this causes inefficient
>>> code for my target, as in most cases branches would be cheaper.
>>>
>>> For example, considering the following c code:
>>>
>>> long test (long a, long b)
>>> {
>>>   int neg = 0;
>>>   long res;
>>>
>>>   if (a < 0)
>>>   {
>>>     a = -a;
>>>     neg = 1;
>>>   }
>>>
>>>   res = a*b;
>>>
>>>   if (neg)
>>>     res = -res;
>>>
>>>   return res;
>>> }
>>>
>>>
>>> This code can be simplified in C, but it’s just an example to show the
>>> point.
>>>
>>> The code above gets compiled like this (-Oz flag):
>>>
>>> ; Function Attrs: minsize norecurse nounwind optsize readnone
>>> define dso_local i32 @test(i32 %a, i32 %b) local_unnamed_addr #0 {
>>> entry:
>>>   %cmp = icmp slt i32 %a, 0
>>>   %sub = sub nsw i32 0, %a
>>>   %a.addr.0 = select i1 %cmp, i32 %sub, i32 %a
>>>   %mul = mul nsw i32 %a.addr.0, %b
>>>   %sub2 = sub nsw i32 0, %mul
>>>   %res.0 = select i1 %cmp, i32 %sub2, i32 %mul
>>>   ret i32 %res.0
>>> }
>>>
>>>
>>> All branching was removed and replaced with ‘select’ instructions. For my
>>> architecture, it would be desirable to keep the original branches in most
>>> cases, because even simple 32-bit operations are too expensive to execute
>>> speculatively, and branches are cheap.
>>>
>>> Setting ‘phi-node-folding-threshold’ to 1 or even 0 (instead of the
>>> default 2) definitely improves the situation in many cases, but Clang
>>> still creates many ‘select’ instructions, which are detrimental to my
>>> target. I am unsure where they are created, as I believe the simplifycfg
>>> pass no longer creates them.
>>>
>>> You definitely can't ban LLVM passes/clang from creating selects.
>>>
>>> So the question is: are there any other hooks in clang, or custom code
>>> that I can implement, to relax the creation of ’select’ instructions and
>>> preserve the branches in the original C code?
>>>
>>> I think this is backwards.
>>> Sure, you could maybe disable most of the folds that produce selects.
>>> That may be good for final codegen, but it will also affect other passes,
>>> since not everything deals with two-node PHIs as well as with selects.
>>>
>>> But, what happens if you still get the select-y IR?
>>> Doesn't matter how, could be hand-written.
>>>
>>> I think you might instead want to aggressively convert selects into
>>> branches in the backend.
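>>>
>>> (For reference, the usual way a backend does that, and roughly what
>>> MSP430 itself does if I remember correctly, is to expand or custom-lower
>>> selects so they become a pseudo that a custom inserter later turns into a
>>> real branch diamond. A sketch, with made-up "Foo" names:
>>>
>>> // In the hypothetical FooTargetLowering constructor:
>>> setOperationAction(ISD::SELECT,    MVT::i16, Expand); // -> SELECT_CC
>>> setOperationAction(ISD::SELECT_CC, MVT::i16, Custom); // -> Foo::Select16
>>>
>>> // LowerSELECT_CC emits the Foo::Select16 pseudo, and its
>>> // EmitInstrWithCustomInserter() expansion builds
>>> //   entry: compare + conditional branch
>>> //   copy:  the "false" value
>>> //   sink:  PHI of the two values
>>> // so the final code branches instead of speculating.)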
>>>
>>> Thanks,
>>>
>>> John
>>>
>>> Roman
>>>