[PATCH] D74165: [x86] [DAGCombine] Prefer shifts of constant widths.

Mon Feb 10 10:57:04 PST 2020

jlebar added inline comments.

================
Comment at: llvm/test/CodeGen/X86/select.ll:1123
+; MCU-NEXT:  .LBB20_1:
+; MCU-NEXT:    shll $3, %eax
 ; MCU-NEXT:    retl
----------------
jlebar wrote:
> RKSimon wrote:
> > craig.topper wrote:
> > > jlebar wrote:
> > > > RKSimon wrote:
> > > > > Regression
> > > > Wow.  What...is...this...architecture.
> > > > 
> > > > I guess the answer is, we don't do this optimization if the target doesn't have cmov?
> > > > 
> > > > I mean, I'll do it, but are we sure the complexity is worth it for this target?  I can't even find an Agner optimization table for Intel MCU (Quark?).
> > > MCU is basically a 486 on a much more modern silicon process.
> > Its not just the MCU case - all of these targets' codegen look worse tbh
> It does look worse.  I wanted to dig in more...
> 
> I somewhat arbitrarily tried Core2, Westmere, and Haswell with the x86-32 and x86-64 code.
> 
>  * *x86-32* `llc < testcases/shift-regression.ll -mtriple=i386-apple-darwin10 --x86-asm-syntax=intel -mcpu=athlon | llvm-mca -mcpu=<arch>`
>  * *x86-64* `llc < testcases/shift-regression.ll -mtriple=amd64-apple-darwin10 --x86-asm-syntax=intel | llvm-mca -mcpu=<arch>`
> 
> I'm reporting "total cycles".
> 
>  * Core2 x86-32: HEAD 211 / patched 304
>  * Core2 x86-64: HEAD 207 / patched 204
>  * Westmere x86-32: HEAD 211 / patched 304
>  * Westmere x86-64: HEAD 207 / patched 204
>  * Haswell x86-32: HEAD 310 / patched 309
>  * Haswell x86-64: HEAD 210 / patched 210
> 
> IOW for these CPUs it's a regression only in the x86-32 code and only in the older CPUs.
> 
> One thing we could do is limit this transformation to x86-64?  Or we could do more investigation, in which case I'm curious to know what kind of evidence you'd need to be comfortable with it.
After thinking about this more...

The reason this testcase is sort of special is because the two shift widths are `N` and `N+1`, therefore it's very easy to convert from the boolean into the shift width. 

In a sense, what's happening here is not `shift x, (select ...)`, but rather `shift x, (add ...)`.  The bug is that I'm transforming the latter, when I only want to touch the former.

I fixed the regression by restricting this optimization to happen only later in the SelectionDAG pipeline, after which the select is folded into an add.  Surprisingly, my change now doesn't affect *any* existing x86 codegen tests, other than the ones that I added.  This makes me a little skeptical I've done something wrong, so I'd appreciate a careful review.

I've also now excluded vector types from this transformation, taking care of the TODO from earlier, and added tests.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D74165/new/

https://reviews.llvm.org/D74165