[PATCH] D132520: [X86] Promote i8/i16 CTTZ (BSF) instructions and remove speculation branch
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Aug 26 23:10:21 PDT 2022
pcordes added a comment.
Looks good to me. Below are some ideas for further minor improvements. (I guess I should be filing these as missed-optimization bug reports now that this commit is in?)
If we only need a 16-bit output, we can potentially use `tzcnt %di, %ax`, at the cost of a false dependency merging into AX. Again, probably only sensible at -Os or -Oz. And of course it requires knowing the instruction will actually run as `tzcnt`, not `bsf`: the `tzcnt` encoding is `rep bsf`, and CPUs without BMI1 ignore the prefix and execute plain `bsf`. `tzcnt` itself produces the operand size (16 here) when the input is zero, which is exactly the behavior we're emulating without it.
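In C++ terms, the zero-input behavior we'd be emulating looks like this (a reference sketch; `tzcnt16_ref` is just an illustrative name):
```
#include <cstdint>

// 16-bit tzcnt semantics: the BSF result for non-zero inputs, but a defined
// result equal to the operand size (16) when the input is zero.
static inline unsigned tzcnt16_ref(uint16_t x) {
    return x ? (unsigned)__builtin_ctz(x) : 16;
}
```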
But `tzcnt` itself already has a false dependency on its output register on Intel CPUs from Sandybridge through Broadwell (fixed in Skylake for tzcnt/lzcnt, but not for popcnt until later).
One way to avoid that (and still emulate `tzcnt` by setting a high bit):
```
leal 65536(%rdi), %eax
rep bsf %eax, %eax
```
If EDI holds a 16-bit function arg that we know is zero-extended to 32-bit, the ADD is equivalent to OR. If it was instead sign-extended to 32-bit, the top half might be all-ones, in which case the carry from adding 65536 would clear it to zero. But the high half being non-zero would imply that the sign bit was set in the low half, so that's actually fine: we only need bit 16 set when the low half is zero, and this does achieve that goal.
(This assumes our caller respects clang's unofficial extension to the x86-64 SysV calling convention where narrow args are extended to 32-bit, which GCC follows but ICC doesn't. If the high half is arbitrary garbage when the low half is zero, no constant we can add can guarantee a non-zero result, let alone one whose lowest set bit is exactly bit 16, so LEA can't work in that case.)
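FWIW, that reasoning is easy to verify exhaustively; a standalone C++ sketch, covering the zero- and sign-extension cases described above:
```
#include <cassert>
#include <cstdint>

int main() {
    // For every 16-bit input, adding 0x10000 to either the zero-extended or
    // the sign-extended 32-bit value must give a non-zero result whose ctz
    // matches 16-bit tzcnt semantics (16 for input zero, BSF result otherwise).
    for (uint32_t x = 0; x <= 0xFFFF; ++x) {
        unsigned want = x ? (unsigned)__builtin_ctz(x) : 16;
        uint32_t zext = x;                              // high half zero
        uint32_t sext = (uint32_t)(int32_t)(int16_t)x;  // high half = sign copies
        assert((unsigned)__builtin_ctz(zext + 0x10000) == want);
        assert((unsigned)__builtin_ctz(sext + 0x10000) == want);
    }
}
```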
================
Comment at: llvm/test/CodeGen/X86/clz.ll:553
+; X86-NEXT: movl $65536, %eax # imm = 0x10000
+; X86-NEXT: orl {{[0-9]+}}(%esp), %eax
; X86-NEXT: rep bsfl %eax, %eax
----------------
This would be even better as
```
movzwl 4(%esp), %eax
orl $65536, %eax
```
Only 2 back-end uops, instead of 3 (a mov-immediate plus a micro-fused load+or). It's pretty much equivalent for the front-end and ROB, but takes one fewer entry in the RS until these execute. The mov-immediate is independent so it can execute early, but it still takes a cycle on an execution unit, and then we're left with load and OR uops, just like with my asm.
The one advantage of the current code is that the mov-immediate could possibly retire from the ROB while later uops are still stalled waiting for the load, but that seems pretty unlikely to matter. (It would take a cache miss on stack memory in a function with stack args, or being at the end of a long dependency chain involving storing the arg.)
That movzx + or-immediate asm is what we get from C++ source that manually sets a bit, `__builtin_ctz(x|0x10000)`
https://godbolt.org/z/d63WM5s4c (clang nightly after this revision landed).
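i.e. from source along these lines (the wrapper function is just for illustration):
```
#include <cstdint>

// With a 16-bit arg on 32-bit x86, this compiles to the movzx load,
// or-immediate, and rep bsf shown above.
unsigned ctz16(uint16_t x) {
    return __builtin_ctz((uint32_t)x | 0x10000);
}
```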
We also get that same asm from `std::countr_zero(*p);`, or through a reference. So it seems this mov-immediate thing only happens with by-value stack args, and isn't a concern for the general case of other addressing modes. That also means it's always a simple addressing mode, ESP- or EBP-relative, where we won't get un-lamination of the micro-fused load+or on Sandy/Ivy Bridge. And of course the movzx load avoids reading outside the pointed-to 16-bit object, which could cross into an unmapped page (or suffer a slowdown from a cache-line split).
Anyway, what we do when dereferencing a pointer, as for `std::countr_zero(*p);`, is slightly more efficient than what we do for a by-value stack arg.
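A minimal pair for comparing the two cases on Godbolt with `-m32` (a sketch; the function names are mine):
```
#include <bit>
#include <cstdint>

int by_value(uint16_t x)          { return std::countr_zero(x); }  // stack arg: mov-immediate + or mem,reg
int by_pointer(const uint16_t *p) { return std::countr_zero(*p); } // movzx load + or-immediate
```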
But it seems this is not specific to CTTZ lowering; we get the same mov-immediate / `or mem,reg` from `return x|0x10000;` alone.
(vs. 64-bit mode correctly choosing `mov %edi, %eax` / `or $65536, %eax`, which puts the mov on the critical path on CPUs without mov-elimination, like Ice Lake with updated microcode :(, but saves code size. In the 32-bit case there's no latency advantage either way: a load has to happen as part of the critical path regardless, and OR with a register source has the same latency as OR with an immediate.)
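A reduced sketch of that test case:
```
#include <cstdint>

// Not CTTZ-specific: a plain bit-set of a 16-bit stack arg shows the same
// mov-immediate / or mem,reg instruction selection at -m32.
uint32_t set_bit16(uint16_t x) {
    return (uint32_t)x | 0x10000;
}
```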
BTW, with `-Oz` at least, we should be using 4-byte `bts $16, %eax` instead of 5-byte `or $65536, %eax` (or 6-byte for registers other than EAX, plus a REX prefix if needed). On Intel CPUs it's still only 1 uop, although it can run on fewer execution ports (p06 in SKL/ICL, only p1 on Alder Lake P-cores). That's appropriate at least for `-Os -mtune=intel` (or any specific Sandybridge-family CPU) if we want to be that fine-grained about instruction selection, probably even `-O2 -mtune=intel`. Although maybe not, since Alder Lake P-cores dropped the throughput to 1/clock, competing with imul and tzcnt/lzcnt/popcnt for that port. So maybe only for a -march before that? But people normally expect -march=haswell to be good on later Intel, and it's a pretty small savings.
On AMD CPUs, `bts $imm, %reg` is 2 uops, so only appropriate for -Oz and maybe -Os.
================
Comment at: llvm/test/CodeGen/X86/clz.ll:560
; X64: # %bb.0:
-; X64-NEXT: testw %di, %di
-; X64-NEXT: je .LBB13_1
-; X64-NEXT: # %bb.2: # %cond.false
+; X64-NEXT: orl $65536, %edi # imm = 0x10000
; X64-NEXT: rep bsfl %edi, %eax
----------------
BTW, with `-Oz` at least, we should be using 4-byte `bts $16, %edi` instead of 6-byte `or $65536, %edi` (or 5-byte for EAX).
On Intel CPUs `bts $i8, %reg` is still only 1 uop, although it can run on fewer execution ports than `or` (p06, the shift ports, in SKL/ICL; only p1 on Alder Lake P-cores). That's appropriate at least for `-Os -mtune=intel` (or any specific Sandybridge-family CPU) if we want to be that fine-grained about instruction selection.
Perhaps even `-O2 -mtune=intel`. Although maybe not, since Alder Lake P-cores dropped the throughput to 1/clock, competing with imul and tzcnt/lzcnt/popcnt for that port. (BTS still has 1-cycle latency there, unlike most of the integer uops that are restricted to port 1, which are typically 3-cycle. Alder Lake E-cores run it as 1 uop with 1-cycle latency to the integer output and 2-cycle latency to the CF output.)
So maybe only for -O2 with a -march before Alder Lake? But people normally expect -march=haswell to be good on later Intel, and it's a pretty small savings, just 2 bytes. OTOH the downside is also pretty small, unless it's used in a loop that bottlenecks on port 1.
On AMD CPUs, `bts $imm, %reg` is 2 uops, so only appropriate for -Oz and maybe -Os.
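(For experiments, the BTS encoding can be forced with GNU C extended asm; a hypothetical x86-only helper:)
```
#include <cstdint>

// Force a btsl encoding instead of letting the compiler pick or-immediate,
// e.g. for code-size or port-pressure experiments.
static inline uint32_t set_bit16_bts(uint32_t x) {
    asm("btsl $16, %0" : "+r"(x) : : "cc");
    return x;
}
```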
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D132520/new/
https://reviews.llvm.org/D132520