[PATCH] D77804: [DAG] Enable ISD::SRL SimplifyMultipleUseDemandedBits handling inside SimplifyDemandedBits (WIP)

Mon May 16 06:36:02 PDT 2022

foad added inline comments.

================
Comment at: llvm/test/CodeGen/AMDGPU/trunc-combine.ll:148
 ; SI-NEXT:    v_or_b32_e32 v0, v0, v1
-; SI-NEXT:    v_lshrrev_b32_e32 v1, 16, v0
+; SI-NEXT:    v_and_b32_e32 v1, s4, v2
 ; SI-NEXT:    s_setpc_b64 s[30:31]
----------------
arsenm wrote:
> arsenm wrote:
> > RKSimon wrote:
> > > arsenm wrote:
> > > > RKSimon wrote:
> > > > > @arsenm @foad Not sure if pulling out the immediate is a good idea or not - shouldn't a u16 immediate be cheap?
> > > > This is worse. Integer constants -16 to 64 and a handful of FP values are free, but 0xffff is not, so it requires materialization.
> > > @arsenm @foad At EuroLLVM Matt suggested that maybe we should increase the tolerance to 2 uses of the large immediates before pulling out the constant?
> > s_mov_b32 K + 2 * v_and_b32_e32 = 16 bytes, 12 cycles
> > 2 * (v_and_b32_e32 K) = 16 bytes, 8 cycles, which is clearly better.
> > 
> > 3 * (v_and_b32_e32 K) = 24 bytes, 12 cycles
> > 
> > So 2 uses of a constant seems plainly better for VOP1/VOP2 ops. Above that, it becomes a code size vs. latency tradeoff.
> This decision is also generally made by SIFoldOperands. Probably need to fix it there and not in the DAG
I'm strongly in favour of never pulling out the constant (or rather, always folding into the instruction) and I have patches to that effect starting with D114643, which I'm hoping to get back to pretty soon.
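
For reference, the byte/cycle arithmetic quoted above can be reproduced with a rough sketch like the one below. The constants are assumptions inferred from the figures in this thread (4-byte base encoding, 4 extra bytes for a 32-bit literal, 4 cycles per instruction), not values taken from the hardware documentation, and the function names are hypothetical.

```python
# Rough cost model, assuming the per-instruction figures implied by the thread.
INSN_BYTES = 4     # assumed: base 32-bit encoding (e.g. v_and_b32_e32)
LITERAL_BYTES = 4  # assumed: extra dword when a 32-bit literal K is attached
INSN_CYCLES = 4    # assumed: issue cost per instruction


def cost_folded(uses):
    """Each use carries its own literal: uses * (v_and_b32_e32 K)."""
    return uses * (INSN_BYTES + LITERAL_BYTES), uses * INSN_CYCLES


def cost_materialized(uses):
    """One s_mov_b32 K, then each use reads the register."""
    size = (INSN_BYTES + LITERAL_BYTES) + uses * INSN_BYTES
    cycles = INSN_CYCLES * (1 + uses)
    return size, cycles


if __name__ == "__main__":
    for uses in (1, 2, 3, 4):
        print(uses, "folded:", cost_folded(uses),
              "materialized:", cost_materialized(uses))
```

Under these assumptions, folding never costs more cycles, and materializing only saves bytes from the third use onwards, which matches the size vs. latency tradeoff described above.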

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D77804/new/

https://reviews.llvm.org/D77804