[PATCH] D140069: [DAGCombiner] Scalarize vectorized loads that are splatted
Jay Foad via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Dec 22 00:06:30 PST 2022
foad added inline comments.
================
Comment at: llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll:1905-1914
define <2 x half> @hi16bits(ptr addrspace(1) %x0, ptr addrspace(1) %x1) {
; GFX9-LABEL: hi16bits:
; GFX9: ; %bb.0: ; %entry
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT: global_load_dword v4, v[0:1], off
-; GFX9-NEXT: global_load_dword v5, v[2:3], off
-; GFX9-NEXT: s_mov_b32 s4, 0x7060302
+; GFX9-NEXT: global_load_dword v2, v[2:3], off
; GFX9-NEXT: s_waitcnt vmcnt(0)
+; GFX9-NEXT: global_load_short_d16 v2, v[0:1], off offset:2
----------------
arsenm wrote:
> ruiling wrote:
> > foad wrote:
> > > luke wrote:
> > > > @foad @ruiling Apologies if I'm pinging the wrong people here, just wanted to get some AMDGPU eyes over this.
> > > > From what I understand this looks like a regression since the two loads aren't dispatched in tandem anymore, there's separate waits. Are there any suggestions as to how to avoid this/are there any target info hooks that might be relevant here?
> > > Yes it looks like a regression but I'm not sure how serious it is.
> > >
> > > The original code did two 4-byte loads even though we only want the upper two bytes of each value. Now we've turned the second one into a 2-byte load that overwrites part of the result of the first load, hence the WAW dependency. Why can't we also turn the first load into a 2-byte load?
> > >
> > > Also @rampitec @arsenm
> > An extra wait usually means serious regression. But I did not see why we need the s_waitcnt here. The `global_load`s should return the value in order, so there is no WAW dependency here, right? @foad
> The waitcnt insertion pass probably doesn't try to understand the tied operands of the d16 loads
Right, at the MIR level it looks like a RAW dependency because the d16 load has a tied read representing the parts of the destination register that are not overwritten. So I guess we *could* fix this in the waitcnt insertion pass. (It sounds similar to the special case for writelane in AMDGPUInsertDelayAlu.)
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D140069/new/
https://reviews.llvm.org/D140069
More information about the llvm-commits
mailing list