[PATCH] D62908: [PowerPC] Improve float vector gather codegen

Tue Sep 24 03:16:18 PDT 2019

nemanjai added inline comments.

================
Comment at: llvm/lib/Target/PowerPC/PPCInstrVSX.td:3986
+                                   (f32 (load xoaddr:$D)))),
+              (v4f32 (XXPERMDI (XXMRGHW AlignValues.LD32D, AlignValues.LD32C),
+                               (XXMRGHW AlignValues.LD32B, AlignValues.LD32A), 3))>;
----------------
jsji wrote:
> nemanjai wrote:
> > jsji wrote:
> > > What is the benefit of merging 'AB', 'CD', instead of original 'AC', 'BD' then `vmrgew`?
> > > 
> > > `vmrgew` is a 2 cycle ALU instruction, so it should still be better than the 3 cycle `xxpermdi` here.
> > We should favour the larger register file available to `XXPERMDI` here rather than `VMRGEW`.
> > Besides, where does the information about `XXPERMDI` taking 3 cycles come from? It is not listed in the UM, and a similar instruction (`XXSEL`) is a 2 cycle ALU instruction as well.
> Yes, if the cycle counts are the same, then we should favor the larger register file.
> But it looks to me like they are not the same.
> 
> Unfortunately, the UM is missing detailed cycle information for `xxpermdi`,
> but it is a permute (PM) instruction, and all PM instructions are 3 cycles.
> `xxsel` is an *ALU* instruction, hence it is 2 cycles.
> 
> And we are modeling that in our scheduling info as well.
> 
> ```
> $ grep XXPERMDI llvm/lib/Target/PowerPC/P9InstrResources.td -B 100|grep def -B 3
> // Three Cycle PM operation. Only one PM unit per superslice so we use the whole
> // superslice. That includes both exec pipelines (EXECO, EXECE) and one
> // dispatch.
> def : InstRW<[P9_PM_3C, IP_EXECO_1C, IP_EXECE_1C, DISP_1C],
> ```
I understand that you are assuming `XXPERMDI` is a PM instruction, and I agree that this seems perfectly reasonable. But I do not think it is a given, nor does it matter that we made the same assumption when modeling the instruction for the scheduler.

In any case, none of this proves that this is a 3 cycle instruction, especially since PM instructions typically range in latency between 2 and 3 cycles. Furthermore, since the entire sequence of instructions in the output pattern has access to the full set of VSX registers, I think that the larger register set matters more.

In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a larger register set should matter more than shaving 1 cycle off a dependent sequence of 10 or 11 cycles.
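
For concreteness, the kind of source that would reach this pattern is a `v4f32` built from four scalar `float` loads. Below is a minimal sketch in C, assuming the GCC/Clang `vector_size` extension. The names `gather4`, `gather_loop`, `base` and `idx` are hypothetical and not taken from the patch or its tests, and whether a given compiler actually selects the `XXMRGHW`/`XXPERMDI` sequence for it depends on the target and flags.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: GCC/Clang vector extension; four 32-bit floats per vector. */
typedef float v4f32 __attribute__((vector_size(16)));

/* Hypothetical helper: build one vector from four independently
 * addressed float loads (a build_vector of four f32 loads). */
static v4f32 gather4(const float *base, const int *idx)
{
    v4f32 v = { base[idx[0]], base[idx[1]], base[idx[2]], base[idx[3]] };
    return v;
}

/* Hypothetical loop: gathers n_vec vectors. Once unrolled, several
 * gather4 results are live in the same iteration. */
void gather_loop(float *out, const float *base, const int *idx, size_t n_vec)
{
    for (size_t i = 0; i < n_vec; ++i) {
        v4f32 v = gather4(base, idx + 4 * i);
        memcpy(out + 4 * i, &v, sizeof v); /* store elements in memory order */
    }
}
```

With the loop unrolled, each iteration keeps several such four-load groups live at once, which is the situation the register-set argument above is about.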

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62908/new/

https://reviews.llvm.org/D62908