[PATCH] D62908: [PowerPC] Improve float vector gather codegen

Tue Sep 24 07:47:13 PDT 2019

jsji added inline comments.

================
Comment at: llvm/lib/Target/PowerPC/PPCInstrVSX.td:3986
+                                   (f32 (load xoaddr:$D)))),
+              (v4f32 (XXPERMDI (XXMRGHW AlignValues.LD32D, AlignValues.LD32C),
+                               (XXMRGHW AlignValues.LD32B, AlignValues.LD32A), 3))>;
----------------
nemanjai wrote:
> jsji wrote:
> > nemanjai wrote:
> > > jsji wrote:
> > > > What is the benefit of merging 'AB', 'CD', instead of original 'AC', 'BD' then `vmrgew`?
> > > > 
> > > > `vmrgew` is 2 cycle ALU instruction, should still be better than 3 cycler `xxpermdi` here.
> > > We should favour the larger register file available to `XXPERMDI` here rather than `VMRGEW`.
> > > Besides, where does the information about `XXPERMDI` taking 3 cycles come from? It is not listed in the UM and a similar instruction (`XXSEL` is a 2 cycle ALU instruction as well).
> > Yes, if the cycles is the same, then we should favor larger reg files.
> > But looks like they are not the same to me.
> > 
> > Unfortunately, UM is missing detail cycle information about `xxpermdi`, 
> > but since it is a permute instruction, all PM instructions are 3 cycles. 
> > `xxsel` is *ALU* instruction, hence it is 2 cycles.
> > 
> > And we are modeling that in our scheduling info as well.
> > 
> > ```
> > $ grep XXPERMDI llvm/lib/Target/PowerPC/P9InstrResources.td -B 100|grep def -B 3
> > // Three Cycle PM operation. Only one PM unit per superslice so we use the whole
> > // superslice. That includes both exec pipelines (EXECO, EXECE) and one
> > // dispatch.
> > def : InstRW<[P9_PM_3C, IP_EXECO_1C, IP_EXECE_1C, DISP_1C],
> > ```
> I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable. But I do not think it is a given nor does it matter that we made the same assumption in our modeling for the scheduler.
> 
> However, none of this proves that this is a 3 cycle instruction - especially since PM instructions typically range in latency between 2 and 3 cycles. Furthermore, since the entire sequence of instructions in the output pattern has access to the full set of VSX registers, I think that the larger register set matters more.
> 
> In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.
> But I do not think it is a given

Yes, agree, we shouldn't. I will try to confirm with HW guys. 

> especially since PM instructions typically range in latency between 2 and 3 cycles. 

As far as I know, all PM instructions are 3 cycles, I never saw a 2 cycle one, did you? If so, let me know, I should fix it in scheduler model.

> In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.

Yes, **likely**... 
It depends on the reg pressure and how **likely** we may increase reg dependency or cause additional reg spill.
It **might** not be a good choice most of the time when the reg pressure is not that big.

Anyhow, I believe you already have some example and important performance data in mind to support your argument.

But can we at least add some comments here to describe why we make such choice here -- why we prefer 
**access to a wider register set** than **shaving 1 cycle** here.

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62908/new/

https://reviews.llvm.org/D62908