[PATCH] D62908: [PowerPC] Improve float vector gather codegen

Tue Sep 24 08:42:27 PDT 2019

jsji added inline comments.

================
Comment at: llvm/lib/Target/PowerPC/PPCInstrVSX.td:3986
+                                   (f32 (load xoaddr:$D)))),
+              (v4f32 (XXPERMDI (XXMRGHW AlignValues.LD32D, AlignValues.LD32C),
+                               (XXMRGHW AlignValues.LD32B, AlignValues.LD32A), 3))>;
----------------
jsji wrote:
> nemanjai wrote:
> > jsji wrote:
> > > nemanjai wrote:
> > > > jsji wrote:
> > > > > What is the benefit of merging 'AB', 'CD', instead of original 'AC', 'BD' then `vmrgew`?
> > > > > 
> > > > > `vmrgew` is 2 cycle ALU instruction, should still be better than 3 cycler `xxpermdi` here.
> > > > We should favour the larger register file available to `XXPERMDI` here rather than `VMRGEW`.
> > > > Besides, where does the information about `XXPERMDI` taking 3 cycles come from? It is not listed in the UM and a similar instruction (`XXSEL` is a 2 cycle ALU instruction as well).
> > > Yes, if the cycles is the same, then we should favor larger reg files.
> > > But looks like they are not the same to me.
> > > 
> > > Unfortunately, UM is missing detail cycle information about `xxpermdi`, 
> > > but since it is a permute instruction, all PM instructions are 3 cycles. 
> > > `xxsel` is *ALU* instruction, hence it is 2 cycles.
> > > 
> > > And we are modeling that in our scheduling info as well.
> > > 
> > > ```
> > > $ grep XXPERMDI llvm/lib/Target/PowerPC/P9InstrResources.td -B 100|grep def -B 3
> > > // Three Cycle PM operation. Only one PM unit per superslice so we use the whole
> > > // superslice. That includes both exec pipelines (EXECO, EXECE) and one
> > > // dispatch.
> > > def : InstRW<[P9_PM_3C, IP_EXECO_1C, IP_EXECE_1C, DISP_1C],
> > > ```
> > I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable. But I do not think it is a given nor does it matter that we made the same assumption in our modeling for the scheduler.
> > 
> > However, none of this proves that this is a 3 cycle instruction - especially since PM instructions typically range in latency between 2 and 3 cycles. Furthermore, since the entire sequence of instructions in the output pattern has access to the full set of VSX registers, I think that the larger register set matters more.
> > 
> > In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.
> > But I do not think it is a given
> 
> Yes, agree, we shouldn't. I will try to confirm with HW guys. 
> 
> > especially since PM instructions typically range in latency between 2 and 3 cycles. 
> 
> As far as I know, all PM instructions are 3 cycles, I never saw a 2 cycle one, did you? If so, let me know, I should fix it in scheduler model.
> 
> > In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.
> 
> Yes, **likely**... 
> It depends on the reg pressure and how **likely** we may increase reg dependency or cause additional reg spill.
> It **might** not be a good choice most of the time when the reg pressure is not that big.
> 
> Anyhow, I believe you already have some example and important performance data in mind to support your argument.
> 
> But can we at least add some comments here to describe why we make such choice here -- why we prefer 
> **access to a wider register set** than **shaving 1 cycle** here.
> 
> 
> 
> 
>> I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable.  But I do not think it is a given...
> Yes, agree, we shouldn't. I will try to confirm with HW guys.

Confirmed with HW team, comparing to `vmrgew`, `xxpermdi is PM-routed and therefore has the longer latency`

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62908/new/

https://reviews.llvm.org/D62908