[PATCH] D62908: [PowerPC] Improve float vector gather codegen

Wed Sep 25 04:59:34 PDT 2019

nemanjai added inline comments.

================
Comment at: llvm/lib/Target/PowerPC/PPCInstrVSX.td:3986
+                                   (f32 (load xoaddr:$D)))),
+              (v4f32 (XXPERMDI (XXMRGHW AlignValues.LD32D, AlignValues.LD32C),
+                               (XXMRGHW AlignValues.LD32B, AlignValues.LD32A), 3))>;
----------------
jsji wrote:
> jsji wrote:
> > nemanjai wrote:
> > > jsji wrote:
> > > > nemanjai wrote:
> > > > > jsji wrote:
> > > > > > What is the benefit of merging 'AB', 'CD', instead of original 'AC', 'BD' then `vmrgew`?
> > > > > > 
> > > > > > `vmrgew` is 2 cycle ALU instruction, should still be better than 3 cycler `xxpermdi` here.
> > > > > We should favour the larger register file available to `XXPERMDI` here rather than `VMRGEW`.
> > > > > Besides, where does the information about `XXPERMDI` taking 3 cycles come from? It is not listed in the UM and a similar instruction (`XXSEL` is a 2 cycle ALU instruction as well).
> > > > Yes, if the cycles is the same, then we should favor larger reg files.
> > > > But looks like they are not the same to me.
> > > > 
> > > > Unfortunately, UM is missing detail cycle information about `xxpermdi`, 
> > > > but since it is a permute instruction, all PM instructions are 3 cycles. 
> > > > `xxsel` is *ALU* instruction, hence it is 2 cycles.
> > > > 
> > > > And we are modeling that in our scheduling info as well.
> > > > 
> > > > ```
> > > > $ grep XXPERMDI llvm/lib/Target/PowerPC/P9InstrResources.td -B 100|grep def -B 3
> > > > // Three Cycle PM operation. Only one PM unit per superslice so we use the whole
> > > > // superslice. That includes both exec pipelines (EXECO, EXECE) and one
> > > > // dispatch.
> > > > def : InstRW<[P9_PM_3C, IP_EXECO_1C, IP_EXECE_1C, DISP_1C],
> > > > ```
> > > I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable. But I do not think it is a given nor does it matter that we made the same assumption in our modeling for the scheduler.
> > > 
> > > However, none of this proves that this is a 3 cycle instruction - especially since PM instructions typically range in latency between 2 and 3 cycles. Furthermore, since the entire sequence of instructions in the output pattern has access to the full set of VSX registers, I think that the larger register set matters more.
> > > 
> > > In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.
> > > But I do not think it is a given
> > 
> > Yes, agree, we shouldn't. I will try to confirm with HW guys. 
> > 
> > > especially since PM instructions typically range in latency between 2 and 3 cycles. 
> > 
> > As far as I know, all PM instructions are 3 cycles, I never saw a 2 cycle one, did you? If so, let me know, I should fix it in scheduler model.
> > 
> > > In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.
> > 
> > Yes, **likely**... 
> > It depends on the reg pressure and how **likely** we may increase reg dependency or cause additional reg spill.
> > It **might** not be a good choice most of the time when the reg pressure is not that big.
> > 
> > Anyhow, I believe you already have some example and important performance data in mind to support your argument.
> > 
> > But can we at least add some comments here to describe why we make such choice here -- why we prefer 
> > **access to a wider register set** than **shaving 1 cycle** here.
> > 
> > 
> > 
> > 
> >> I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable.  But I do not think it is a given...
> > Yes, agree, we shouldn't. I will try to confirm with HW guys.
> 
> Confirmed with HW team, comparing to `vmrgew`, `xxpermdi is PM-routed and therefore has the longer latency`
I think this discussion was quite valuable, so thanks for bringing it up.
And I do agree that we should add a comment in the code. Perhaps the following comment:

```
// Using VMRGEW to assemble the final vector would be a lower latency
// solution. However, we choose to go with the slightly higher latency
// XXPERMDI for 2 reasons:
// 1. This is likely to occur in unrolled loops where regpressure is high, so we
//    want to use the latter as it has access to all 64 VSX registers.
// 2. Using Altivec instructions in this sequence would likely cause the
//    allocation of Altivec registers even for the loads which in turn would
//    force the use of LXSIWZX for the loads, adding a cycle of latency to
//    each of the loads which would otherwise be able to use LFIWZX.
```

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62908/new/

https://reviews.llvm.org/D62908