[PATCH] D62908: [PowerPC] Improve float vector gather codegen
Jinsong Ji via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Sep 25 07:37:19 PDT 2019
jsji added inline comments.
================
Comment at: llvm/lib/Target/PowerPC/PPCInstrVSX.td:3986
+ (f32 (load xoaddr:$D)))),
+ (v4f32 (XXPERMDI (XXMRGHW AlignValues.LD32D, AlignValues.LD32C),
+ (XXMRGHW AlignValues.LD32B, AlignValues.LD32A), 3))>;
----------------
nemanjai wrote:
> jsji wrote:
> > jsji wrote:
> > > nemanjai wrote:
> > > > jsji wrote:
> > > > > nemanjai wrote:
> > > > > > jsji wrote:
> > > > > > > What is the benefit of merging 'AB', 'CD', instead of original 'AC', 'BD' then `vmrgew`?
> > > > > > >
> > > > > > > `vmrgew` is 2 cycle ALU instruction, should still be better than 3 cycler `xxpermdi` here.
> > > > > > We should favour the larger register file available to `XXPERMDI` here rather than `VMRGEW`.
> > > > > > Besides, where does the information about `XXPERMDI` taking 3 cycles come from? It is not listed in the UM and a similar instruction (`XXSEL` is a 2 cycle ALU instruction as well).
> > > > > Yes, if the cycles is the same, then we should favor larger reg files.
> > > > > But looks like they are not the same to me.
> > > > >
> > > > > Unfortunately, UM is missing detail cycle information about `xxpermdi`,
> > > > > but since it is a permute instruction, all PM instructions are 3 cycles.
> > > > > `xxsel` is *ALU* instruction, hence it is 2 cycles.
> > > > >
> > > > > And we are modeling that in our scheduling info as well.
> > > > >
> > > > > ```
> > > > > $ grep XXPERMDI llvm/lib/Target/PowerPC/P9InstrResources.td -B 100|grep def -B 3
> > > > > // Three Cycle PM operation. Only one PM unit per superslice so we use the whole
> > > > > // superslice. That includes both exec pipelines (EXECO, EXECE) and one
> > > > > // dispatch.
> > > > > def : InstRW<[P9_PM_3C, IP_EXECO_1C, IP_EXECE_1C, DISP_1C],
> > > > > ```
> > > > I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable. But I do not think it is a given nor does it matter that we made the same assumption in our modeling for the scheduler.
> > > >
> > > > However, none of this proves that this is a 3 cycle instruction - especially since PM instructions typically range in latency between 2 and 3 cycles. Furthermore, since the entire sequence of instructions in the output pattern has access to the full set of VSX registers, I think that the larger register set matters more.
> > > >
> > > > In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.
> > > > But I do not think it is a given
> > >
> > > Yes, agree, we shouldn't. I will try to confirm with HW guys.
> > >
> > > > especially since PM instructions typically range in latency between 2 and 3 cycles.
> > >
> > > As far as I know, all PM instructions are 3 cycles, I never saw a 2 cycle one, did you? If so, let me know, I should fix it in scheduler model.
> > >
> > > > In code where this patch will make a performance difference, the vector gather will likely be in a loop. Our loop unroll factor will likely ensure we gather quite a few vectors per iteration, so access to a wider register set should matter more than shaving 1 cycle on a 10/11 cycle dependent sequence.
> > >
> > > Yes, **likely**...
> > > It depends on the reg pressure and how **likely** we may increase reg dependency or cause additional reg spill.
> > > It **might** not be a good choice most of the time when the reg pressure is not that big.
> > >
> > > Anyhow, I believe you already have some example and important performance data in mind to support your argument.
> > >
> > > But can we at least add some comments here to describe why we make such choice here -- why we prefer
> > > **access to a wider register set** than **shaving 1 cycle** here.
> > >
> > >
> > >
> > >
> > >> I understand that you are making an assumption that the XXPERMDI is a PM instruction. And I agree that this seems perfectly reasonable. But I do not think it is a given...
> > > Yes, agree, we shouldn't. I will try to confirm with HW guys.
> >
> > Confirmed with HW team, comparing to `vmrgew`, `xxpermdi is PM-routed and therefore has the longer latency`
> I think this discussion was quite valuable, so thanks for bringing it up.
> And I do agree that we should add a comment in the code. Perhaps the following comment:
>
> ```
> // Using VMRGEW to assemble the final vector would be a lower latency
> // solution. However, we choose to go with the slightly higher latency
> // XXPERMDI for 2 reasons:
> // 1. This is likely to occur in unrolled loops where regpressure is high, so we
> // want to use the latter as it has access to all 64 VSX registers.
> // 2. Using Altivec instructions in this sequence would likely cause the
> // allocation of Altivec registers even for the loads which in turn would
> // force the use of LXSIWZX for the loads, adding a cycle of latency to
> // each of the loads which would otherwise be able to use LFIWZX.
> ```
The comments looks great. Thanks!
Repository:
rL LLVM
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D62908/new/
https://reviews.llvm.org/D62908
More information about the llvm-commits
mailing list