[PATCH] D133739: [RISCV][WIP] Form more VW instructions

Wed Sep 14 13:07:23 PDT 2022

reames added a comment.

In D133739#3790353 <https://reviews.llvm.org/D133739#3790353>, @craig.topper wrote:

> In D133739#3790249 <https://reviews.llvm.org/D133739#3790249>, @reames wrote:
>
>> In D133739#3790199 <https://reviews.llvm.org/D133739#3790199>, @craig.topper wrote:
>>
>>> In D133739#3790115 <https://reviews.llvm.org/D133739#3790115>, @reames wrote:
>>>
>>>> Have you looked at allowing the fold into the widening version without the one-use check at all?  This would allow users of the extend which could be widen instructions to use the input of the extend while leaving the extend around for any non-wideable users.
>>>>
>>>> Under the assumption that the widening variants execute at least as fast as the non-widening variants, this wouldn't seem to be problematic from a latency/throughput perspective.
>>>>
>>>> There is a register pressure concern - as we potentially have to keep both the extended and non-extended versions alive where previously, the unextended version might have been dead.  But in principle we have that problem every time we fold e.g. a splat into a .v.x variant of any instruction, and we don't seem to be burnt there.
>>>
>>> The register pressure is worse for large LMUL.
>>
>> Right, but this is a general problem for large LMUL.  i.e. splat and extend are the same with respect to this.
>>
>>> We do have an early clobber on the extend instructions anyway, so the dest already can't reuse the source register.
>>
>> I was wondering about needing to have two copies *live* over instructions between the extend and the last original use of the extend.  So, not at the extend instruction itself, more the live ranges extending past that.
>>
>>> We can only fold one 2x stage of widening. If the original sext/zext is from i8->i32/i64 or from i16->i64, the fold will create a smaller extend for the remaining part. If the original extend doesn't fold into all uses, this increases the number of instructions.
>>
>> We can maybe leave the one use requirement for this version of the transform?  It seems reasonable to have different heuristics for "this folds the extend entirely" and "this allows a narrower extend".  As an aside, it's not really clear to me why the "narrower extend" version is ever profitable.  It would seem neutral at best.
>
> Without looking at any particular implementation: widening from LMUL1 to LMUL4 could be 4 microops, one to write each physical register, followed by another 4 microops for the add or mul, for a total of 8 microops. Whereas widening LMUL1 to LMUL2 could be 2 microops, followed by 4 microops for doing a widening add/mul from LMUL2 to LMUL4, for a total of 6 microops.

So, saying this back to you - reasonable hardware exists which incorporates extends at no additional cost, and the cost of the extend depends on result LMUL.  So folding only part of the extend into the widening op reduces total cost.  Fair enough.
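
To make the arithmetic concrete, here is a rough sketch of that cost model. It assumes, as in the quoted example, that every instruction costs one microop per destination register it writes (i.e. cost equals result LMUL) and that a widening op costs the same as its non-widening form. The function names are made up for illustration; this is not code from the patch and does not describe any particular hardware.

```python
def uops(dest_lmul):
    # Assumption: one microop per destination register written.
    return dest_lmul


def full_extend_then_op(dest_lmul):
    # Extend all the way to dest_lmul, then a non-widening add/mul at dest_lmul.
    return uops(dest_lmul) + uops(dest_lmul)


def partial_extend_then_widening_op(dest_lmul):
    # Extend only to dest_lmul / 2; the widening add/mul does the last 2x step.
    # Assumption: the widening op costs the same as the non-widening op.
    return uops(dest_lmul // 2) + uops(dest_lmul)


# LMUL1 source, LMUL4 result, as in the quoted example.
print(full_extend_then_op(4))              # 4 + 4
print(partial_extend_then_widening_op(4))  # 2 + 4
```

Under these assumptions the partial fold saves half the cost of the original extend, and the saving grows with result LMUL.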

My point about restricting that transform to one use, while not restricting the variant that folds the extend away entirely, still seems reasonable.
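
As a sketch of the split heuristic I have in mind (a hypothetical helper, not the patch's actual code; the name and parameters are invented for illustration):

```python
def should_fold_extend(src_bits, dst_bits, num_uses):
    """Decide whether to fold a sext/zext into a widening user.

    Hypothetical heuristic: src_bits/dst_bits are the element widths of the
    extend's input and result, num_uses is the number of users of the extend.
    """
    if dst_bits < 2 * src_bits or dst_bits % src_bits != 0:
        raise ValueError("not a widening extend")

    if dst_bits == 2 * src_bits:
        # The widening op absorbs the whole extend, so no one-use check.
        # Other users keep using the original extend.
        return True

    # i8->i32/i64 or i16->i64: a narrower extend is left behind, so an
    # extend with other users would cost an extra instruction.
    return num_uses == 1


print(should_fold_extend(16, 32, 3))  # full fold, multiple uses allowed
print(should_fold_extend(8, 32, 2))   # partial fold, rejected with 2 uses
```

The only point of the split is that the multi-use case is allowed exactly when no residual extend is created.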

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133739/new/

https://reviews.llvm.org/D133739