[PATCH] D133739: [RISCV][WIP] Form more VW instructions

Wed Sep 14 12:17:38 PDT 2022

craig.topper added a comment.

In D133739#3790249 <https://reviews.llvm.org/D133739#3790249>, @reames wrote:

> In D133739#3790199 <https://reviews.llvm.org/D133739#3790199>, @craig.topper wrote:
>
>> In D133739#3790115 <https://reviews.llvm.org/D133739#3790115>, @reames wrote:
>>
>>> Have you looked at allowing the fold into the widening version without the one-use check at all?  This would allow users of the extend which could be widening instructions to use the input of the extend while leaving the extend around for any non-widenable users.
>>>
>>> Under the assumption that the widening variants execute at least as fast as the non-widening variants, this wouldn't seem to be problematic from a latency/throughput perspective.
>>>
>>> There is a register pressure concern - as we potentially have to keep both extended and non-extended version alive where previously, the unextended version might have been dead.  But in principle we have that problem every time we fold e.g. a splat into a .v.x variant of any instruction, and we don't seem to be burnt there.
>>
>> The register pressure is worse for large LMUL.
>
> Right, but this is a general problem for large LMUL.  i.e. splat and extend are the same with respect to this.
>
>> We do have an early clobber on the extend instructions anyway, so the dest already can't reuse the source register.
>
> I was wondering about needing to have two copies *live* over instructions between the extend and the last original use of the extend.  So, not at the extend instruction itself, more the live ranges extending past that.
>
>> We can only fold one 2x stage of widening. If the original sext/zext is from i8->i32/i64 or from i16->i64, the fold will create a smaller extend for the remaining part. If the original extend doesn't fold into all uses, this increases the number of instructions.
>
> We can maybe leave the one use requirement for this version of the transform?  It seems reasonable to have different heuristics for "this folds the extend entirely" and "this allows a narrower extend".  As an aside, it's not really clear to me why the "narrower extend" version is ever profitable.  It would seem neutral at best.

Without looking at any particular implementation: widening from LMUL1 to LMUL4 could take 4 microops, one to write each physical register, followed by another 4 microops for the add or mul, for a total of 8 microops. Widening from LMUL1 to LMUL2, by contrast, could take 2 microops, followed by 4 microops for a widening add/mul from LMUL2 to LMUL4, for a total of 6 microops.
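
The arithmetic above can be sketched as a toy cost model. This assumes, purely for illustration, that every vector instruction issues one microop per destination physical register (that is, as many microops as the destination LMUL); real implementations may behave differently. The function names are hypothetical and are not part of LLVM.

```python
def uops(dest_lmul: int) -> int:
    """Assumed cost: one microop per destination physical register."""
    return dest_lmul


def extend_then_plain_op(src_lmul: int, factor: int) -> int:
    """Full-width extend followed by a non-widening add/mul."""
    dest_lmul = src_lmul * factor
    return uops(dest_lmul) + uops(dest_lmul)


def narrow_extend_then_widening_op(src_lmul: int, factor: int) -> int:
    """Extend to half the final width, then a widening add/mul (one 2x stage)."""
    dest_lmul = src_lmul * factor
    half_lmul = dest_lmul // 2
    # If the extend is only 2x, the widening op absorbs it entirely.
    extend_cost = 0 if factor == 2 else uops(half_lmul)
    return extend_cost + uops(dest_lmul)


if __name__ == "__main__":
    # LMUL1 source widened 4x to LMUL4, as in the comment above.
    print(extend_then_plain_op(1, 4), narrow_extend_then_widening_op(1, 4))
```

Under this assumed model the narrower-extend form saves the microops that the second half of the extend would otherwise spend, which is why it can be profitable rather than merely neutral.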

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D133739/new/

https://reviews.llvm.org/D133739