[PATCH] D108496: [WebAssembly] Lower v2f32 to v2f64 extending loads with promote_low
Thomas Lively via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Aug 25 16:43:13 PDT 2021
tlively marked an inline comment as done.
tlively added a comment.
In D108496#2958980 <https://reviews.llvm.org/D108496#2958980>, @aheejin wrote:
>> Previously extra wide v4f32 to v4f64 extending loads would be legalized to v2f32
>> to v2f64 extending loads, which would then be scalarized by legalization. (v2f32
>> to v2f64 extending loads not produced by legalization were already being emitted
>> correctly.)
>
> Why are v2f32 to v2f64 extending loads that are not produced by legalization currently handled fine, but not the ones produced by legalization? I guess I lack knowledge of the order of transformations within isel..
v2f32 to v2f64 extending loads were previously lowered like this:
SelectionDAG has 7 nodes:
t0: ch = EntryToken
t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
t5: v2f32,ch = load<(load (s64) from %ir.p)> t0, t2, undef:i32
t6: v2f64 = fp_extend t5
t7: ch = WebAssemblyISD::RETURN t0, t6
...
Optimized type-legalized selection DAG: %bb.0 'load_promote_v2f62:'
SelectionDAG has 15 nodes:
t0: ch = EntryToken
t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
t8: i64,ch = load<(load (s64) from %ir.p)> t0, t2, undef:i32
t9: v2i64 = scalar_to_vector t8
t10: v4f32 = bitcast t9
t12: f32 = extract_vector_elt t10, Constant:i32<0>
t13: f64 = fp_extend t12
t15: f32 = extract_vector_elt t10, Constant:i32<1>
t16: f64 = fp_extend t15
t17: v2f64 = BUILD_VECTOR t13, t16
t7: ch = WebAssemblyISD::RETURN t0, t17
but v4f32 to v4f64 extending loads were lowered like this:
SelectionDAG has 14 nodes:
t0: ch = EntryToken
t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
t6: ch = CopyToReg t0, Register:i32 %0, t2
t7: i32 = Constant<0>
t4: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<1>
t9: v4f32,ch = load<(load (s128) from %ir.p)> t6, t4, undef:i32
t10: v4f64 = fp_extend t9
t11: i32,ch = CopyFromReg t0, Register:i32 %0
t12: ch = store<(store (s256))> t6, t10, t11, undef:i32
t13: ch = WebAssemblyISD::RETURN t12
...
Optimized type-legalized selection DAG: %bb.0 'load_promote_v4f64:'
SelectionDAG has 19 nodes:
t0: ch = EntryToken
t4: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<1>
t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
t6: ch = CopyToReg t0, Register:i32 %0, t2
t11: i32,ch = CopyFromReg t0, Register:i32 %0
t17: v2f64,ch = load<(load (s64) from %ir.p, align 16), anyext from v2f32> t6, t4, undef:i32
t22: ch = store<(store (s128), align 32)> t6, t17, t11, undef:i32
t19: i32 = add nuw t4, Constant:i32<8>
t20: v2f64,ch = load<(load (s64) from %ir.p + 8, basealign 16), anyext from v2f32> t6, t19, undef:i32
t24: i32 = add nuw t11, Constant:i32<16>
t25: ch = store<(store (s128) into unknown-address + 16, basealign 32)> t6, t20, t24, undef:i32
t26: ch = TokenFactor t22, t25
t13: ch = WebAssemblyISD::RETURN t26
So the original loads get split up in both cases, but the reason is different. For the narrower vectors, v2f32 is not a legal type, so the load gets split by type legalization; for the wider vectors, it's the v4f64 result of the fp_extend that gets split, with the load in the operand position. I guess the different splitting code paths there lead to completely different DAGs later.
>> This regresses the addressing modes supported for
>> the extloads not produced by legalization, but that's a fine trade off for now.
>
> Can you give an example of this case?
All of the new load_promote tests in simd-offset.ll have to explicitly calculate the load offset rather than just use the offset immediate on the load. Before this change they would use the offset immediate.
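For illustration, a promoting load at a constant offset used to fold the offset into the load's immediate, whereas now the offset has to be added explicitly, roughly like this (a hand-written sketch in WebAssembly text format, not actual test output; the offset 16 and the local index are made up):

  ;; before: the offset is folded into the load's immediate
  local.get 0
  v128.load64_zero offset=16
  f64x2.promote_low_f32x4

  ;; after: the offset is computed with an explicit add
  local.get 0
  i32.const 16
  i32.add
  v128.load64_zero
  f64x2.promote_low_f32x4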
================
Comment at: llvm/lib/Target/WebAssembly/WebAssemblyInstrSIMD.td:1282
+// Lower extending loads to load64_zero + promote_low
+def extloadv2f32 : PatFrag<(ops node:$ptr), (unindexedload node:$ptr)> {
+ let IsLoad = true;
----------------
aheejin wrote:
> Any reason for using `unindexedload` instead of `load`?
`load` sets `IsNonExtLoad = true`, which is not what we want here. Apparently an "unindexed" load is just a normal load, whereas the other kinds of loads (pre- and post-indexed) also perform an address computation and return the updated address as an additional result of the same node.
================
Comment at: llvm/lib/Target/WebAssembly/WebAssemblyInstrSIMD.td:1283
+def extloadv2f32 : PatFrag<(ops node:$ptr), (unindexedload node:$ptr)> {
+ let IsLoad = true;
+ let IsAnyExtLoad = true;
----------------
aheejin wrote:
> Is this necessary? `unindexedload` already sets `IsLoad` to `true`.
I believe so. We are inheriting from `PatFrag`, not from `unindexedload` directly, and I think the whole new fragment has to be marked as a load. Looking at the definitions of e.g. extloadf32, though, we could use `extload` instead of `unindexedload` and get rid of the `IsAnyExtLoad = true` line.
================
Comment at: llvm/test/CodeGen/WebAssembly/simd-load-promote-wide.ll:29
+ ret <4 x double> %v
+}
+
----------------
aheejin wrote:
> Can we do a similar optimization for integer vectors, such as extending `<4 x i32>` to `<4 x i64>`, using `v128.load64_zero` followed by `i64x2.extend_low_i32x4_s/u` (and for other integer types too)? Or is it possible to use `v128.load` to load everything at once and use `i64x2.extend_low_...` and `i64x2.extend_high_...` to extend each part?
>
> I'm just asking for a possibility and of course don't mean we do that in this CL ;)
>
Yes, there is room to further optimize the integer version of this. Right now <4 x i32> to <4 x i64> extending loads are lowered to a pattern of one v128.load followed by some shuffles and vector shifts. That doesn't seem too bad, but it would be better to use the extend_low and extend_high instructions as you suggest instead.
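As a rough sketch (hand-written, not current compiler output; the scratch local is made up), the single-load form you suggest could look like this:

  ;; <4 x i32> -> <4 x i64>: one full-width load, then sign-extend each half
  local.get 0                    ;; pointer argument
  v128.load
  local.tee 1                    ;; scratch local holding the loaded <4 x i32>
  i64x2.extend_low_i32x4_s       ;; low two lanes -> first <2 x i64>
  local.get 1
  i64x2.extend_high_i32x4_s      ;; high two lanes -> second <2 x i64>

Both result halves are just left on the stack here; real output would store them or assign them to locals.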
Even here it might be better to do a v128.load followed by a promote_low, shuffle, and second promote_low rather than doing two v128.load64_zero loads. I'll check with the SIMD experts on that...
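In other words, the choice is roughly between these two sequences (again a hand-written sketch, ignoring the offset/addressing-mode issue discussed above; the shuffle lane indices just move the high 64 bits of the input down to the low 64 bits):

  ;; option A: two zero-extending 64-bit loads, each followed by a promote
  local.get 0
  v128.load64_zero
  f64x2.promote_low_f32x4
  local.get 0
  v128.load64_zero offset=8
  f64x2.promote_low_f32x4

  ;; option B: one full-width load, promote the low half, shuffle the high half
  ;; down, and promote again
  local.get 0
  v128.load
  local.tee 1                    ;; scratch local holding the loaded <4 x float>
  f64x2.promote_low_f32x4
  local.get 1
  local.get 1
  i8x16.shuffle 8 9 10 11 12 13 14 15 8 9 10 11 12 13 14 15
  f64x2.promote_low_f32x4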
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D108496/new/
https://reviews.llvm.org/D108496