[PATCH] D108496: [WebAssembly] Lower v2f32 to v2f64 extending loads with promote_low

Thomas Lively via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 25 16:43:13 PDT 2021


tlively marked an inline comment as done.
tlively added a comment.

In D108496#2958980 <https://reviews.llvm.org/D108496#2958980>, @aheejin wrote:

>> Previously extra wide v4f32 to v4f64 extending loads would be legalized to v2f32
>> to v2f64 extending loads, which would then be scalarized by legalization. (v2f32
>> to v2f64 extending loads not produced by legalization were already being emitted
>> correctly.)
>
> Why are v2f32 to v2f64 extending loads not produced by legalization currently handled fine, but not the ones produced by legalization? I guess I lack knowledge of the order of transformations within isel...
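
To make the question concrete, the kind of IR involved looks roughly like this (a sketch; the function and parameter names are hypothetical, not copied from the patch's tests):

  define <2 x double> @load_promote_v2f64_sketch(<2 x float>* %p) {
    %e = load <2 x float>, <2 x float>* %p
    %v = fpext <2 x float> %e to <2 x double>
    ret <2 x double> %v
  }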

v2f32 to v2f64 extending loads were previously lowered like this:

  SelectionDAG has 7 nodes:
    t0: ch = EntryToken
          t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
        t5: v2f32,ch = load<(load (s64) from %ir.p)> t0, t2, undef:i32
      t6: v2f64 = fp_extend t5
    t7: ch = WebAssemblyISD::RETURN t0, t6
  
  ...
  
  Optimized type-legalized selection DAG: %bb.0 'load_promote_v2f64:'
  SelectionDAG has 15 nodes:
    t0: ch = EntryToken
          t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
        t8: i64,ch = load<(load (s64) from %ir.p)> t0, t2, undef:i32
      t9: v2i64 = scalar_to_vector t8
    t10: v4f32 = bitcast t9
          t12: f32 = extract_vector_elt t10, Constant:i32<0>
        t13: f64 = fp_extend t12
          t15: f32 = extract_vector_elt t10, Constant:i32<1>
        t16: f64 = fp_extend t15
      t17: v2f64 = BUILD_VECTOR t13, t16
    t7: ch = WebAssemblyISD::RETURN t0, t17
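
For comparison, the wider case shown next is the same pattern with <4 x float> and <4 x double> (again a sketch with a hypothetical name):

  define <4 x double> @load_promote_v4f64_sketch(<4 x float>* %p) {
    %e = load <4 x float>, <4 x float>* %p
    %v = fpext <4 x float> %e to <4 x double>
    ret <4 x double> %v
  }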

but v4f32 to v4f64 extending loads were lowered like this:

  SelectionDAG has 14 nodes:
    t0: ch = EntryToken
      t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
    t6: ch = CopyToReg t0, Register:i32 %0, t2
    t7: i32 = Constant<0>
            t4: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<1>
          t9: v4f32,ch = load<(load (s128) from %ir.p)> t6, t4, undef:i32
        t10: v4f64 = fp_extend t9
        t11: i32,ch = CopyFromReg t0, Register:i32 %0
      t12: ch = store<(store (s256))> t6, t10, t11, undef:i32
    t13: ch = WebAssemblyISD::RETURN t12
  
  ...
  
  Optimized type-legalized selection DAG: %bb.0 'load_promote_v4f64:'
  SelectionDAG has 19 nodes:
    t0: ch = EntryToken
    t4: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<1>
      t2: i32 = WebAssemblyISD::ARGUMENT TargetConstant:i32<0>
    t6: ch = CopyToReg t0, Register:i32 %0, t2
    t11: i32,ch = CopyFromReg t0, Register:i32 %0
          t17: v2f64,ch = load<(load (s64) from %ir.p, align 16), anyext from v2f32> t6, t4, undef:i32
        t22: ch = store<(store (s128), align 32)> t6, t17, t11, undef:i32
            t19: i32 = add nuw t4, Constant:i32<8>
          t20: v2f64,ch = load<(load (s64) from %ir.p + 8, basealign 16), anyext from v2f32> t6, t19, undef:i32
          t24: i32 = add nuw t11, Constant:i32<16>
        t25: ch = store<(store (s128) into unknown-address + 16, basealign 32)> t6, t20, t24, undef:i32
      t26: ch = TokenFactor t22, t25
    t13: ch = WebAssemblyISD::RETURN t26

So the original loads get split up in both cases, but for different reasons. For the narrower vectors, v2f32 is not a legal type, so the load gets split by type legalization; for the wider vectors, it's the v4f64 result of the fp_extend that gets split, taking the load in the operand position along with it. I guess those different splitting code paths lead to completely different DAGs later on.

>> This regresses the addressing modes supported for
>> the extloads not produced by legalization, but that's a fine trade off for now.
>
> Can you give an example of this case?

All of the new load_promote tests in simd-offset.ll now have to calculate the load offset explicitly rather than folding it into the load's offset immediate; before this change they would use the offset immediate.
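
As an illustration (a sketch, not one of the actual tests in simd-offset.ll): for IR like the following, the constant getelementptr offset used to be folded into the load's offset immediate, whereas with this change it is materialized with an explicit constant and add feeding the load.

  define <2 x double> @load_promote_with_offset_sketch(<2 x float>* %p) {
    ; the +8 byte offset previously became the load's offset immediate
    %q = getelementptr inbounds <2 x float>, <2 x float>* %p, i32 1
    %e = load <2 x float>, <2 x float>* %q
    %v = fpext <2 x float> %e to <2 x double>
    ret <2 x double> %v
  }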



================
Comment at: llvm/lib/Target/WebAssembly/WebAssemblyInstrSIMD.td:1282
+// Lower extending loads to load64_zero + promote_low
+def extloadv2f32 : PatFrag<(ops node:$ptr), (unindexedload node:$ptr)> {
+  let IsLoad = true;
----------------
aheejin wrote:
> Any reason for using `unindexedload` instead of `load`?
`load` sets `IsNonExtLoad = true`, which is not what we want here. Apparently an "unindexed" load is a normal load; the other kinds are pre/post-indexed loads, which also perform an address computation and return the updated address as an extra result of the same instruction.
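
For reference, the relevant upstream fragments in llvm/include/llvm/Target/TargetSelectionDAG.td look roughly like this (paraphrased, not an exact quote):

  // An "unindexed" load: a plain load with no pre-/post-increment addressing.
  def unindexedload : PatFrag<(ops node:$ptr), (ld node:$ptr)> {
    let IsLoad = true;
    let IsUnindexed = true;
  }
  // `load` additionally requires a non-extending load, which is why it can't
  // be used as the base of an extending-load fragment.
  def load : PatFrag<(ops node:$ptr), (unindexedload node:$ptr)> {
    let IsLoad = true;
    let IsNonExtLoad = true;
  }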


================
Comment at: llvm/lib/Target/WebAssembly/WebAssemblyInstrSIMD.td:1283
+def extloadv2f32 : PatFrag<(ops node:$ptr), (unindexedload node:$ptr)> {
+  let IsLoad = true;
+  let IsAnyExtLoad = true;
----------------
aheejin wrote:
> Is this necessary? `unindexedload` already sets `IsLoad` to `true`.
I believe so. We are inheriting from `PatFrag`, not from `unindexedload` directly, and I think the whole new fragment has to be marked as a load. Looking at the definitions of e.g. `extloadf32`, though, we could use `extload` instead of `unindexedload` and get rid of the `IsAnyExtLoad = true` line.
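
Concretely, following the shape of the upstream `extloadf32` definition, the fragment could be written along these lines (a sketch of that alternative, not the code in this patch; it assumes the fragment also pins `MemoryVT`, which the quoted context above does not show):

  // Upstream (roughly): extload = an unindexed load that may be any-extending.
  def extload : PatFrag<(ops node:$ptr), (unindexedload node:$ptr)> {
    let IsLoad = true;
    let IsAnyExtLoad = true;
  }
  // The new fragment could then inherit from extload and only constrain the
  // memory type:
  def extloadv2f32 : PatFrag<(ops node:$ptr), (extload node:$ptr)> {
    let IsLoad = true;
    let MemoryVT = v2f32;
  }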


================
Comment at: llvm/test/CodeGen/WebAssembly/simd-load-promote-wide.ll:29
+  ret <4 x double> %v
+}
+
----------------
aheejin wrote:
> Can we do a similar optimization for integer vectors, such as extending `<4 x i32>` to `<4 x i64>`, using `v128.load64_zero` followed by `i64x2.extend_low_i32x4_s/u` (and for other integer types too)? Or is it possible to use `v128.load` to load everything at once and use `i64x2.extend_low_...` and `i64x2.extend_high_...` to extend each part?
> 
> I'm just asking about the possibility and of course don't mean we should do that in this CL ;)
> 
Yes, there is room to further optimize the integer version of this. Right now <4 x i32> to <4 x i64> extending loads are lowered to a pattern of one v128.load followed by some shuffles and vector shifts. That doesn't seem too bad, but it would be better to use the extend_low and extend_high instructions as you suggest.

Even here it might be better to do a v128.load followed by a promote_low, shuffle, and second promote_low rather than doing two v128.load64_zeros. I'll check with the SIMD experts on that...
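
For reference, the integer case being discussed corresponds to IR along these lines (a sketch with a hypothetical name; sext shown, zext would select the _u variants):

  define <4 x i64> @load_extend_v4i64_sketch(<4 x i32>* %p) {
    %e = load <4 x i32>, <4 x i32>* %p
    %v = sext <4 x i32> %e to <4 x i64>
    ret <4 x i64> %v
  }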


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D108496/new/

https://reviews.llvm.org/D108496


