[PATCH] D56118: [ARM]: Add optimized NEON uint64x2_t multiply routine.

Sat Dec 29 17:55:32 PST 2018

easyaspi314 planned changes to this revision.
easyaspi314 marked 3 inline comments as done.
easyaspi314 added a comment.

I am going to add constant interleaving, more tests, and call it done. I think that `twomul` and interleaved loads are much smaller and lower priority optimizations, and something I am not comfortable enough to do myself.

================
Comment at: lib/Target/ARM/ARMISelLowering.cpp:7465
+        KnownBits topKnown = DAG.computeKnownBits(top);
+        KnownBits botKnown = DAG.computeKnownBits(bot);
+
----------------
efriedma wrote:
> Alternatively, we could try to handle simplifications by adding a DAGCombine for VSHRu and VMULLu nodes with constant operands, instead of trying to avoid emitting them.  The end result is roughly the same (it's probably unlikely to construct a trivial VSHRu or VMULLu any other way), but it might be easier to understand.  But I guess it would get tricky if we tried to emit the form using "vrev64".  Probably okay as-is.
> 
> Needs more test coverage for various cases where the upper/lower half an operand is zero.
This is the exact same method used in X86ISelLowering.cpp/LowerMUL, except for the early return to work around missing optimizations later on.

If we use the `twomul`/`vrev64` version, we would only use it if none of them are zero and none could be pre-interleaved. Once things get interleaved and/or we eliminate one of the multiplies, we lose the benefit.

================
Comment at: lib/Target/ARM/ARMTargetTransformInfo.cpp:528
+    { ISD::UMULO, MVT::v2i64, 11},
+    { ISD::SMULO, MVT::v2i64, 11},
   };
----------------
efriedma wrote:
> Did you really mean UMULO (llvm.umul.with.overflow)?  I mean, I guess it's theoretically possible to implement, but I think we'll just scalarize it at the moment.
> 
> (The cost model changes need test coverage in test/Analysis/CostModel/ARM/ .)
Oh yeah, I did that wrong. Should be normal multiply.

================
Comment at: test/CodeGen/ARM/vmul.ll:69
+;CHECK: vld1.64         {d20, d21}, [r0]
+;CHECK; vand            q8, q10, q8
+;CHECK: vmovn.i64       d18, q9
----------------
craig.topper wrote:
> efriedma wrote:
> > DAGCombine should be able to catch the redundant AND... but it looks like DAGCombiner::visitTRUNCATE doesn't try to handle demanded bits for vectors.  (I guess it didn't get updated when other operations got support for vector operands?)
> PR39689 mentions this is disabled for vectors. Maybe @rksimon or @spatel are working on it?
That would explain why it was choking on this when X86 does not.

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D56118/new/

https://reviews.llvm.org/D56118