[PATCH] D56118: [ARM]: Add optimized NEON uint64x2_t multiply routine.
easyaspi314 (Devin) via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Jan 3 09:29:10 PST 2019
easyaspi314 added a comment.
`vmul.i32 Qd,Qn,Qm` actually takes //4 cycles//, which means `twomul` has the same timing as `ssemul`, 11 cycles.
@efriedma that explains why `twomul` wasn't visibly faster in my tests.
However, `twomul` saves an instruction (7 versus 8), so I am keeping it.
twomul:
vrev64.32 q10, q8 @ 1 cycle, total 1
vmovn.i64 d16, q8 @ 1 cycle, total 2
vmovn.i64 d17, q9 @ 1 cycle, total 3
vmul.i32 q10, q10, q9 @ 4 cycles, total 7
vpaddl.u32 q10, q10 @ 1 cycle, total 8
vshl.i64 q9, q10, #32 @ 1 cycle, total 9
vmlal.u32 q9, d17, d16 @ 2 cycles, total 11
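
To make the arithmetic behind `twomul` easier to follow, here is a minimal scalar sketch in C of what the sequence appears to compute for one 64-bit lane. It assumes `q8` holds the first operand and `q9` the second on entry, which is inferred from the listing rather than stated, and the name `twomul_lane` is hypothetical. The cross products may be truncated to 32 bits (as `vmul.i32` does) because they are shifted left by 32 afterwards, so their upper halves would fall outside the 64-bit result anyway.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of the twomul sequence.
 * Assumption: a is the lane from q8, b is the lane from q9.
 * twomul_lane is an illustrative name, not part of the patch. */
static uint64_t twomul_lane(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a;          /* vmovn.i64 d16, q8 */
    uint32_t a_hi = (uint32_t)(a >> 32);  /* exposed by vrev64.32 q10, q8 */
    uint32_t b_lo = (uint32_t)b;          /* vmovn.i64 d17, q9 */
    uint32_t b_hi = (uint32_t)(b >> 32);

    /* vmul.i32 q10, q10, q9: cross products, low 32 bits only */
    uint32_t cross0 = a_hi * b_lo;
    uint32_t cross1 = a_lo * b_hi;

    /* vpaddl.u32 q10, q10: widen both halves to 64 bits and add */
    uint64_t sum = (uint64_t)cross0 + (uint64_t)cross1;

    /* vshl.i64 q9, q10, #32 */
    uint64_t acc = sum << 32;

    /* vmlal.u32 q9, d17, d16: accumulate the full 64-bit low product */
    acc += (uint64_t)b_lo * (uint64_t)a_lo;
    return acc;
}
```

For any inputs, `twomul_lane(a, b)` should equal `a * b` truncated to 64 bits, which is the result a `uint64x2_t` multiply needs per lane.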
ssemul:
vshrn.i64 d20, q8, #32 @ 1 cycle, total 1
vmovn.i64 d16, q8 @ 1 cycle, total 2
vmovn.i64 d21, q9 @ 1 cycle, total 3
vmull.u32 q11, d21, d20 @ 2 cycles, total 5
vshrn.i64 d17, q9, #32 @ 1 cycle, total 6
vmlal.u32 q11, d17, d16 @ 2 cycles, total 8
vshl.i64 q9, q11, #32 @ 1 cycle, total 9
vmlal.u32 q9, d21, d16 @ 2 cycles, total 11
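
For comparison, a minimal scalar sketch in C of what the `ssemul` sequence appears to compute for one 64-bit lane. As before, it assumes `q8` and `q9` hold the two operands on entry (inferred from the listing), and `ssemul_lane` is a hypothetical name. Here the cross products are formed as full 64-bit values with `vmull.u32`/`vmlal.u32`, and the later shift by 32 discards their upper halves.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of the ssemul sequence.
 * Assumption: a is the lane from q8, b is the lane from q9.
 * ssemul_lane is an illustrative name, not part of the patch. */
static uint64_t ssemul_lane(uint64_t a, uint64_t b)
{
    uint32_t a_hi = (uint32_t)(a >> 32);  /* vshrn.i64 d20, q8, #32 */
    uint32_t a_lo = (uint32_t)a;          /* vmovn.i64 d16, q8 */
    uint32_t b_lo = (uint32_t)b;          /* vmovn.i64 d21, q9 */
    uint32_t b_hi = (uint32_t)(b >> 32);  /* vshrn.i64 d17, q9, #32 */

    /* vmull.u32 q11, d21, d20: first cross product, full 64 bits */
    uint64_t cross = (uint64_t)b_lo * (uint64_t)a_hi;

    /* vmlal.u32 q11, d17, d16: add the second cross product */
    cross += (uint64_t)b_hi * (uint64_t)a_lo;

    /* vshl.i64 q9, q11, #32: keep only the low 32 bits of the sum */
    uint64_t acc = cross << 32;

    /* vmlal.u32 q9, d21, d16: accumulate the full 64-bit low product */
    acc += (uint64_t)b_lo * (uint64_t)a_lo;
    return acc;
}
```

Both sequences reduce to the same identity, `a * b mod 2^64 = a_lo * b_lo + ((a_hi * b_lo + a_lo * b_hi) << 32)`; they differ only in whether the cross terms are truncated before or after being summed.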
Repository:
rL LLVM
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D56118/new/
https://reviews.llvm.org/D56118