[PATCH] D56118: [ARM]: Add optimized NEON uint64x2_t multiply routine.

Thu Jan 3 09:29:10 PST 2019

easyaspi314 added a comment.

`vmul.i32 Qd,Qn,Qm` actually takes //4 cycles//, which means `twomul` has the same timing as `ssemul`, 11 cycles. 
@efriedma that explains why `twomul` wasn't visibly faster in my tests.

However, `twomul` uses one fewer instruction (7 vs. 8), so I am keeping it.

  twomul:
  	vrev64.32	q10, q8        @ 1 cycle,  total 1
  	vmovn.i64	d16, q8        @ 1 cycle,  total 2
  	vmovn.i64	d17, q9        @ 1 cycle,  total 3
  	vmul.i32	q10, q10, q9   @ 4 cycles, total 7
  	vpaddl.u32	q10, q10       @ 1 cycle,  total 8
  	vshl.i64	q9, q10, #32   @ 1 cycle,  total 9
  	vmlal.u32	q9, d17, d16   @ 2 cycles, total 11

  ssemul:
  	vshrn.i64       d20, q8, #32   @ 1 cycle,  total 1
  	vmovn.i64       d16, q8        @ 1 cycle,  total 2
  	vmovn.i64       d21, q9        @ 1 cycle,  total 3
  	vmull.u32       q11, d21, d20  @ 2 cycles, total 5
  	vshrn.i64       d17, q9, #32   @ 1 cycle,  total 6
  	vmlal.u32       q11, d17, d16  @ 2 cycles, total 8
  	vshl.i64        q9, q11, #32   @ 1 cycle,  total 9
  	vmlal.u32       q9, d21, d16   @ 2 cycles, total 11
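
For readers following the two listings, here is a rough scalar model in C of what each sequence computes for a single 64-bit lane. Both rely on the identity that the low 64 bits of `a * b` equal `a_lo*b_lo + ((a_hi*b_lo + a_lo*b_hi) << 32)`. The function names `twomul_lane`, `ssemul_lane`, and `check_lane` are made up for this sketch and are not part of the patch. The mapping of C statements to instructions is my reading of the assembly above, not generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of `twomul` (hypothetical helper, not in the patch). */
static uint64_t twomul_lane(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    /* vrev64.32 + vmul.i32: two truncating 32-bit cross products. */
    uint32_t cross0 = a_hi * b_lo;
    uint32_t cross1 = a_lo * b_hi;

    /* vpaddl.u32: widen and add the pair into one 64-bit value. */
    uint64_t cross = (uint64_t)cross0 + cross1;

    /* vshl.i64 #32, then vmlal.u32 adds the full 64-bit low product. */
    return (cross << 32) + (uint64_t)a_lo * b_lo;
}

/* Scalar model of one lane of `ssemul` (hypothetical helper, not in the patch). */
static uint64_t ssemul_lane(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    /* vmull.u32 then vmlal.u32: widening cross products accumulated in 64 bits. */
    uint64_t cross = (uint64_t)b_lo * a_hi;
    cross += (uint64_t)b_hi * a_lo;

    /* vshl.i64 #32, then vmlal.u32 adds the full 64-bit low product. */
    return (cross << 32) + (uint64_t)b_lo * a_lo;
}

/* Both models should match a native wrapping 64-bit multiply. */
static void check_lane(uint64_t a, uint64_t b)
{
    assert(twomul_lane(a, b) == a * b);
    assert(ssemul_lane(a, b) == a * b);
}
```

The only arithmetic difference is where truncation happens: `twomul` truncates each cross product to 32 bits before adding, while `ssemul` keeps them at 64 bits until the shift. The shift by 32 discards the upper bits either way, so the results agree.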

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D56118/new/

https://reviews.llvm.org/D56118