[PATCH] D33572: [PPC] Implement fast bit reverse in PPCDAGToDAGISel

Hal Finkel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue May 30 19:28:12 PDT 2017


hfinkel requested changes to this revision.
hfinkel added a comment.
This revision now requires changes to proceed.

I'd rather not address the problem this way. Can we canonicalize the code sequence at the IR level into the @llvm.bitreverse intrinsic, and then match the intrinsic efficiently in the backend (preferably in TableGen, where complicated patterns are more succinct to write)? This will give us the additional advantage of good lowering for @llvm.bitreverse and the opportunity for IR-level optimizations to deal with the canonical representation. What the bit permutation selector is doing should be relatively easy to replicate at the IR level.

The code sequence is pretty standard, and I'd love to generalize it, but it is not entirely clear how to usefully suggest doing so in this framework. It is doing the bit reversal by first interchanging adjacent bits, then adjacent bit pairs, etc. Straight out of Hacker's Delight :-) We can do this without a completely-serial dependency chain only because of the complete symmetry of the reversal operation. The bit-permutation selector could certainly recognize "partial reversals", and integrate using a sequence like this for those parts of the overall permutation. I'm not sure how worthwhile this would be.
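For reference, a minimal C sketch of the swap sequence described above: adjacent bits first, then bit pairs, nibbles, bytes, and finally the two halfwords. The function name reverse_bits32 is purely illustrative and does not come from the patch. Within each step the two masked halves are independent of each other, which is why the chain is not completely serial.

```c
#include <stdint.h>

/* Illustrative only: the classic 32-bit reversal by successive swaps.
   Each line swaps neighbouring groups of 1, 2, 4, 8 and 16 bits. */
uint32_t reverse_bits32(uint32_t x)
{
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  /* adjacent bits  */
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  /* bit pairs      */
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  /* nibbles        */
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  /* bytes          */
    x = (x >> 16) | (x << 16);                                /* halfwords      */
    return x;
}
```
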

There is also a larger issue potentially worth discussing. The code we currently produce looks like this:

  	rlwinm 4, 3, 1, 0, 31 
  	rlwimi 4, 3, 3, 30, 30
  	rlwimi 4, 3, 5, 29, 29
  	rlwimi 4, 3, 7, 28, 28
  	rlwimi 4, 3, 9, 27, 27
  	rlwimi 4, 3, 11, 26, 26
  	rlwimi 4, 3, 13, 25, 25
  	rlwimi 4, 3, 15, 24, 24
  	rlwimi 4, 3, 17, 23, 23
  	rlwimi 4, 3, 19, 22, 22
  	rlwimi 4, 3, 21, 21, 21
  	rlwimi 4, 3, 23, 20, 20
  	rlwimi 4, 3, 25, 19, 19
  	rlwimi 4, 3, 27, 18, 18
  	rlwimi 4, 3, 29, 17, 17
  	rlwimi 4, 3, 31, 16, 16
  	rlwimi 4, 3, 3, 14, 14
  	rlwimi 4, 3, 5, 13, 13
  	rlwimi 4, 3, 7, 12, 12
  	rlwimi 4, 3, 9, 11, 11
  	rlwimi 4, 3, 11, 10, 10
  	rlwimi 4, 3, 13, 9, 9
  	rlwimi 4, 3, 15, 8, 8
  	rlwimi 4, 3, 17, 7, 7
  	rlwimi 4, 3, 19, 6, 6
  	rlwimi 4, 3, 21, 5, 5
  	rlwimi 4, 3, 23, 4, 4
  	rlwimi 4, 3, 25, 3, 3
  	rlwimi 4, 3, 27, 2, 2
  	rlwimi 4, 3, 29, 1, 1
  	rlwimi 4, 3, 31, 0, 0
  	mr 3, 4
  	blr

and that's one large dependency chain (each instruction updating r4). It also clearly does not need to be that way. We could insert the reversed bits into n registers, as they're all independent, and then 'or' the results together at the end. In this way, we could create lots of independent streams of computation. Have you experimented with whether this is faster than the original sequence on the P8? If it is, then I'll partially take back what I said about putting a pattern in TableGen, and recommend that we implement dependency-chain splitting in the bit-permutation selector (and rank options by taking throughput into account instead of just counting instructions), or alternatively, implement dependency-chain splitting in the MachineCombiner.
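To make the chain-splitting idea concrete, here is a rough C model, not backend code. The helpers rotl32, ppc_mask and rlwimi are hypothetical names that mimic the 32-bit rotate-and-insert semantics with IBM bit numbering (bit 0 is the most significant bit). The choice of four accumulators is arbitrary. The partial results are combined with a bitwise OR, which works because each accumulator starts at zero and only ever receives its own bit positions. This says nothing about actual performance; that still needs measuring.

```c
#include <stdint.h>

/* Rotate left by n (0..31). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Mask with IBM-numbered bits mb..me set; this sketch assumes mb <= me. */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
    return (0xFFFFFFFFu >> mb) & (0xFFFFFFFFu << (31u - me));
}

/* Model of rlwimi: rotate rs left by sh, insert under mask into ra. */
static uint32_t rlwimi(uint32_t ra, uint32_t rs, unsigned sh,
                       unsigned mb, unsigned me)
{
    uint32_t m = ppc_mask(mb, me);
    return (ra & ~m) | (rotl32(rs, sh) & m);
}

/* Serial form: every insert reads and writes the same register, as in the
   listing above. The initial rotate by 1 already places bits 31 and 15. */
uint32_t reverse_serial(uint32_t r3)
{
    uint32_t r4 = rotl32(r3, 1);                /* rlwinm 4, 3, 1, 0, 31 */
    for (int b = 30; b >= 0; --b) {
        if (b == 15)
            continue;                           /* covered by the rotate */
        unsigned sh = (2u * (31u - (unsigned)b) + 1u) & 31u;
        r4 = rlwimi(r4, r3, sh, (unsigned)b, (unsigned)b);
    }
    return r4;
}

/* Split form: four independent accumulators, combined by OR at the end. */
uint32_t reverse_split(uint32_t r3)
{
    uint32_t acc[4] = {0, 0, 0, 0};
    for (unsigned b = 0; b < 32; ++b) {
        unsigned sh = (2u * (31u - b) + 1u) & 31u;
        acc[b & 3u] = rlwimi(acc[b & 3u], r3, sh, b, b);
    }
    return (acc[0] | acc[1]) | (acc[2] | acc[3]);
}
```

In this model each accumulator's chain is a quarter of the original length, at the cost of three extra OR operations to merge the results.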



================
Comment at: test/CodeGen/PowerPC/pr33093.ll:37
+; CHECK-LABEL: @ReverseBits
+; CHECK:       and
+; CHECK:       and
----------------
Please check for the desired sequence here, including regex-recognized operands. Same below.


https://reviews.llvm.org/D33572




