[PATCH] D46179: [X86] Lowering addus/subus intrinsics to native IR (LLVM part)

Steven Johnson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 23 18:39:19 PDT 2018


srj added a comment.

I'm trying to update Halide to generate IR that will be recognized by this patch (instead of calling the now-deprecated intrinsics), but I'm having trouble with a somewhat degenerate-but-legal case.

If user code specifies a non-native vector width (e.g., paddusb with 8 lanes on SSE2, instead of the native width of 16 lanes), our code handles this by loading at the requested size and then widening the vectors to the native size. So our code formerly emitted something like:

  ; Do a saturating unsigned subtract on two <8 x i8> vectors,
  ; then widen to an <8 x i32> result
  %20 = load <8 x i8>
  %21 = load <8 x i8>
  %22 = shufflevector <8 x i8> %20, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %23 = shufflevector <8 x i8> %21, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %24 = call <16 x i8> @llvm.x86.sse2.psubus.b(<16 x i8> %22, <16 x i8> %23) #5
  %25 = shufflevector <16 x i8> %24, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

To work with this patch, I revised our code to emit inline IR that should pattern-match properly (based on the patch's new IR tests), something like:

  %20 = load <8 x i8>
  %21 = load <8 x i8>
  %22 = shufflevector <8 x i8> %20, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %23 = shufflevector <8 x i8> %21, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  ; Here's the inline pattern that should match paddusb
  %24 = add <16 x i8> %22, %23
  %25 = icmp ugt <16 x i8> %22, %24
  %26 = select <16 x i1> %25, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>, <16 x i8> %24
  ; (end of the paddusb pattern; narrow back to 8 lanes)
  %27 = shufflevector <16 x i8> %26, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

And, in fact, if I don't use any optimizer passes, this works perfectly. Unfortunately, the LLVM optimizer passes can rearrange it, e.g. into something like this:

  %20 = load <8 x i8>
  %21 = load <8 x i8>
  %22 = add <8 x i8> %20, %21
  %23 = shufflevector <8 x i8> %22, <8 x i8> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %24 = icmp ult <8 x i8> %22, %20
  %25 = shufflevector <8 x i1> %24, <8 x i1> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %26 = select <16 x i1> %25, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef>, <16 x i8> %23
  %27 = shufflevector <16 x i8> %26, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

...which no longer gets recognized as a pattern that produces paddusb, since the select no longer refers directly to the result of the compare, but rather to an intermediate shuffle (see the sketch below).

Disabling all of our optimizer passes "fixes" this, but that's obviously not a viable solution. Could this pattern matching be made more robust?
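To spell out the structural difference, here is a minimal sketch (operand names are illustrative, not taken from the real module): in the form that currently gets matched, the select's condition is the icmp itself, while in the post-optimization form the condition is a shufflevector that widens an <8 x i1> mask:

  ; Recognized shape: the select condition is the icmp result
  %sum = add <16 x i8> %a, %b
  %cmp = icmp ugt <16 x i8> %a, %sum
  %sat = select <16 x i1> %cmp, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>, <16 x i8> %sum

  ; Post-optimization shape: the compare is done at 8 lanes and widened,
  ; so the select condition is a shufflevector rather than the icmp
  %cmp.narrow = icmp ugt <8 x i8> %a.narrow, %sum.narrow
  %cmp.wide = shufflevector <8 x i1> %cmp.narrow, <8 x i1> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %sat.wide = select <16 x i1> %cmp.wide, <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>, <16 x i8> %sum.wide

Presumably the matcher would need to look through the widening shuffle on the i1 mask (and the corresponding shuffles on the i8 operands) to catch the second shape.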


Repository:
  rL LLVM

https://reviews.llvm.org/D46179




