[PATCH] D55444: AMDGPU: Fix DPP combiner

Thu Jan 10 05:34:17 PST 2019

cwabbott added a comment.

Sorry, I just got back from break this week. I've run CTS with the pass enabled, and it now passes, although it seems most of the patterns we use don't get folded. Firstly AND, XOR, unsigned max, and unsigned min are most troubling, since the code that gets generated looks like it should be optimized:

  	v_mov_b32_e32 v8, -1                                          ; 7E1002C1
  	s_nop 1                                                       ; BF800001
  	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202

as well as

  	v_mov_b32_e32 v8, 0                                                         ; 7E100280
  	s_nop 1                                                                     ; BF800001
  	v_mov_b32_dpp v8, v2  row_shr:8 row_mask:0xf bank_mask:0xc                  ; 7E1002FA FC011802
  	v_max_u32_e32 v2, v2, v8                                                    ; 1E041102

and

  	v_mov_b32_e32 v8, -1                                          ; 7E1002C1
  	s_nop 1                                                       ; BF800001
  	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
  	v_min_u32_e32 v2, v2, v8                                      ; 1C041102

and finally

  	v_mov_b32_e32 v8, 0                                                         ; 7E100280
  	s_nop 1                                                                     ; BF800001
  	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf               ; 7E1002FA AF014202
  	v_xor_b32_e32 v2, v2, v8                                                    ; 2A041102

Anyways, the other cases look like maybe some other clever optimization for the immediate is hindering this one, for example this with signed minimum:

  	v_bfrev_b32_e32 v8, -2                                        ; 7E1058C2
  	s_nop 1                                                       ; BF800001
  	v_mov_b32_dpp v8, v2  row_bcast:15 row_mask:0xa bank_mask:0xf ; 7E1002FA AF014202
  	v_min_i32_e32 v2, v2, v8                                      ; 18041102

Maybe this pass needs to be moved earlier in the pipeline?

I'll take a more detailed look soon. But I'd also like to say that I'm a little worried that there are currently no integration tests that use DPP besides the Vulkan CTS, so there is currently no way for anyone working on this from the ROCm/compute side to test this pass properly, which means that none of you can work on this with any confidence. I think this exchange has clearly shown that unit tests aren't enough, since they only show that the code does what you think it does, not whether what you think it does is correct :) Maybe we should fix the atomic optimizer first so that it does something similar to what radv and AMDVLK do, to make it easier to test? Or maybe you just need to go write some tests.

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55444/new/

https://reviews.llvm.org/D55444