[llvm] r286171 - [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies
Mekhanoshin, Stanislav via llvm-commits
llvm-commits at lists.llvm.org
Thu Nov 10 19:36:49 PST 2016
I have reverted the change in rL286530. I still do not see any issue with the code and cannot explain the failure.
Michael, what GPU do you use?
Stas
-----Original Message-----
From: Mekhanoshin, Stanislav
Sent: Thursday, November 10, 2016 2:26 PM
To: Tom Stellard
Cc: Michel Dänzer; llvm-commits at lists.llvm.org; Nicolai Hähnle; Marek Olšák; Matt Arsenault
Subject: RE: [llvm] r286171 - [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies
Well, actually according to the source IR this value is not being modified:
%56 = and i1 %52, %55
%57 = select i1 %56, float -1.000000e+00, float 0.000000e+00
call void @llvm.AMDGPU.kill(float %57)
br label %loop17
loop17: ; preds = %endif21, %main_body
%58 = phi i32 [ 0, %main_body ], [ %62, %endif21 ]
%59 = phi i32 [ 0, %main_body ], [ %62, %endif21 ]
%TEMP3.z.0 = phi float [ 0.000000e+00, %main_body ], [ %61, %endif21 ]
%60 = icmp sgt i32 %58, 9
br i1 %60, label %endloop28, label %endif21
endif21: ; preds = %loop17
%61 = fadd float %TEMP3.z.0, 0x3FB99999A0000000
%62 = add i32 %59, %47
br i1 %56, label %endloop28, label %loop17
This register is the value %56 defined in the entry block and used once inside the loop (last br instruction in the snippet). That said eliminating two COPY instructions shall be valid and in fact directly corresponds the input IR: just use the pair of SGPRs as defined in the entry block.
I.e. the reason for failure is still unclear.
I think I have to revert the change until the explanation is found.
Stas
-----Original Message-----
From: Mekhanoshin, Stanislav
Sent: Thursday, November 10, 2016 2:01 PM
To: 'Tom Stellard' <tom at stellard.net>
Cc: Michel Dänzer <michel at daenzer.net>; llvm-commits at lists.llvm.org; Nicolai Hähnle <nhaehnle at gmail.com>; Marek Olšák <maraeo at gmail.com>; Matt Arsenault <arsenm2 at gmail.com>
Subject: RE: [llvm] r286171 - [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies
You are probably right... At least it looks like the only logical explanation. I though thou that v_cmp instruction would keep bits for inactive lanes, but I cannot find a definitive answer.
Stas
-----Original Message-----
From: Tom Stellard [mailto:tom at stellard.net]
Sent: Thursday, November 10, 2016 1:54 PM
To: Mekhanoshin, Stanislav <Stanislav.Mekhanoshin at amd.com>
Cc: Michel Dänzer <michel at daenzer.net>; llvm-commits at lists.llvm.org; Nicolai Hähnle <nhaehnle at gmail.com>; Marek Olšák <maraeo at gmail.com>; Matt Arsenault <arsenm2 at gmail.com>
Subject: Re: [llvm] r286171 - [AMDGPU] Allow hoisting of comparisons out of a loop and eliminate condition copies
On Thu, Nov 10, 2016 at 07:15:44AM +0000, Mekhanoshin, Stanislav wrote:
> Thank you. In this case v2 is used instead of v1 and do not contribute to export... I'm still puzzled what can go wrong with this code.
>
Good asm is on the left and bad asm is on the right:
s_and_b64 s[0:1], s[0:1], vcc s_and_b64 s[0:1],s[0:1], vcc
v_cndmask_b32_e64 v1, 0, -1, s[0:1] <
v_mov_b32_e32 v2, 0 v_mov_b32_e32 v2, 0
v_cndmask_b32_e64 v3, 0, -1.0, s[0:1] | v_cndmask_b32_e64 v1, 0, -1.0, s[0:1]
v_cmpx_le_f32_e32 vcc, 0, v3 | v_cmpx_le_f32_e32 vcc, 0, v1
s_cbranch_execnz BB0_2 s_cbranch_execnz BB0_2
exp 0, 9, 0, 1, 1, v0, v0, v0, v0 exp 0, 9, 0, 1, 1, v0, v0, v0, v0
s_endpgm s_endpgm
v_sub_i32_e32 v3, vcc, 0, v0 | v_sub_i32_e32 v1, vcc, 0, v0
s_mov_b64 s[2:3], 0 s_mov_b64 s[2:3], 0
v_add_i32_e32 v3, vcc, v3, v0 | v_add_i32_e32 v1, vcc, v1, v0
v_cmp_ne_u32_e64 s[0:1], 0, v1 | v_cmp_gt_i32_e32 vcc, 10, v1
v_cmp_gt_i32_e32 vcc, 10, v3 <
The value of s[0:1] at the end of this block is the difference between the good and bad asm. This is because the output of v_cmp instructions is dependent on exec, so this sequence isn't doing a true copy of s[0:1] because the value of exex is modified in the middle:
v_cndmask_b32_e64 v1, 0, -1, s[0:1]
...
v_cmpx_le_f32_e32 vcc, 0, v3
...
v_cmp_ne_u32_e64 s[0:1], 0, v1
I think the fix is to only do the forward propagation if exec isn't being modified between between the def and use of the sgpr.
-Tom
> Stas
>
> -----Original Message-----
> From: Michel Dänzer [mailto:michel at daenzer.net]
> Sent: Wednesday, November 09, 2016 10:56 PM
> To: Mekhanoshin, Stanislav
> Cc: llvm-commits at lists.llvm.org; Nicolai Hähnle; Marek Olšák; Matt
> Arsenault; Tom Stellard
> Subject: Re: [llvm] r286171 - [AMDGPU] Allow hoisting of comparisons
> out of a loop and eliminate condition copies
>
> On 10/11/16 03:36 PM, Mekhanoshin, Stanislav wrote:
> > I would really appreciate if someone familiar with GLSL could explain the problem. Here is the only difference in code produced:
> >
> > 303d302
> > < v_cndmask_b32_e64 v1, 0, -1, s[0:1] ; D2000001 00018280
> > 305,306c304,305
> > < v_cndmask_b32_e64 v3, 0, -1.0, s[0:1] ; D2000003 0001E680
> > < v_cmpx_le_f32_e32 vcc, 0, v3 ; 7C260680
> > ---
> >> v_cndmask_b32_e64 v1, 0, -1.0, s[0:1] ; D2000001 0001E680
> >> v_cmpx_le_f32_e32 vcc, 0, v1 ; 7C260280
> > 310c309
> > < v_sub_i32_e32 v3, vcc, 0, v0 ; 4C060080
> > ---
> >> v_sub_i32_e32 v1, vcc, 0, v0 ; 4C020080
> > 312,314c311,312
> > < v_add_i32_e32 v3, vcc, v3, v0 ; 4A060103
> > < v_cmp_ne_u32_e64 s[0:1], 0, v1 ; D18A0000 00020280
> > < v_cmp_gt_i32_e32 vcc, 10, v3 ; 7D08068A
> > ---
> >> v_add_i32_e32 v1, vcc, v1, v0 ; 4A020101
> >> v_cmp_gt_i32_e32 vcc, 10, v1 ; 7D08028A
> >
> > So, what happens is this instruction is removed:
> >
> > v_cndmask_b32_e64 v1, 0, -1, s[0:1]
> >
> > it copies s[0:1] into v1 for the lane.
> >
> > Then the instruction which restores s[0:1] from v1 is also removed:
> >
> > v_cmp_ne_u32_e64 s[0:1], 0, v1
> >
> > Neither s0 nor s1 are written in between and anywhere after this point. Since v1 is now free other modified instructions use v1 instead of v3, which again does not seem to be an issue for me.
> > The only difference I can see is the contents of v1 and v3 upon kernel termination in case if discard is called... Is there anything is GLSL ABI which requires v1 and v3 to hold specific values on exit? My question comes from this epilogue on non-discard return:
> >
> > v_mov_b32_e32 v0, 0
> > v_mov_b32_e32 v1, 0
> > v_mov_b32_e32 v3, 0
> > v_mov_b32_e32 v13, v15
> > ; return
> >
> > Then this piece of code does not call s_endpgm as well. I also do not see branch target BB0_2 used in the generated code and generally have a suspicion this is just a part of a bigger kernel (based on the absence of s_endpgm at the end). I.e. there can be potentially a problem if this code is just inserted somewhere in a bigger context, not visible to the compiler.
>
> The dumps were from radeonsi's default mode using separate shader prologues and epilogues. Attached to this e-mail are the corresponding dumps using monolithic shaders. As you can see, v1 contributes to the colour value exported from the pixel shader.
>
>
> --
> Earthling Michel Dänzer | http://www.amd.com
> Libre software enthusiast | Mesa and X developer
More information about the llvm-commits
mailing list