<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Jun 12, 2017, at 17:23, Tom Stellard &lt;<a href="mailto:tstellar@redhat.com" class="">tstellar@redhat.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">On 06/12/2017 08:03 PM, Connor Abbott wrote:</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><blockquote type="cite" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard &lt;<a href="mailto:tstellar@redhat.com" class="">tstellar@redhat.com</a>&gt; wrote:<br class=""><blockquote type="cite" class="">On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:<br class=""><blockquote type="cite" class="">cc some people who have worked on this.<br class=""><br class="">On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:<br class=""><blockquote type="cite" class="">Hi all,<br class=""><br class="">I've been looking into how to implement the more advanced 
Shader Model<br class="">6 reduction operations in radv (and obviously most of the work would<br class="">be useful for radeonsi too). They're explained in the spec for<br class="">GL_AMD_shader_ballot at<br class=""><a href="https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt" class="">https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt</a>,<br class="">but I'll summarize them here. There are two types of operations:<br class="">reductions that always return a uniform value, and prefix scan<br class="">operations. The reductions can be implemented in terms of the prefix<br class="">scan (although in practice I don't think we want to implement them in<br class="">exactly the same way), and the concerns are mostly the same, so I'll<br class="">focus on the prefix scan operations for now. Given an operation `op'<br class="">and an input value `a' (that's really a SIMD array with one value per<br class="">invocation, even though it's a scalar value in LLVM), the prefix scan<br class="">returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]<br class="">`op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also<br class="">work for non-uniform control flow: it simply skips inactive<br class="">invocations.<br class=""><br class="">On the LLVM side, I think that we have most of the AMD-specific<br class="">low-level shuffle intrinsics implemented that you need to do this, but<br class="">I can think of a few concerns/questions. 
First of all, to implement<br class="">the prefix scan, we'll need to do a code sequence that looks like<br class="">this, modified from<br class=""><a href="http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/" class="">http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/</a> (replace<br class="">v_foo_f32 with the appropriate operation):<br class=""><br class="">; v0 is the input register<br class="">v_mov_b32 v1, v0<br class="">v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1<br class="">v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2<br class="">v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3<br class="">v_nop // Add two independent instructions to avoid a data hazard<br class="">v_nop<br class="">v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4<br class="">v_nop // Add two independent instructions to avoid a data hazard<br class="">v_nop<br class="">v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5<br class="">v_nop // Add two independent instructions to avoid a data hazard<br class="">v_nop<br class="">v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6<br class="">v_nop // Add two independent instructions to avoid a data hazard<br class="">v_nop<br class="">v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7<br class=""><br class="">The problem is that the way these instructions use the DPP word isn't<br class="">currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp<br class="">intrinsic, but it isn't enough. For example, take the first<br class="">instruction:<br class=""><br class="">v_foo_f32 v1, v0, v1 row_shr:1<br class=""><br class="">What it's doing is shifting v0 right by one within each row and adding<br class="">it to v1. 
v1 stays the same in the first lane of each row, however.<br class="">With llvm.amdgcn.mov_dpp, we could try to express it as something like<br class="">this, in LLVM-like pseudocode:<br class=""><br class="">%tmp = call llvm.amdgcn.mov_dpp %input row_shr:1<br class="">%result = foo %tmp, %input<br class=""><br class="">but this is incorrect. If I'm reading the source correctly, this will<br class="">make %tmp garbage in lane 0 (since it just turns into a normal move<br class="">with the dpp modifier, and no restrictions on the destination). We<br class="">could set bound_ctrl to 0 to work around this, since it will make %tmp<br class="">0 in lane 0, but that won't work with operations whose identity is<br class="">non-0 like min and max. What we need is something like:<br class=""><br class=""></blockquote></blockquote><br class="">Why is %tmp garbage?  I thought the two options were 0 (bound_ctrl = 0)<br class="">or %input (bound_ctrl = 1)?<br class=""></blockquote><br class="">Oh, maybe it is... for that to happen the underlying move would need<br class="">to have the source and destination constrained to be the same. I<br class="">couldn't see that constraint anywhere I looked, but I'm not an expert,<br class="">so I may have overlooked it. 
In any case, that behavior still isn't<br class="">what we want if we want to implement the prefix scan operations<br class="">efficiently.<br class=""><br class=""></blockquote><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">Ok, I see what you are saying now.  I think the best option here is to</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: 
start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">copy its src operand to dst when bound_ctrl = 1 and it reads from an invalid</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">thread, and then when bound_ctrl=1, lower the intrinsic to a special tied version</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">of V_MOV_B32_dpp where the src and dst are the same register.</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br style="font-family: Helvetica; font-size: 12px; font-style: normal; 
font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">-Tom</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""></div></blockquote><br class=""></div><div>This has come up before: there’s no way to represent the unmodified input register for the inactive lanes. I think the conclusion was that a new intrinsic is needed to represent this case, but I don’t think there was a consensus on what it should look like.</div><div><br class=""></div><div>-Matt</div><br class=""></body></html>