<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">If I understand correctly, on this architecture ‘uniform’ refers to values that occupy only one lane of the register file instead of SIMD-width lanes, and they *share* the same region of the register file as non-uniform values. This is in contrast to e.g. AMDGPU, where SGPRs (scalar GPRs) and VGPRs (vector GPRs) are separate register files.<div class=""><br class=""></div><div class="">If this understanding is correct, you may be able to define uniform and non-uniform registers separately, but make sure that one aliases the other, e.g. so that (if your SIMD width is 16) VGPR 20 overlaps SGPRs 320, 321, …, 335. That way you can have 128 vector registers, 16*128 uniforms, or a mix of the two.</div><div class=""><br class=""></div><div class="">(Maybe some of the AMDGPU maintainers have thoughts?)</div><div class=""><br class=""></div><div class="">—escha<br class=""><div class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Aug 22, 2016, at 8:07 PM, Ruiling Song <<a href="mailto:ruiling.song83@gmail.com" class="">ruiling.song83@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><span style="font-size:14px" class="">Hi </span><span class="" style="font-size:14px;background-color:rgb(255,255,255)">Escha</span><span style="font-size:14px" class="">,</span><div style="font-size:14px" class=""><br class=""></div><div style="font-size:14px" class="">Great to have your comment! Do you have a specific reason for not doing it this way?</div><div style="font-size:14px" class="">I am not sure whether I understand your point correctly. 
For "just model one thread",</div><div style="font-size:14px" class="">do you mean "only considering ONE of the 8/16 working lanes that run in lock-step"?</div><div style="font-size:14px" class=""><br class=""></div><div style="font-size:14px" class="">In my case, maybe I only need to define r0~r127 as i32 registers (each r# is just large enough for a SIMD8 i32).</div><div style="font-size:14px" class="">Then the register allocator never needs to allocate sub-registers and just operates on each register as a whole, right?</div><div style="font-size:14px" class=""><br class=""></div><div style="font-size:14px" class="">Yes, it looks really easy for divergent registers. But I think I would then lose the ability</div><div style="font-size:14px" class="">to allocate uniform registers. Am I right? Is there any way to allocate uniform registers</div><div style="font-size:14px" class="">as well as divergent registers?<div class="gmail_extra"><br class=""></div><div class="gmail_extra">Thanks!</div><div class="gmail_extra">Ruiling</div></div><div class="gmail_extra"><br class=""><div class="gmail_quote">2016-08-23 0:32 GMT+08:00  <span dir="ltr" class=""><<a href="mailto:escha@apple.com" target="_blank" class="">escha@apple.com</a>></span>:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><br class=""><div class=""><span class=""><blockquote type="cite" class=""><div class="">On Aug 22, 2016, at 6:46 AM, Ruiling Song via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank" class="">llvm-dev@lists.llvm.org</a>> wrote:</div><br class=""><div class=""><div dir="ltr" class=""><div class="">Hello Everyone,</div><div class=""><br class=""></div><div class="">I am trying to write a new LLVM backend target for Intel GPUs.<br class="">I would start by targeting the OpenCL language first.</div><div class="">But I am not quite familiar 
with the LLVM backend infrastructure.</div><div class="">I am having trouble describing the RegisterInfo.<br class=""><br class=""></div><div class="">An Intel GPU launches many hardware threads to run a GPGPU workload.</div><div class="">Each hardware thread has 128 registers (r0-r127), each 32 bytes in size.<br class=""></div><div class="">Each hardware thread may run in SIMD 8/16/32 mode, which maps to</div><div class="">8/16/32 OpenCL work-items. The SIMD width is chosen at</div><div class="">compile time (normally according to register pressure; a bigger SIMD width means higher register pressure).</div><div class="">Note that each instruction has its own exec-width, which may not equal the program SIMD width.</div><div class="">Normally we would allocate contiguous registers for a divergent value.</div><div class="">For example, if a program is compiled as SIMD 8, we need to allocate 4 bytes*8=32 bytes</div><div class="">for a divergent float/i32 value. But a 'short'-typed value</div><div class="">only needs 2 bytes*8=16 bytes, which is half of a 32-byte register.<br class=""></div><div class="">We may also allocate 'uniform' values; a uniform value only needs a type-sized register,</div><div class="">not multiplied by the SIMD width. A uniform float/i32 value only needs 4 bytes of a physical register.</div><div class="">Thus a 32-byte register can hold up to 8 different uniform float/i32 values.<br class=""></div></div></div></blockquote><div class=""><br class=""></div></span><div class="">As a GPU backend maintainer, I strongly discourage trying to model the total register bank of the GPU in LLVM. Just model one thread. This will make things much, much easier.</div><span class=""><blockquote type="cite" class=""><div dir="ltr" class=""><div class=""></div></div></blockquote><br class=""></span></div><div class="">—escha</div></div></blockquote></div><br class=""></div></div>

</div></blockquote></div><br class=""></div></div></body></html>