<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Aug 22, 2016, at 6:46 AM, Ruiling Song via llvm-dev &lt;<a href="mailto:llvm-dev@lists.llvm.org" class="">llvm-dev@lists.llvm.org</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="">Hello Everyone,</div><div class=""><br class=""></div><div class="">I am trying to write a new LLVM backend target for Intel GPUs.<br class="">I would start by targeting OpenCL first.</div><div class="">But I am not very familiar with the LLVM backend infrastructure.</div><div class="">I am having trouble describing the RegisterInfo.<br class=""><br class=""></div><div class="">An Intel GPU launches many hardware threads to run a GPGPU workload.</div><div class="">Each hardware thread has 128 registers (r0-r127), each 32 bytes in size.<br class=""></div><div class="">Each hardware thread may run in SIMD 8/16/32 mode, which maps to</div><div class="">8/16/32 OpenCL work-items. The SIMD width is chosen at</div><div class="">compile time (normally according to register pressure; a bigger SIMD width means higher register pressure).</div><div class="">Note that each instruction has its own exec-width, which may not be equal to the program's SIMD width.</div><div class="">Normally we would allocate contiguous registers for a divergent value.</div><div class="">For example, if a program is compiled as SIMD 8, we need to allocate a 4 bytes * 8 = 32-byte</div><div class="">value for a divergent float/i32 value. 
But if there is a 'short type' value,</div><div class="">it only needs 2 bytes * 8 = 16 bytes, which is half of a 32-byte register.<br class=""></div><div class="">We may also allocate registers for 'uniform' values. A uniform value only needs a type-sized register,</div><div class="">without multiplying by the SIMD width. A uniform float/i32 value only needs 4 bytes of a physical register.</div><div class="">Thus a 32-byte register can hold up to 8 different uniform float/i32 values.<br class=""></div></div></div></blockquote><div><br class=""></div><div>As a GPU backend maintainer, I strongly discourage trying to model the total register bank of the GPU in LLVM. Just model one thread. This will make things much, much easier.</div><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><br class="">Sometimes we also need to access registers in a strided way. For a bitcast from i64 to v2i32, for example,<br class="">we need to access the i64 register with a horizontal stride of 2.</div><div class="">In the example below, the i64 value is held in r10 and r11. 
L/H stands for the low/high 32 bits.</div><div class="">The program's SIMD width is 8, so we have 8 pairs of L/H.<br class="">r10: L H L H L H L H<br class="">r11: L H L H L H L H</div><div class="">The two instructions below extract the low and high 32-bit parts.<br class="">mov(8 | M0) r12.0&lt;1&gt;, r10.0&lt;8,4,2&gt;:D</div><div class="">mov(8 | M0) r13.0&lt;1&gt;, r10.1&lt;8,4,2&gt;:D</div><div class="">(The format of a register region is RegNum.regSubNum&lt;vertStride, width, horzStride&gt;:type)</div><div class="">(Note that regSubNum is measured in units of the register type here.)</div><div class="">Then r12/r13 contain the resulting vector components.</div><div class="">You can refer to the link below for more details on Intel GPU assembly and register usage:</div><div class=""><div class=""><a href="https://software.intel.com/en-us/articles/introduction-to-gen-assembly" target="_blank" class="">https://software.intel.com/en-<wbr class="">us/articles/introduction-to-<wbr class="">gen-assembly</a></div></div></div></div></blockquote><br class=""></div><div>—escha</div></body></html>