<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Nov 14, 2014 at 1:09 PM, Sahasrabuddhe, Sameer <span dir="ltr">&lt;<a href="mailto:sameer.sahasrabuddhe@amd.com" target="_blank">sameer.sahasrabuddhe@amd.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":3ze" class="a3s" style="overflow:hidden">1. Update the synchronization <span class="il">scope</span> field in <span class="il">atomic</span> instructions from a<span class=""><br>

   single bit to a wider field, say 32-bit unsigned integer.<br></span></div></blockquote><div><br></div><div>I think this should be an arbitrary-bit-width integer. I think baking any fixed size into this is a mistake unless that size is "1".</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":3ze" class="a3s" style="overflow:hidden"><span class=""></span>

2. Retain the current default of zero as "system <span class="il">scope</span>", replacing the<br>

   current "cross thread" <span class="il">scope</span>.<br></div></blockquote><div><br></div><div>I would suggest address-space scope.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":3ze" class="a3s" style="overflow:hidden">

3. All other values are target-defined.<br></div></blockquote><div><br></div><div>You need to define single-thread scope.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":3ze" class="a3s" style="overflow:hidden">

4. The use of "single thread <span class="il">scope</span>" is not clear.</div></blockquote><div><br></div><div>Consider a signal handler that reads memory written by the very thread the signal is delivered to. Essentially, there may be a need to write code which we know will execute in a single hardware thread, but which must still rule out the compiler optimizations that atomics preclude, because control flow within that hardware thread may move arbitrarily from one sequence of instructions to another.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":3ze" class="a3s" style="overflow:hidden"> If it is required in<span class=""><br>

   target-independent transforms,</span></div></blockquote><div><br></div><div>Yes, it is: see sig_atomic_t.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":3ze" class="a3s" style="overflow:hidden"><span class=""> then it could be encoded as just "1",<br>

   or as "all ones" in the wider field. The latter option is a bit<br>

   weird, because most targets will have very few <span class="il">scopes</span>. But it is<br>

   useful in case the next point is included in LLVM IR.<br></span></div></blockquote><div><br></div><div>If we go with your proposed constraint below, I think it is essential to model single-thread scope as the maximum integer, since it should be a strict subset of all inter-thread scopes.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div id=":3ze" class="a3s" style="overflow:hidden"><span class=""></span>

5. Possibly add the following constraint on <span class="il">memory</span> <span class="il">scopes</span>: "The <span class="il">scope</span><span class=""><br>

   represented by a larger value is nested inside (is a proper subset<br>

   of) the <span class="il">scope</span> represented by a smaller value." This would also imply<br>

   that the value used for single-thread <span class="il">scope</span> must be the largest<br>

   value used by the target.<br>

   This constraint on "nesting" is easily satisfied by HSAIL (and also<br>

   OpenCL), where synchronization <span class="il">scopes</span> increase from a single<br>

   work-item to the entire system. But it is conceivable that other<br>

   targets do not have this constraint. For example, a platform may<br>

   define synchronization <span class="il">scopes</span> in terms of overlapping sets instead<br>

   of proper subsets.<br></span></div></blockquote><div><br></div><div>I think this is the important thing to settle on in the design. I'd really like to hear from a diverse set of vendors and folks operating in the GPU space to understand whether having this constraint is critically important or problematic for any reason.</div><div><br></div><div>I think (unfortunately) it would be hard to add this constraint later...</div><div><br></div><div>-Chandler</div></div></div></div>
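<div dir="ltr"><div><br></div><div>For illustration only, here is a minimal Python sketch of what the nesting constraint in point 5 would buy us. It assumes zero is the widest scope and the largest value in use is single-thread scope. The scope names and the helpers is_nested_in and widest are hypothetical, made up for this example; they are not part of LLVM, HSAIL, or OpenCL.</div>
<pre>
```python
# Hypothetical scope encoding for an imaginary target. The names are
# illustrative assumptions, not taken from LLVM or any real backend.
SCOPES = ("system", "device", "workgroup", "wavefront", "single_thread")

SYSTEM = 0                       # default value: the widest scope
SINGLE_THREAD = len(SCOPES) - 1  # largest value in use: the narrowest scope


def is_nested_in(inner, outer):
    """Return True if scope `inner` equals, or is nested inside, `outer`.

    Under the proposed constraint a larger value is a proper subset of
    every smaller value, so the check is a plain integer comparison.
    """
    return inner >= outer


def widest(a, b):
    """Return whichever of two scopes contains the other (the smaller value)."""
    return min(a, b)


if __name__ == "__main__":
    print(is_nested_in(SINGLE_THREAD, SYSTEM))
    print(SCOPES[widest(2, 3)])
```
</pre>
<div>If a target defined its scopes as overlapping sets instead, neither helper could be an integer comparison; each scope would presumably need an explicit description of what it covers.</div></div>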