<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">Multiple people are working in this area, more work is expected, and I’d like to make sure we’re all working toward a common end goal.<div class=""><br class=""></div><div class="">I tried to collect the use-cases for run-time memory checks and the specific memchecks required for each:</div><div class=""><br class=""></div><div class="">1. Loop Vectorizer: each memory access is checked against all other memory accesses in the loop (except read vs. read).</div><div class="">2. Loop Distribution: only memory accesses in different partitions are checked against each other.  The loop vectorizer will add its own checks for the vectorizable distributed loops.</div><div class="">3. Loop Fusion: accesses from the to-be-merged loops should be checked against each other.</div><div class="">4. LICM: when hoisting a load, the stores need to be checked.  When sinking a store, all accesses need to be checked.</div><div class="">5. Load-elimination in GVN: all *intervening* stores need to be checked.</div><div class="">6. Instruction scheduling for in-order targets: same as 5.</div><div class=""><br class=""></div><div class="">Currently only the first two are implemented.  Ashutosh has a pending review for versioning for LICM as an independent pass.  I am also working on loop-aware load-elimination requiring memchecks (item 5 in the list above).</div><div class=""><br class=""></div><div class="">The two main approaches differ in whether the versioning is done in a separate pass or locally in the passes that benefit from it.  I tried to collect the pros and cons of each.</div><div class=""><br class=""></div><div class="">1. 
Separate pass</div><div class=""><br class=""></div><div class="">The benefit of this approach is that current passes would not have to be modified to take advantage of the new freedom to due more independence of the memory access operations.  AA will capture the noalias annotation of inserted by this pass and present it to the passes.</div><div class=""><br class=""></div><div class="">Memchecks present an overhead at run time so one question is how we ensure that any optimization will amortize the cost of these checks.</div><div class=""><br class=""></div><div class="">Depending on the optimization, answering this could be pretty involved (i.e. almost like running the pass itself).  Consider the loop vectorizer.  In order to answer whether versioning the loop would make it vectorizable, you’d have to run most of the legality and profitability logic from the pass.</div><div class=""><br class=""></div><div class="">Also which accesses do we need to check?  Depending on the optimization we may not need to check each access against each other access, which being quadratic can be a significant difference.</div><div class=""><br class=""></div><div class="">2. Each pass performs its own versioning</div><div class=""><br class=""></div><div class="">Under this approach, each pass would make the calculation locally whether the benefit of versioning outweighs the overhead of the checks.  The current Loop Vectorizer is a good example for this.  It effectively assumes no may-alias and if the number of required checks are under a certain threshold it assumes that the vectorization gain will outweigh the cost of the checks.</div><div class=""><br class=""></div><div class="">Making decision locally is not ideal in this approach.  I.e. 
if we can amortize the cost of the same checks only with a combination of optimizations from *multiple* passes, no individual pass would decide locally to version.</div><div class=""><br class=""></div><div class="">Also, it’s probably beneficial to version the loop only once, even if multiple passes would like to check different accesses.  E.g., rather than:</div><div class=""><br class=""></div><div class=""><font face="Courier New" class="">    Checks_1</font></div><div class=""><font face="Courier New" class="">    /       \</font></div><div class=""><font face="Courier New" class="">   /         \</font></div><div class=""><font face="Courier New" class="">OrigLoop    Checks_2</font></div><div class=""><font face="Courier New" class="">   \         /  \</font></div><div class=""><font face="Courier New" class="">    \       /    \</font></div><div class=""><font face="Courier New" class="">     \ NoAlias_1 NoAlias_2</font></div><div class=""><font face="Courier New" class="">      \    |    /</font></div><div class=""><font face="Courier New" class="">       \   |   /</font></div><div class=""><font face="Courier New" class="">        \  |  /</font></div><div class=""><span style="font-family: 'Courier New';" class="">          Join</span></div><div class=""><span style="font-family: 'Courier New';" class=""><br class=""></span></div><div class="">we would have:</div><div class=""><br class=""></div><div class=""><font face="Courier New" class="">   Checks_1+Checks_2</font></div><div class=""><font face="Courier New" class="">      /      \</font></div><div class=""><font face="Courier New" class="">     /        \</font></div><div class=""><font face="Courier New" class="">  OrigLoop   NoAlias_2</font></div><div class=""><font face="Courier New" class="">     \        /</font></div><div class=""><font face="Courier New" class="">      \      /</font></div><div class=""><font face="Courier New" class="">       \    /</font></div><div class=""><font 
face="Courier New" class="">        Join</font></div><div class=""><font face="Courier New" class=""><br class=""></font></div><div class="">This is effectively creating a fast-path and a slow-path version of the loop.  We would probably need some metadata annotation so that subsequent passes could amend the same checking block.</div><div class=""><br class=""></div><div class="">3. There are some more futuristic ideas like to always version fully, disambiguating all accesses and then have the optimizers transform both the versions of the loop.  Then a later pass would decide which checks were necessary and what additional optimizations were done in the speculative version of the loop .  Then finally make the cost decision of whether to keep the speculative version along with the checks or remove them.  This would probably need a fairly sophisticated set of metadata.</div><div class=""><br class=""></div><div class="">I think that a combination of 1 and 2 makes sense.  This is where we will effectively end up after Ashutosh’s patch.  This would hopefully give us the best of both worlds: the aggressiveness/precision of making the call locally in complex passes and the option of simplicity/orthogonality of the separate pass.</div><div class=""><br class=""></div><div class="">Thoughts?</div><div class=""><br class=""></div><div class="">Adam</div><div class=""><br class=""></div></body></html>