<div dir="ltr"><div dir="ltr">Hi Alexey, Hal, and James,<div><br></div><div>Please see my response inline below:</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jun 29, 2019 at 8:30 AM Alexey Bataev <<a href="mailto:a.bataev@hotmail.com">a.bataev@hotmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="auto">
Hi Hal,<br>
<br>
<div id="gmail-m_8545407765551051713AppleMailSignature" dir="ltr">Best regards,
<div>Alexey Bataev</div>
</div>
<div dir="ltr"><br>
On Jun 28, 2019, at 23:46, Finkel, Hal J. via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br>
<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<p>Hi, Alexey, Lingda,</p>
<p>I haven't been following this closely, so a few questions/comments:</p>
<p> 1. Recursive mappers are not supported in OpenMP 5, but do we expect that to change in the future?</p>
</div>
</blockquote>
Good question. Do not know, actually, but I think both of those schemes can be adapted to support recursive mappers.<br></div></blockquote><div><br></div><div>I agree. It will be trivial to support recursive mappers within the framework of these schemes if needed in the future. In the case of recursive mappers, mapper functions won't be able to be fully inlined in scheme 1, so compiler optimization may be limited.</div>
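<div><br></div><div>To make the recursive case concrete, here is a rough sketch of what such a mapper could look like (hypothetical type; not valid under OpenMP 5.0, which forbids recursive mappers, and shown only to illustrate what either scheme would have to handle if this were allowed as an extension):</div>
<pre>// Hypothetical illustration only: recursive mappers are not allowed in
// OpenMP 5.0, so this only shows the shape an extension would need to handle.
struct node {
  int payload;
  struct node *next;
};

// Mapping n.next[0:1] would apply this same mapper to the next element,
// so the mapper function could not be fully inlined and the total number
// of mapped components could not be computed statically.
#pragma omp declare mapper(struct node n) map(n, n.next[0:1])
</pre>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">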
<blockquote type="cite">
<div dir="ltr">
<p> 2. Our experience so far suggests that the most important optimization in this space is to limit the number of distinct host-to-device transfers (or data copies) on systems where data needs to be copied. In these schemes, where does that coalescing occur?</p>
</div>
</blockquote>
<div>In both schemes we transfer the data only once: we gather all the required data mapping info first, and after that we transfer it to the device at once. The only difference between these schemes is the number of runtime function calls required to fill this mapping data.</div></div></blockquote><div><br></div><div>Both schemes can do such coalescing in the runtime after all mapping information is collected. Scheme 1 could also do such coalescing during compiler optimization of the mapper function, although that would be hard to do.</div>
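<div><br></div><div>As a rough illustration of the runtime-side coalescing (all names invented for illustration; this is not the actual libomptarget interface), once the full component list has been collected the runtime could merge contiguous regions and issue one transfer per merged range:</div>
<pre>/* Hypothetical sketch with invented names, not the real libomptarget API:
 * the mapper walk (however it is generated) fills comp[] first, and only
 * afterwards are coalesced host-to-device transfers issued. */
struct MapComponent { void *base; void *begin; long long size; long long type; };

void transfer_region(void *host_begin, long long size);   /* hypothetical */

void coalesce_and_transfer(struct MapComponent *comp, int n) {
  /* Assumes comp[] is sorted by begin address.  Adjacent or overlapping
   * regions are merged so each merged range costs one transfer instead of
   * one transfer per component. */
  int i = 0;
  while (i != n) {
    char *lo = (char *)comp[i].begin;
    char *hi = lo + comp[i].size;
    int j = i + 1;
    while (j != n) {
      char *b = (char *)comp[j].begin;
      if (b > hi)
        break;                            /* gap: start a new region */
      if (b + comp[j].size > hi)
        hi = b + comp[j].size;
      ++j;
    }
    transfer_region(lo, (long long)(hi - lo));
    i = j;
  }
}
</pre>
<div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">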
<br>
<blockquote type="cite">
<div dir="ltr">
<p> 3. So long as the mappers aren't recursive, I agree with Alexey that the total number of to-be-mapped components should be efficient to calculate. The counting function should simplify to a trivial expression in nearly all cases. The only case where it might not is where the type contains an array section with dynamic bounds, and the element type also has a mapper with an array section with dynamic bounds. In this case (similar to the unsupported recursive cases, which, as an aside, we should probably support as an extension) we might need to walk the data structure twice to precalculate the total number of components to map. However, this case is certainly detectable by static analysis of the declared mappers, and so I think that we can get the best of both worlds: we could use Alexey's proposed scheme except in cases where we truly need to walk the data structure twice, in which case we could use Lingda's combined walk/push_back scheme. Is there any reason why that wouldn't work?</p>
</div>
</blockquote>
I think it is better to use only one scheme. I rather doubt that we can implement this kind of analysis in the frontend. Later, when the real codegen is moved to the backend, we can try to implement 2 schemes. But not today. We need to choose one, and I just want to hear all the pros and cons of the schemes (actually, there are 3 schemes already) to choose the most flexible, reliable, and fast one.<br></div></blockquote><div><br></div><div>The benefit of scheme 2 is that memory is preallocated instead of being grown with push_back().</div><div>Hal, do you think the performance overhead of push_back() is larger than the overhead of precalculating the total size, and why?</div>
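<div><br></div><div>For comparison, the two code shapes under discussion might look roughly like the sketch below (all type and function names are invented for illustration; this is not what clang actually emits, nor the real libomptarget interface):</div>
<pre>/* Illustrative sketch only; invented names. */
struct MapComponent { void *base; void *begin; long long size; long long type; };
struct S { double *data; long long len; };

enum { MAP_TO = 0x1 };

/* push_back() shape: a single walk that appends each component to a
 * runtime-managed, growable buffer. */
void rt_push_component(void *rt_buffer, struct MapComponent c);   /* hypothetical */

void mapper_push_back(void *rt_buffer, struct S *s) {
  struct MapComponent whole = { s, s, sizeof(struct S), MAP_TO };
  struct MapComponent data  = { s, s->data, s->len * (long long)sizeof(double), MAP_TO };
  rt_push_component(rt_buffer, whole);
  rt_push_component(rt_buffer, data);
}

/* Precalculated-size shape: a counting function (a constant in most cases)
 * plus a fill pass into a buffer allocated once up front, with no runtime
 * calls while filling. */
long long mapper_count(struct S *s) { (void)s; return 2; }

void mapper_fill(struct S *s, struct MapComponent *buf) {
  struct MapComponent whole = { s, s, sizeof(struct S), MAP_TO };
  struct MapComponent data  = { s, s->data, s->len * (long long)sizeof(double), MAP_TO };
  buf[0] = whole;
  buf[1] = data;
}
</pre>
<div><br></div><div>Thanks,</div><div>Lingda Li</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">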
<blockquote type="cite">
<div dir="ltr">
<p>Thanks again,</p>
<p>Hal<br>
</p>
<div class="gmail-m_8545407765551051713moz-cite-prefix">On 6/28/19 9:00 AM, Alexey Bataev wrote:<br>
</div>
<blockquote type="cite">
<p><font size="2">Hi Lingda, thanks for your comments.</font><br>
<font size="2">We can allocate the buffer either by allocating it on the stack or calling OpenMP allocate function.</font><br>
<font size="2">With this solution, we allocate memory only once (no need to resize buffer after push_backs) and we do not need to call the runtime function to put map data to the buffer, compiler generated code can do it.</font><br>
<font size="2">But anyway, I agree, it would be good to hear some other opinions.</font><br>
<font size="2">--------------</font><br>
<font size="2">Best regards,</font><br>
<font size="2">Alexey Bataev</font></p>
</blockquote>
<br>
<blockquote type="cite">
<p><font size="2">...</font></p>
</blockquote>
<pre class="gmail-m_8545407765551051713moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</div>
</blockquote>
</div>
</blockquote></div></div>