<div dir="ltr">On 25 January 2013 07:56, Nadav Rotem <span dir="ltr"><<a href="mailto:nrotem@apple.com" target="_blank">nrotem@apple.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

You need to implement something like Whole Function Vectorization (<a href="http://dl.acm.org/citation.cfm?id=2190061" target="_blank">http://dl.acm.org/citation.cfm?id=2190061</a>). The loop vectorizer can't help you here. Ralf Karrenberg open-sourced his implementation on GitHub. You should take a look.<br>

</blockquote><div><br></div><div style>It'd be great to have this in LLVM, though some care must be taken to keep it relevant (unlike the C back-end, for example). There are lots of secrets around GPUs and concrete OpenCL implementations, which could make it very hard to predict or model costs for each different GPU.</div>

<div style><br></div><div style>cheers,</div><div style>--renato</div></div></div></div>