<div dir="ltr">Hi James,<div><br></div><div>Polly also applies some transformations to the data (mainly when targeting GPU back-ends). However, because Polly is based on the polyhedral model, it can only apply "regular" transformations. That is, sparse data is hard to handle (the position of the data depends on run-time values), but dense data is usually fine.</div><div><br></div><div>As for your question on data locality: yes, Polly is actually focused on data locality (temporal locality for now, with spatial locality planned) and parallelism. But again, because Polly works best on dense algorithms, I am not sure how it would perform on the examples you suggest.</div><div><br></div><div>In any case, these discussions are welcome!</div><div><br></div><div>Note that most cache-oblivious approaches only change the iteration order, without changing the data layout. But you are right in assuming the two are tightly correlated.</div><div><br></div><div>Best regards,</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Oct 21, 2017 at 3:08 AM, James Courtier-Dutton via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

My understanding of Polly is that it rearranges the executable<br>

instructions in order to perform the task quicker.<br>

Are there any tools out there that rearrange the data so that it is<br>

processed quicker?<br>

<br>

For example, say you start with a sparse (but still all in RAM) data<br>

set, and you wish to do computations on it.<br>

I have found that this can be done much quicker if you collect all the<br>

data to be processed into batches, make the batches of data<br>

contiguous, so that each batch fits in the CPU cache, and then process<br>

that batch while only having to access the CPU cache.<br>

Then afterwards, copy out the results back to the sparse layout.<br>

I know this general approach is called "Cache Oblivious Algorithms",<br>

but I was wondering if any compiler optimizations could do this for<br>

you.<br>
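<br>

The gather / process / scatter pattern described above can be sketched as follows.<br>

This is only a minimal illustration: process_in_batches, active_indices and batch_size are<br>

hypothetical names, and in Python the sketch shows the shape of the transformation<br>

rather than any measurable cache benefit.<br>

```python
def process_in_batches(data, active_indices, batch_size, step):
    """Gather scattered items into contiguous batches, process them, scatter back."""
    for start in range(0, len(active_indices), batch_size):
        idx = active_indices[start:start + batch_size]
        # Gather: copy the scattered items into one small contiguous buffer.
        batch = [data[i] for i in idx]
        # Compute: touch only the contiguous batch.
        batch = [step(x) for x in batch]
        # Scatter: copy the results back to their original positions.
        for i, value in zip(idx, batch):
            data[i] = value
    return data


# Hypothetical usage: 4 live items spread over a 16-slot array.
data = [0.0] * 16
active = [1, 5, 9, 14]
for i in active:
    data[i] = float(i)
process_in_batches(data, active, batch_size=2, step=lambda x: 2.0 * x + 1.0)
```

In a compiled language, batch_size would be chosen so that one batch fits in the cache level of interest.<br>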

<br>

For example, suppose an algorithm has 10 processing steps, and each<br>

step scans the entire data set. An optimization could be to do<br>

all 10 processing steps on the first data item, and then move to the<br>

next item etc.  This would process the data much faster due to the<br>

better use of the cache.<br>
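<br>

The reordering described above is commonly called loop fusion. A minimal sketch, assuming<br>

every step is element-wise (it reads and writes only the current item); run_unfused and<br>

run_fused are hypothetical names used for illustration:<br>

```python
def run_unfused(data, steps):
    # One full pass over the data set per processing step.
    for step in steps:
        for i in range(len(data)):
            data[i] = step(data[i])
    return data


def run_fused(data, steps):
    # A single pass: apply every step to an item before moving to the next one.
    for i in range(len(data)):
        x = data[i]
        for step in steps:
            x = step(x)
        data[i] = x
    return data


# Ten hypothetical element-wise steps; both orders give the same result.
steps = [lambda x, k=k: 2 * x + k for k in range(10)]
same = run_unfused(list(range(8)), steps) == run_fused(list(range(8)), steps)
```

If a step reads neighbouring items, the two orders are no longer equivalent and the fusion is not legal as written.<br>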

<br>

Obviously, this cannot be done in all cases, but does something like<br>

Polly take into account data locality like the examples above? Enough<br>

so that it would even add extra malloc and memcpy calls where needed?<br>

<br>

Kind Regards<br>

<br>

James<br>


</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><b>Alexandre Isoard</b><br></div></div>

</div>