[LLVMdev] GSoC:Loop Reversal Transformation

Wed Mar 18 11:47:38 PDT 2015

Hi Matt,

Thanks for the input.

I ran the same program attached earlier "Loop.cpp" with -O3. Using shell
script on terminal i ran it for 10000 times and redirected the data to a
file.

1. clang++ -O3 Loop.cpp
2. $ for i in {1..10000}; do ./a.out >> qwerty; done
3. Sorted out the result in excel sheet.

I have also looked at the assembly generated, and agree to your point that
there is not much difference in instructions.
However, in my opinion, and as some of the papers in links pasted in
previous mail suggest, that this might open up opportunity for some other
optimizations to come into picture.

For ex as shown earlier,

do i = 1, n
  do j = 1, n
    a[i,j] = a[i-1,j+1] + 1
  end do
end do

for above example, we cannot interchange or parallelize inner loop because
of (1, -1 ) dependency.

However, if we traverse the inner loop in reverse,

do i = 1, n
  do j = n, 1, -1
    a[i,j] = a[i-1,j+1] + 1
  end do
end do

the resultant dependency is (1, 1), which can now be used for inner loop
parallelization or loop interchange (which might improve locality reference
and result in less cache miss). Loop Interchange pass is already there in
llvm trunk.

What are your thoughts on it - this pass creating opportunities for other
passes to kick in? I might be wrong though.

Awaiting for your valuable inputs.(sorry for late reply because of
difference in time zone)

Regards,
Vishal Sarda,
3rd Year Undergraduate,
Department of Computer Engineering,
College of Engineering, Pune

On Wed, Mar 18, 2015 at 11:14 PM, Matt Godbolt <matt at godbolt.org> wrote:

> Hi Vishal,
>
> I am still a little confused - can you share your new code?  Be aware that
> you're relying on undefined behaviour: the "unisgned int" you are adding to
> is overflowing, and clang will use this fact.
>
> I was able to replicate your initial results but I found that if I ran the
> counting code twice, the second time both types of loop took much the same
> time:
>
> $ make; and ./loop
> make: Nothing to be done for `all'.
> First run:
> 0:834
> 0:405
> Second run:
> 0:381
> 0:384
>
> (on a 3.5GHz Haswell).
>
> I hacked together a very simplistic version using a memory barrier to
> defeat the optimizer instead of adding (which means the loop actually gets
> compiled: http://url.godbolt.org/hqwkx ). Additionally, I used the CPU
> cycle counter on x86 (again, very simplistically, there's a huge art to
> this: I can share other resources if you're interested). Counting to 2.4
> billion takes an appreciable amount of time now :)
>
> The results from that:
>
> /t/lop $ make; and ./loop2
> First run:
> Counting up:   2528555165 cycles
> Counting down: 2524361064 cycles
> Second run:
> Counting up:   2495025808 cycles
> Counting down: 2511668857 cycles
>
> Source and makefile attached.
>
> If you look at the code clang is producing in both cases (see the URL) it
> has already transformed the code into effectively the same: adding and/or
> subtracting until hitting zero. At least in this case, where the loop
> counter value itself isn't used.
>
> Sorry I haven't added much to the discussion, but hopefully this email is
> useful in gauging the relevance of the feature.
>
> Best regards, Matt
>
>
> On Wed, Mar 18, 2015 at 10:33 AM, vishal sarda <vishalksarda at gmail.com>
> wrote:
>
>> Hi Matt, All,
>>
>> Thanks for looking into this and your suggestions.
>>
>> I compiled the program with -O3 optimization level (clang -O3) on X86_64
>> target.
>> After your suggestion, i ran the program with iteration of 10000 runs and
>> found that average runtimes are (attaching data collected as well)
>>
>> Forward loop traverse : 2.006 milli seconds
>> Reverse loop traverse : 1.531 milli seconds
>>
>> Yes, i agree that this sample program may not be sufficient to say that
>> loop reversal traversing will always be faster, however, difference in
>> runtime is visible though. And that's where the profitability calculation
>> comes into picture.
>>
>> I found mention of loop traverse in various papers (links mentioned in
>> previous mail), and hence thought of implementing it.
>>
>> I am not sure though if its really helpful. Above papers mentioned that
>> it might not be beneficial in itself, but opens up the opportunity for
>> other optimizations.
>>
>> Suggestions on this are most welcomed. Waiting for others to pitch in too.
>>
>> Regards,
>> Vishal Sarda,
>> 3rd Year Undergraduate,
>> Department of Computer Engineering,
>> College of Engineering, Pune
>>
>> On Wed, Mar 18, 2015 at 1:20 AM, Matt Godbolt <matt at godbolt.org> wrote:
>>
>>> Hi,
>>>
>>> On Tue, Mar 17, 2015 at 2:00 PM, vishal sarda <vishalksarda at gmail.com>
>>> wrote:
>>>
>>>> [snip]
>>>>
>>>> Loop counting up 0.15 ms
>>>>
>>>> Loop counting down 0.08 ms
>>>>
>>> I'm no llvm expert, but as an interest bystander: I suspect you compiled
>>> the source without any optimizations applied - I tried to replicate this
>>> behaviour and found the optimzer happily replaces both the inner loops you
>>> had with a constant, and thus I got the same time on both loops. (e.g. see
>>> http://goo.gl/aXFkVb )
>>>
>>> Benchmarks of this nature where the run time is so small are notoriously
>>> prone to measurement errors, so I'd be a little careful drawing conclusions
>>> from the sample you listed. Also; what architecture did you measure on, and
>>> what spec machine?
>>>
>>> Not at all to dissuade you from investigating this optimization! I'm
>>> just a little sceptical of the benchmark you posted!
>>>
>>> Best regards,
>>>
>>> Matt
>>>
>>>
>>
>
>
> --
> Matt
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150319/52dd1a8e/attachment.html>