[llvm-dev] enabling interleaved access loop vectorization

Wed Aug 17 17:57:12 PDT 2016

So, at least for this example, it looks like we actually do want to vectorize
with -enable-interleaved-mem-accesses; we just need the backend to generate
good code for the vector types this produces, specifically, in this case,
<12 x i8>. The details are in PR29025.
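
To make the <12 x i8> concrete: assuming a vectorization factor of 4, the
r, g and b bytes of four consecutive pixels form one 12-byte interleaved
group. The scalar C sketch below only models that data movement. The helper
names are made up for illustration (they are not LLVM APIs), and the masks in
the comments are simply the stride-3 index patterns, not something copied
from the actual IR.

```c
#include <stddef.h>
#include <string.h>

enum { VF = 4, FACTOR = 3 };  /* 4 pixels x 3 channels = 12 bytes */

/* Split one 12-byte group (r0 g0 b0 r1 g1 b1 ...) into three
   4-element channel vectors, the way strided shuffles would. */
void deinterleave3(const unsigned char *in, unsigned char r[VF],
                   unsigned char g[VF], unsigned char b[VF])
{
    unsigned char wide[VF * FACTOR];
    memcpy(wide, in, sizeof wide);          /* models one wide 12-byte load */
    for (size_t lane = 0; lane < VF; ++lane) {
        r[lane] = wide[FACTOR * lane + 0];  /* indices 0, 3, 6, 9  */
        g[lane] = wide[FACTOR * lane + 1];  /* indices 1, 4, 7, 10 */
        b[lane] = wide[FACTOR * lane + 2];  /* indices 2, 5, 8, 11 */
    }
}

/* Inverse direction: merge three channel vectors back into one
   12-byte group, modelling one wide 12-byte store. */
void interleave3(const unsigned char a[VF], const unsigned char b[VF],
                 const unsigned char c[VF], unsigned char *out)
{
    unsigned char wide[VF * FACTOR];
    for (size_t lane = 0; lane < VF; ++lane) {
        wide[FACTOR * lane + 0] = a[lane];
        wide[FACTOR * lane + 1] = b[lane];
        wide[FACTOR * lane + 2] = c[lane];
    }
    memcpy(out, wide, sizeof wide);
}
```

The 12-byte wide value is the awkward part: it is not a power-of-two size,
which is the kind of type the backend has to handle well for this to pay off.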

The upshot of this is that for the original program (with an outer loop
around it):

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx \
    && time ~/llvm/temp/rgb2yik.exe
real 0m2.229s
user 0m2.224s
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx \
    -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m2.590s
user 0m2.584s

This indicates that we do have a slight cost modeling issue - the cost
model is not quite conservative enough for the case where we really do end
up using inserts and extracts. One thing we're probably not accounting for
is a bunch of GPR spills - although I'm not sure *why* we end up spilling
so much. So perhaps this should also be fixed in regalloc.

But if we modify the program by adding "*out++ = 0" right after "*out++ =
q;" (thus eliminating the pesky <12 x i8>), we get:

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx \
    && time ~/llvm/temp/rgb2yik.exe
real 0m2.257s
user 0m2.256s
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx \
    -mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m0.958s
user 0m0.956s
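
For reference, the modified kernel described above would look roughly like
the sketch below. It is the quoted rgb2yik with the one extra store added;
the name rgb2yik_padded is made up here, and the caller is assumed to
provide room for 4*N bytes in `out` instead of 3*N.

```c
/* Sketch of the padded variant: same arithmetic as the original
   rgb2yik, plus a fourth zero byte per pixel.
   `out` must have room for 4*N bytes (assumption of this sketch). */
void rgb2yik_padded(char *in, char *out, int N)
{
    int j;
    for (j = 0; j < N; ++j) {
        unsigned char r = *in++;
        unsigned char g = *in++;
        unsigned char b = *in++;
        unsigned char y = 0.299*r + 0.587*g + 0.114*b;
        signed char i = 0.596*r + -0.274*g + -0.321*b;
        signed char q = 0.211*r + -0.523*g + 0.312*b;
        *out++ = y;
        *out++ = (unsigned char)i;
        *out++ = (unsigned char)q;
        *out++ = 0;  /* the added store: each output pixel is now 4 bytes */
    }
}
```

With four bytes stored per pixel, a wide store covering a power-of-two
number of pixels is itself a power-of-two number of bytes, so the store side
no longer needs a <12 x i8>.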

On Wed, Aug 17, 2016 at 2:56 PM, Michael Kuperstein <mkuper at google.com>
wrote:

> Thanks Ayal!
>
> On Wed, Aug 17, 2016 at 2:14 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:
>
>> Hi Michael,
>>
>>
>>
>> Don’t quite have a full reproducer for you yet. You’re welcome to try and
>> see what’s happening in 32-bit mode when enabling interleaving for the
>> following, based on https://en.wikipedia.org/wiki/YIQ#From_RGB_to_YIQ :
>>
>>
>>
>> void rgb2yik (char * in, char * out, int N)
>> {
>>   int j;
>>   for (j = 0; j < N; ++j) {
>>     unsigned char r = *in++;
>>     unsigned char g = *in++;
>>     unsigned char b = *in++;
>>     unsigned char y = 0.299*r + 0.587*g + 0.114*b;
>>     signed char i = 0.596*r + -0.274*g + -0.321*b;
>>     signed char q = 0.211*r + -0.523*g + 0.312*b;
>>     *out++ = y;
>>     *out++ = (unsigned char)i;
>>     *out++ = (unsigned char)q;
>>   }
>> }
>>
>>
>>
>> but you’d currently need to force it to vectorize to overcome its
>> expected cost.
>>
>>
>>
>> Ayal.
>>
>>
>>
>> *From:* Michael Kuperstein [mailto:mkuper at google.com]
>> *Sent:* Wednesday, August 17, 2016 00:51
>> *To:* Zaks, Ayal <ayal.zaks at intel.com>; Demikhovsky, Elena <
>> elena.demikhovsky at intel.com>
>> *Cc:* Renato Golin <renato.golin at linaro.org>; Matthew Simpson <
>> mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay
>> Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>
>>
>> *Subject:* Re: [llvm-dev] enabling interleaved access loop vectorization
>>
>>
>>
>> Hi Ayal, Elena,
>>
>>
>>
>> I'd really like to enable this by default.
>>
>>
>>
>> As I wrote above, I didn't see any regressions in internal benchmarks,
>> and there doesn't seem to be anything in SPEC2006 either. I do see a
>> performance improvement in an internal benchmark (that is, a real
>> workload).
>>
>>
>>
>> Would you be able to provide an example that gets pessimized? I have no
>> doubt you've seen regressions related to this, but the fact they exist
>> doesn't help me analyze them as long as I can't see them. :-) I'd really
>> rather look at regressions before making the change - and either try to
>> make the necessary improvements to the cost model, or abandon this as
>> unfeasible for now (pending Ashutosh's work).
>>
>>
>>
>> If you can't, an alternative is to turn this on, and then, if regressions
>> show up on anyone's radar (where we can actually get a reproducer), turn it
>> off again and go back to analysis. But I'd strongly prefer to "prefetch"
>> the problem.
>>
>>
>>
>> Thanks,
>>
>>   Michael
>>
>>
>> On Wed, Aug 10, 2016 at 4:32 PM, Michael Kuperstein <mkuper at google.com>
>> wrote:
>>
>> So, unfortunately, it turns out I don't have access to DENBench.
>>
>>
>>
>> Do you happen to have a reduced example that gets pessimized by this?
>>
>>
>>
>> On Tue, Aug 9, 2016 at 11:25 AM, Michael Kuperstein <mkuper at google.com>
>> wrote:
>>
>> Thanks Ayal!
>>
>>
>>
>> I'll take a look at DENBench.
>>
>>
>>
>> As another data point - I tried enabling this on our internal benchmarks.
>> I'm seeing one regression, and it seems to be a regression of the "good"
>> kind - without interleaving we don't vectorize the innermost loop, and with
>> interleaving we do. The vectorized loop is actually significantly faster
>> when benchmarked in isolation, but in this specific instance, the static
>> loop count is unknown, and the dynamic loop count happens to almost always
>> be 1 - and this lives inside a hot outer loop.
>>
>> That's something we ought to be handling through PGO (or, conceivably,
>> outer loop vectorization :-) ).
>>
>>
>>
>> Michael
>>
>>
>>
>> On Mon, Aug 8, 2016 at 3:21 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:
>>
>> > We also need to understand what to do with edge elements in the vector
>> > if their loading is not required. We, probably, should issue a masked load
>> > in this case.
>>
>>
>>
>> The existing code solves such edge cases, where the last element of an
>> InterleaveGroup is absent, by making sure the last iteration (and up to the
>> last VF iterations) is peeled and executed as scalar code; see
>> requiresScalarEpilogue.
>>
>>
>>
>>
>>
>> > All regressions that we see are in 32-bit mode.
>>
>>
>>
>> One place to find them, using the default BaseT::getInterleavedMemoryOpCost(),
>> is DENBench’s RGB conversions.
>>
>>
>>
>> Ayal.
>>
>>
>>
>> *From:* Demikhovsky, Elena
>> *Sent:* Monday, August 08, 2016 00:09
>> *To:* Michael Kuperstein <mkuper at google.com>; Renato Golin <
>> renato.golin at linaro.org>
>> *Cc:* Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh <
>> Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>; llvm-dev <
>> llvm-dev at lists.llvm.org>; Zaks, Ayal <ayal.zaks at intel.com>
>> *Subject:* RE: [llvm-dev] enabling interleaved access loop vectorization
>>
>>
>>
>> We checked the gathered data again. All regressions that we see are in
>> 32-bit mode. The 64-bit mode looks good overall.
>>
>>
>>
>> - Elena
>>
>>
>>
>> *From:* Michael Kuperstein [mailto:mkuper at google.com]
>>
>> *Sent:* Saturday, August 06, 2016 02:56
>> *To:* Renato Golin <renato.golin at linaro.org>
>> *Cc:* Demikhovsky, Elena <elena.demikhovsky at intel.com>; Matthew Simpson <
>> mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay
>> Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>;
>> Zaks, Ayal <ayal.zaks at intel.com>
>> *Subject:* Re: [llvm-dev] enabling interleaved access loop vectorization
>>
>>
>> On Fri, Aug 5, 2016 at 4:37 PM, Renato Golin <renato.golin at linaro.org>
>> wrote:
>>
>> On 6 August 2016 at 00:18, Michael Kuperstein <mkuper at google.com> wrote:
>> > I agree that we can get *more* improvement with better cost modeling, but
>> > I'd expect to be able to get *some* improvement the way things are right
>> > now.
>>
>> Elena said she saw "some" improvements. :)
>>
>>
>>
>> I didn't mean "some improvements, some regressions", I meant "some of the
>> improvement we'd expect from the full solution". :-)
>>
>>
>>
>>
>> > That's why I'm curious about where we saw regressions - I'm wondering
>> > whether there's really a significant cost modeling issue I'm missing, or
>> > it's something that's easy to fix so that we can make forward progress,
>> > while Ashutosh is working on the longer-term solution.
>>
>> Sounds like a task to try a few patterns and fiddle with the cost model.
>>
>> Arnold did a lot of those during the first months of the vectorizer,
>> so it might be just a matter of finding the right heuristics, at least
>> for the low-hanging fruit.
>>
>> Of course, that'd also involve benchmarking everything else, to make
>> sure the new heuristics don't introduce regressions on
>> non-interleaved vectorisation.
>>
>>
>>
>> I don't disagree with you.
>>
>>
>>
>> All I'm saying is that before fiddling with the heuristics, it'd be good
>> to understand what exactly breaks if we simply flip the flag. If the answer
>> happens to be "nothing" - well, problem solved. Unfortunately, according to
>> Elena, that's not the answer.
>>
>> I'm going to play with it with our internal benchmarks, but it's my
>> understanding that Elena/Ayal already have some idea of what the problems
>> are.
>>
>>
>>
>> ---------------------------------------------------------------------
>> Intel Israel (74) Limited
>>
>> This e-mail and any attachments may contain confidential material for
>> the sole use of the intended recipient(s). Any review or distribution
>> by others is strictly prohibited. If you are not the intended
>> recipient, please contact the sender and delete all copies.
>>
>
>