[llvm-dev] enabling interleaved access loop vectorization

Thu Sep 1 16:47:15 PDT 2016

Yes, carefully inserting branches is the way to go!

Seriously though - you probably saw that I just committed a fix for PR29025
(r280418).
For the reproducer you provided, we now have (without forcing
vectorization, and without "padding" to have power-of-2 stride):

$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx
&& time ~/llvm/temp/rgb2yik.exe
real 0m2.290s
user 0m2.289s
sys 0m0.003s
$ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c -mavx
-mllvm -enable-interleaved-mem-accesses && time ~/llvm/temp/rgb2yik.exe
real 0m1.095s
user 0m1.095s
sys 0m0.002s

Care to give it a spin internally?

Note that this is not a full solution - we still won't vectorize PR27619,
and force-vectorizing it is still a bad idea. Getting that right will
require more lowering improvements as well as cost model adjustments. But
hopefully post-r280418 things should be good enough to avoid regressions
for the cases we will vectorize.
If you still see regressions, more reproducers will be appreciated. :-)
If there are no more regressions, let me know, and I'll post a patch to
enable interleaved access for x86.

Thanks,
 Michael

On Thu, Sep 1, 2016 at 4:26 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:

> So turns out it is a full reproducer after all (choosing to vectorize on
> AVX), good.
>
> > The details are in PR29025.
>
>
>
> Interesting. (So we should carefully insert unconditional branches inside
> shuffle sequences, eh? ;-)
>
> > But if we modify the program by adding "*out++ = 0" right after
> > "*out++ = q;" (thus eliminating the pesky <12 x i8>), we get:
>
>
>
> Indeed such padding is a known (programmer) optimization to effectively
> have power-of-2 strides and/or alignment.
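As an illustration of the padding idea at the data-structure level, here is a minimal sketch. The names rgb, rgbx and pad_pixels are hypothetical and appear nowhere in this thread. The sizes in the comments are what mainstream ABIs give for structs of unsigned char, not something the C standard guarantees.

```c
#include <assert.h>
#include <stddef.h>

/* Packed pixel: consecutive pixels are 3 bytes apart (stride 3). */
struct rgb  { unsigned char r, g, b; };

/* Padded pixel: one unused byte brings the stride to 4, a power of 2. */
struct rgbx { unsigned char r, g, b, pad; };

/* Copies n packed pixels into a padded buffer, zeroing the pad byte. */
void pad_pixels(const struct rgb *src, struct rgbx *dst, size_t n)
{
  size_t k;
  assert((src != 0 && dst != 0) || n == 0);
  for (k = 0; k < n; ++k) {
    dst[k].r = src[k].r;
    dst[k].g = src[k].g;
    dst[k].b = src[k].b;
    dst[k].pad = 0;  /* same role as the extra "*out++ = 0" store */
  }
}
```

The cost is one extra byte per pixel of memory and bandwidth, which is why this is a programmer's choice and not something the compiler can do on its own.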
>
> > So, unfortunately, it turns out I don't have access to DENBench.
>
>
>
> If you like we could test your patch to see how it (mis)behaves.
>
> *From:* Michael Kuperstein [mailto:mkuper at google.com]
> *Sent:* Thursday, August 18, 2016 03:57
> *To:* Zaks, Ayal <ayal.zaks at intel.com>
> *Cc:* Demikhovsky, Elena <elena.demikhovsky at intel.com>; Renato Golin <
> renato.golin at linaro.org>; Matthew Simpson <mssimpso at codeaurora.org>;
> Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay Patel <
> spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>
>
> *Subject:* Re: [llvm-dev] enabling interleaved access loop vectorization
>
>
>
> So, at least for this example, it looks like we actually want to vectorize
> with -enable-interleaved-mem-accesses; we just need the backend to
> generate good code for the vector types that this produces, specifically, in
> this case, <12 x i8>. The details are in PR29025.
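To see where <12 x i8> comes from: with a stride-3 access and a vectorization factor of 4 (an inference from 12 / 3; the thread does not state the VF), one interleave group covers 12 consecutive bytes. These are loaded as one wide vector and then split into per-channel vectors with shuffles. Below is a scalar C emulation of that shape. The names VF, STRIDE, WIDE and deinterleave3 are hypothetical, and this is a sketch of the idea, not LLVM's implementation.

```c
#include <assert.h>

enum { VF = 4, STRIDE = 3, WIDE = VF * STRIDE };  /* WIDE is 12 */

/* Emulates one vector iteration of a stride-3 interleaved load:
   a single contiguous 12-byte "wide load", followed by three strided
   "shuffles" that pick lanes 0,3,6,9 / 1,4,7,10 / 2,5,8,11. */
void deinterleave3(const unsigned char *in,
                   unsigned char r[VF], unsigned char g[VF],
                   unsigned char b[VF])
{
  unsigned char wide[WIDE];
  int k;
  assert(in != 0);
  for (k = 0; k < WIDE; ++k)   /* stands in for the <12 x i8> load */
    wide[k] = in[k];
  for (k = 0; k < VF; ++k) {   /* stands in for the three shuffles */
    r[k] = wide[STRIDE * k + 0];
    g[k] = wide[STRIDE * k + 1];
    b[k] = wide[STRIDE * k + 2];
  }
}
```

The store side goes the other way: three VF-wide vectors are shuffled back into one 12-element vector and written with one wide store. Since 12 is not a power of 2, the quality of the backend lowering for that type matters.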
>
>
>
> The upshot of this is that for the original program (with an outer loop
> around it):
>
>
>
> $ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c
> -mavx && time ~/llvm/temp/rgb2yik.exe
> real      0m2.229s
> user      0m2.224s
> $ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c
> -mavx -mllvm -enable-interleaved-mem-accesses && time
> ~/llvm/temp/rgb2yik.exe
> real      0m2.590s
> user      0m2.584s
>
>
>
> This indicates that we do have a slight cost modeling issue - the cost
> model is not quite conservative enough when we really do use inserts and
> extracts. One thing we're probably not accounting for is a bunch of GPR
> spills - although I'm not sure *why* we end up spilling so much. So
> perhaps this should also be fixed in regalloc.
>
>
>
> But if we modify the program by adding "*out++ = 0" right after "*out++ =
> q;" (thus eliminating the pesky <12 x i8>), we get:
>
>
>
> $ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c
> -mavx && time ~/llvm/temp/rgb2yik.exe
> real      0m2.257s
> user      0m2.256s
> $ bin/clang -m32 -O2 -o ~/llvm/temp/rgb2yik.exe ~/llvm/temp/rgb2yik.c
> -mavx -mllvm -enable-interleaved-mem-accesses && time
> ~/llvm/temp/rgb2yik.exe
> real      0m0.958s
> user      0m0.956s
>
>
>
> On Wed, Aug 17, 2016 at 2:56 PM, Michael Kuperstein <mkuper at google.com>
> wrote:
>
> Thanks Ayal!
>
>
>
> On Wed, Aug 17, 2016 at 2:14 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:
>
> Hi Michael,
>
>
>
> Don’t quite have a full reproducer for you yet. You’re welcome to try and
> see what’s happening in 32-bit mode when enabling interleaving for the
> following, based on “https://en.wikipedia.org/wiki/YIQ#From_RGB_to_YIQ”:
>
>
>
> void rgb2yik (char * in, char * out, int N)
> {
>   int j;
>   for (j = 0; j < N; ++j) {
>     unsigned char r = *in++;
>     unsigned char g = *in++;
>     unsigned char b = *in++;
>     unsigned char y = 0.299*r + 0.587*g + 0.114*b;
>     signed char i = 0.596*r + -0.274*g + -0.321*b;
>     signed char q = 0.211*r + -0.523*g + 0.312*b;
>     *out++ = y;
>     *out++ = (unsigned char)i;
>     *out++ = (unsigned char)q;
>   }
> }
>
>
>
> but you’d currently need to force it to vectorize to overcome its expected
> cost.
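One way to force vectorization from the source, as a sketch: clang's "#pragma clang loop vectorize(enable)" loop hint asks the loop vectorizer to vectorize the loop, where legal, even when the cost model prefers scalar code. The function name rgb2yik_forced is hypothetical, and this is not necessarily the mechanism used in this thread. The pragma is guarded so that other compilers skip it. Interleaved-access support still has to be switched on separately with -mllvm -enable-interleaved-mem-accesses.

```c
#include <assert.h>

/* Same loop body as the reproducer above, with a clang loop hint that
   requests vectorization. Hypothetical name; arithmetic is unchanged. */
void rgb2yik_forced(char *in, char *out, int N)
{
  int j;
  assert(N >= 0);
#if defined(__clang__)
#pragma clang loop vectorize(enable)
#endif
  for (j = 0; j < N; ++j) {
    unsigned char r = *in++;
    unsigned char g = *in++;
    unsigned char b = *in++;
    unsigned char y = 0.299*r + 0.587*g + 0.114*b;
    signed char i = 0.596*r + -0.274*g + -0.321*b;
    signed char q = 0.211*r + -0.523*g + 0.312*b;
    *out++ = y;
    *out++ = (unsigned char)i;
    *out++ = (unsigned char)q;
  }
}
```

A hint like this is useful for experiments, but it bypasses exactly the cost estimate under discussion, so timings obtained this way say nothing about whether the default decision is right.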
>
>
>
> Ayal.
>
>
>
> *From:* Michael Kuperstein [mailto:mkuper at google.com]
> *Sent:* Wednesday, August 17, 2016 00:51
> *To:* Zaks, Ayal <ayal.zaks at intel.com>; Demikhovsky, Elena <
> elena.demikhovsky at intel.com>
> *Cc:* Renato Golin <renato.golin at linaro.org>; Matthew Simpson <
> mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay
> Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>
>
>
> *Subject:* Re: [llvm-dev] enabling interleaved access loop vectorization
>
>
>
> Hi Ayal, Elena,
>
>
>
> I'd really like to enable this by default.
>
>
>
> As I wrote above, I didn't see any regressions in internal benchmarks, and
> there doesn't seem to be anything in SPEC2006 either. I do see a
> performance improvement in an internal benchmark (that is, a real
> workload).
>
>
>
> Would you be able to provide an example that gets pessimized? I have no
> doubt you've seen regressions related to this, but the fact they exist
> doesn't help me analyze them as long as I can't see them. :-) I'd really
> rather look at regressions before making the change - and either try to
> make the necessary improvements to the cost model, or abandon this as
> unfeasible for now (pending Ashutosh's work).
>
>
>
> If you can't, an alternative is to turn this on, and then, if regressions
> show up on anyone's radar (where we can actually get a reproducer), turn it
> off again and go back to analysis. But I'd strongly prefer to "prefetch"
> the problem.
>
>
>
> Thanks,
>
>   Michael
>
> On Wed, Aug 10, 2016 at 4:32 PM, Michael Kuperstein <mkuper at google.com>
> wrote:
>
> So, unfortunately, it turns out I don't have access to DENBench.
>
>
>
> Do you happen to have a reduced example that gets pessimized by this?
>
>
>
> On Tue, Aug 9, 2016 at 11:25 AM, Michael Kuperstein <mkuper at google.com>
> wrote:
>
> Thanks Ayal!
>
>
>
> I'll take a look at DENBench.
>
>
>
> As another data point - I tried enabling this on our internal benchmarks.
> I'm seeing one regression, and it seems to be a regression of the "good"
> kind - without interleaving we don't vectorize the innermost loop, and with
> interleaving we do. The vectorized loop is actually significantly faster
> when benchmarked in isolation, but in this specific instance, the static
> loop count is unknown, and the dynamic loop count happens to almost always
> be 1 - and this lives inside a hot outer loop.
>
> That's something we ought to be handling through PGO (or, conceivably,
> outer loop vectorization :-) ).
>
>
>
> Michael
>
>
>
> On Mon, Aug 8, 2016 at 3:21 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:
>
> > We also need to understand what to do with edge elements in the vector
> > if their loading is not required. We, probably, should issue a masked load
> > in this case.
>
>
>
> The existing code handles such edge cases, where the last element of an
> InterleaveGroup is absent, by making sure the last iteration (and up to the last
> VF iterations) are peeled and executed scalarly; see requiresScalarEpilogue.
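A scalar C emulation of why that peeling is needed, as a sketch of the idea only (this is not LLVM's code, and the names CHUNK and sum_even are hypothetical). The loop reads only the even elements, so the second member of each stride-2 group is absent. A wide load covering the whole group would read one element past the last one actually needed, so the vector body must always leave at least one iteration to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

enum { CHUNK = 4 };  /* hypothetical vectorization factor */

/* Sums a[0], a[2], ..., a[2*(n-1)]. `a` is assumed to hold exactly
   2*n - 1 ints, so a[2*n - 1] does not exist. */
long sum_even(const int *a, size_t n)
{
  long sum = 0;
  size_t i = 0;
  assert(a != 0 || n == 0);
  /* "Vector" body: each step stands in for one wide load of 2*CHUNK
     contiguous ints, a[2*i] .. a[2*i + 2*CHUNK - 1], of which only the
     even lanes are used. The condition is i + CHUNK < n (not <=), so
     the highest index read is at most 2*n - 3, which is in bounds. */
  for (; i + CHUNK < n; i += CHUNK) {
    int wide[2 * CHUNK];
    int k;
    for (k = 0; k < 2 * CHUNK; ++k)
      wide[k] = a[2 * i + (size_t)k];
    for (k = 0; k < CHUNK; ++k)
      sum += wide[2 * k];
  }
  /* Scalar epilogue: the remaining iterations (1 to CHUNK of them when
     n >= 1) touch only a[2*i], never the absent odd element. */
  for (; i < n; ++i)
    sum += a[2 * i];
  return sum;
}
```

With "<=" in the first loop condition and n a multiple of CHUNK, the last wide load would touch a[2*n - 1], which is the out-of-bounds access the scalar epilogue exists to avoid. A masked load, as suggested in the quoted text, would be the alternative.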
>
>
>
>
>
> > All regressions that we see are in 32-bit mode.
>
>
>
> One place to find them, using the default BaseT::getInterleavedMemoryOpCost(),
> is DENBench’s RGB conversions.
>
>
>
> Ayal.
>
>
>
> *From:* Demikhovsky, Elena
> *Sent:* Monday, August 08, 2016 00:09
> *To:* Michael Kuperstein <mkuper at google.com>; Renato Golin <
> renato.golin at linaro.org>
> *Cc:* Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh <
> Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>; llvm-dev <
> llvm-dev at lists.llvm.org>; Zaks, Ayal <ayal.zaks at intel.com>
> *Subject:* RE: [llvm-dev] enabling interleaved access loop vectorization
>
>
>
> We checked the gathered data again. All regressions that we see are in
> 32-bit mode. The 64-bit mode looks good overall.
>
>
>
> - *Elena*
>
>
>
> *From:* Michael Kuperstein [mailto:mkuper at google.com]
> *Sent:* Saturday, August 06, 2016 02:56
> *To:* Renato Golin <renato.golin at linaro.org>
> *Cc:* Demikhovsky, Elena <elena.demikhovsky at intel.com>; Matthew Simpson <
> mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay
> Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>; Zaks,
> Ayal <ayal.zaks at intel.com>
> *Subject:* Re: [llvm-dev] enabling interleaved access loop vectorization
>
> On Fri, Aug 5, 2016 at 4:37 PM, Renato Golin <renato.golin at linaro.org>
> wrote:
>
> On 6 August 2016 at 00:18, Michael Kuperstein <mkuper at google.com> wrote:
> > I agree that we can get *more* improvement with better cost modeling, but
> > I'd expect to be able to get *some* improvement the way things are right
> > now.
>
> Elena said she saw "some" improvements. :)
>
>
>
> I didn't mean "some improvements, some regressions", I meant "some of the
> improvement we'd expect from the full solution". :-)
>
>
>
>
> > That's why I'm curious about where we saw regressions - I'm wondering
> > whether there's really a significant cost modeling issue I'm missing, or
> > it's something that's easy to fix so that we can make forward progress,
> > while Ashutosh is working on the longer-term solution.
>
> Sounds like a task to try a few patterns and fiddle with the cost model.
>
> Arnold did a lot of those during the first months of the vectorizer,
> so it might be just a matter of finding the right heuristics, at least
> for the low-hanging fruit.
>
> Of course, that'd also involve benchmarking everything else, to make
> sure the new heuristics don't introduce regressions on
> non-interleaved vectorisation.
>
>
>
> I don't disagree with you.
>
>
>
> All I'm saying is that before fiddling with the heuristics, it'd be good
> to understand what exactly breaks if we simply flip the flag. If the answer
> happens to be "nothing" - well, problem solved. Unfortunately, according to
> Elena, that's not the answer.
>
> I'm going to play with it with our internal benchmarks, but it's my
> understanding that Elena/Ayal already have some idea of what the problems
> are.
>
>
>
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.