[llvm-dev] enabling interleaved access loop vectorization

Wed Aug 17 14:56:15 PDT 2016

Thanks Ayal!

On Wed, Aug 17, 2016 at 2:14 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:

> Hi Michael,
>
>
>
> I don’t quite have a full reproducer for you yet. You’re welcome to try and
> see what happens in 32-bit mode when enabling interleaving for the
> following, based on https://en.wikipedia.org/wiki/YIQ#From_RGB_to_YIQ:
>
>
>
> void rgb2yik (char * in, char * out, int N)
> {
>   int j;
>   for (j = 0; j < N; ++j) {
>     unsigned char r = *in++;
>     unsigned char g = *in++;
>     unsigned char b = *in++;
>     unsigned char y = 0.299*r + 0.587*g + 0.114*b;
>     signed char i = 0.596*r + -0.274*g + -0.321*b;
>     signed char q = 0.211*r + -0.523*g + 0.312*b;
>     *out++ = y;
>     *out++ = (unsigned char)i;
>     *out++ = (unsigned char)q;
>   }
> }
>
>
>
> but you’d currently need to force it to vectorize to overcome its expected
> cost.
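
One way to do the forcing, as a hedged sketch rather than the exact setup used in this thread: Clang's loop hint "#pragma clang loop vectorize(enable)" asks the vectorizer to go ahead even where its cost model would decline. The name rgb2yik_forced, the indexed accesses, and the const / unsigned char parameter types are assumptions made for this illustration; the arithmetic is copied from rgb2yik above.

```c
#include <assert.h>

/* Hypothetical variant of rgb2yik: same arithmetic, indexed accesses,
   plus a Clang loop hint requesting vectorization regardless of the
   cost model's estimate. The pragma is guarded so that other compilers
   simply build the plain scalar loop. */
void rgb2yik_forced(const unsigned char *in, unsigned char *out, int n)
{
  int j;
  assert(n >= 0);
#ifdef __clang__
#pragma clang loop vectorize(enable)
#endif
  for (j = 0; j < n; ++j) {
    /* Three interleaved loads per iteration (stride 3). */
    unsigned char r = in[3 * j + 0];
    unsigned char g = in[3 * j + 1];
    unsigned char b = in[3 * j + 2];
    unsigned char y = 0.299 * r + 0.587 * g + 0.114 * b;
    signed char i = 0.596 * r + -0.274 * g + -0.321 * b;
    signed char q = 0.211 * r + -0.523 * g + 0.312 * b;
    /* Three interleaved stores per iteration (stride 3). */
    out[3 * j + 0] = y;
    out[3 * j + 1] = (unsigned char)i;
    out[3 * j + 2] = (unsigned char)q;
  }
}
```

Building with something like "clang -m32 -O2 -mllvm -enable-interleaved-mem-accesses -Rpass=loop-vectorize" should report whether the loop was vectorized; the exact options behind the reported regressions are not stated in the thread.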
>
>
>
> Ayal.
>
>
>
> *From:* Michael Kuperstein [mailto:mkuper at google.com]
> *Sent:* Wednesday, August 17, 2016 00:51
> *To:* Zaks, Ayal <ayal.zaks at intel.com>; Demikhovsky, Elena <
> elena.demikhovsky at intel.com>
> *Cc:* Renato Golin <renato.golin at linaro.org>; Matthew Simpson <
> mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay
> Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>
>
> *Subject:* Re: [llvm-dev] enabling interleaved access loop vectorization
>
>
>
> Hi Ayal, Elena,
>
>
>
> I'd really like to enable this by default.
>
>
>
> As I wrote above, I didn't see any regressions in internal benchmarks, and
> there doesn't seem to be anything in SPEC2006 either. I do see a
> performance improvement in an internal benchmark (that is, a real
> workload).
>
>
>
> Would you be able to provide an example that gets pessimized? I have no
> doubt you've seen regressions related to this, but the fact they exist
> doesn't help me analyze them as long as I can't see them. :-) I'd really
> rather look at regressions before making the change - and either try to
> make the necessary improvements to the cost model, or abandon this as
> unfeasible for now (pending Ashutosh's work).
>
>
>
> If you can't, an alternative is to turn this on, and then, if regressions
> show up on anyone's radar (where we can actually get a reproducer), turn it
> off again and go back to analysis. But I'd strongly prefer to "prefetch"
> the problem.
>
>
>
> Thanks,
>
>   Michael
>
> On Wed, Aug 10, 2016 at 4:32 PM, Michael Kuperstein <mkuper at google.com>
> wrote:
>
> So, unfortunately, it turns out I don't have access to DENBench.
>
>
>
> Do you happen to have a reduced example that gets pessimized by this?
>
>
>
> On Tue, Aug 9, 2016 at 11:25 AM, Michael Kuperstein <mkuper at google.com>
> wrote:
>
> Thanks Ayal!
>
>
>
> I'll take a look at DENBench.
>
>
>
> As another data point - I tried enabling this on our internal benchmarks.
> I'm seeing one regression, and it seems to be a regression of the "good"
> kind - without interleaving we don't vectorize the innermost loop, and with
> interleaving we do. The vectorized loop is actually significantly faster
> when benchmarked in isolation, but in this specific instance, the static
> loop count is unknown, and the dynamic loop count happens to almost always
> be 1 - and this lives inside a hot outer loop.
>
> That's something we ought to be handling through PGO (or, conceivably,
> outer loop vectorization :-) ).
>
>
>
> Michael
>
>
>
> On Mon, Aug 8, 2016 at 3:21 PM, Zaks, Ayal <ayal.zaks at intel.com> wrote:
>
> > We also need to understand what to do with edge elements in the vector
> > if their loading is not required. We, probably, should issue a masked load
> > in this case.
>
>
>
> The existing code handles such edge cases, where the last element of an
> InterleaveGroup is absent, by making sure the last iteration (and up to the
> last VF iterations) is peeled and executed in scalar form; see
> requiresScalarEpilogue.
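
The scalar-epilogue idea can be sketched in plain C. This is an emulation under stated assumptions, not LLVM's implementation: sum_pairs, VF = 4, and the stride-3 group whose third member is never read are all hypothetical, and memcpy stands in for a wide vector load. The input holds 3*n - 1 elements, so a wide load covering the final group would read one element past the end; leaving between 1 and VF iterations to the scalar tail avoids that.

```c
#include <assert.h>
#include <string.h>

enum { VF = 4, STRIDE = 3 }; /* illustrative values, not from the thread */

/* out[i] = a[3*i] + a[3*i + 1]. The array a has 3*n - 1 elements: the
   unused third member of the final group does not exist in memory. */
void sum_pairs(const int *a, int *out, int n)
{
  int i = 0;
  int vec_end = 0;
  assert(n >= 0);
  /* Always leave between 1 and VF iterations for the scalar tail. */
  if (n > 0) {
    int rem = n % VF;
    vec_end = n - (rem == 0 ? VF : rem);
  }
  for (; i < vec_end; i += VF) {
    int wide[VF * STRIDE];
    int lane;
    /* Stand-in for one wide load covering VF whole groups, gaps included. */
    memcpy(wide, a + STRIDE * i, sizeof wide);
    for (lane = 0; lane < VF; ++lane)
      out[i + lane] = wide[STRIDE * lane] + wide[STRIDE * lane + 1];
  }
  /* Scalar epilogue: touches only the members that are actually used. */
  for (; i < n; ++i)
    out[i] = a[STRIDE * i] + a[STRIDE * i + 1];
}
```

With n = 8 the emulated vector loop covers iterations 0 to 3 and the scalar tail covers 4 to 7, even though 8 is a multiple of VF.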
>
>
>
>
>
> > All regressions that we see are in 32-bit mode.
>
>
>
> One place to find them, using the default BaseT::getInterleavedMemoryOpCost(),
> is DENBench’s RGB conversions.
>
>
>
> Ayal.
>
>
>
> *From:* Demikhovsky, Elena
> *Sent:* Monday, August 08, 2016 00:09
> *To:* Michael Kuperstein <mkuper at google.com>; Renato Golin <
> renato.golin at linaro.org>
> *Cc:* Matthew Simpson <mssimpso at codeaurora.org>; Nema, Ashutosh <
> Ashutosh.Nema at amd.com>; Sanjay Patel <spatel at rotateright.com>; llvm-dev <
> llvm-dev at lists.llvm.org>; Zaks, Ayal <ayal.zaks at intel.com>
> *Subject:* RE: [llvm-dev] enabling interleaved access loop vectorization
>
>
>
> We checked the gathered data again. All regressions that we see are in
> 32-bit mode. The 64-bit mode looks good overall.
>
>
>
> - Elena
>
>
>
> *From:* Michael Kuperstein [mailto:mkuper at google.com]
> *Sent:* Saturday, August 06, 2016 02:56
> *To:* Renato Golin <renato.golin at linaro.org>
> *Cc:* Demikhovsky, Elena <elena.demikhovsky at intel.com>; Matthew Simpson <
> mssimpso at codeaurora.org>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Sanjay
> Patel <spatel at rotateright.com>; llvm-dev <llvm-dev at lists.llvm.org>; Zaks,
> Ayal <ayal.zaks at intel.com>
> *Subject:* Re: [llvm-dev] enabling interleaved access loop vectorization
>
> On Fri, Aug 5, 2016 at 4:37 PM, Renato Golin <renato.golin at linaro.org>
> wrote:
>
> On 6 August 2016 at 00:18, Michael Kuperstein <mkuper at google.com> wrote:
> > I agree that we can get *more* improvement with better cost modeling, but
> > I'd expect to be able to get *some* improvement the way things are right
> > now.
>
> Elena said she saw "some" improvements. :)
>
>
>
> I didn't mean "some improvements, some regressions", I meant "some of the
> improvement we'd expect from the full solution". :-)
>
>
>
>
> > That's why I'm curious about where we saw regressions - I'm wondering
> > whether there's really a significant cost modeling issue I'm missing, or
> > it's something that's easy to fix so that we can make forward progress,
> > while Ashutosh is working on the longer-term solution.
>
> Sounds like a task to try a few patterns and fiddle with the cost model.
>
> Arnold did a lot of those during the first months of the vectorizer,
> so it might be just a matter of finding the right heuristics, at least
> for the low-hanging fruit.
>
> Of course, that'd also involve benchmarking everything else, to make
> sure the new heuristics don't introduce regressions on
> non-interleaved vectorisation.
>
>
>
> I don't disagree with you.
>
>
>
> All I'm saying is that before fiddling with the heuristics, it'd be good
> to understand what exactly breaks if we simply flip the flag. If the answer
> happens to be "nothing" - well, problem solved. Unfortunately, according to
> Elena, that's not the answer.
>
> I'm going to play with it with our internal benchmarks, but it's my
> understanding that Elena/Ayal already have some idea of what the problems
> are.
>
>
>
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
>
>