[cfe-commits] [PATCH] Optimize vec3 loads/stores

Fri Jul 27 02:06:57 PDT 2012

On Tue, 24 Jul 2012 08:05:18 +0530
Tobias Grosser <tobias at grosser.es> wrote:

> On 07/24/2012 01:54 AM, Dan Gohman wrote:
> >
> > On Jul 23, 2012, at 11:34 AM, Tanya Lattner <lattner at apple.com>
> > wrote:
> >
> >>
> >> On Jul 19, 2012, at 11:51 AM, Dan Gohman wrote:
> >>
> >>>
> >>> On Jul 18, 2012, at 6:51 PM, John McCall <rjmccall at apple.com>
> >>> wrote:
> >>>
> >>>> On Jul 18, 2012, at 5:37 PM, Tanya Lattner wrote:
> >>>>> On Jul 18, 2012, at 5:08 AM, Benyei, Guy wrote:
> >>>>>> Hi Tanya,
> >>>>>> Looks good and useful, but I'm not sure if it should be
> >>>>>> clang's decision if storing and loading vec4s is better than
> >>>>>> vec3.
> >>>>>
> >>>>> The idea was to have Clang generate code that the optimizers
> >>>>> would be more likely to do something useful and smart with. I
> >>>>> understand the concern, but I'm not sure where the best place
> >>>>> for this would be then?
> >>>>
> >>>> Hmm.  The IR size of a <3 x blah> is basically the size of a <4
> >>>> x blah> anyway;  arguably the backend already has all the
> >>>> information it needs for this.  Dan, what do you think?
> >>>
> >>> I guess optimizer passes won't be extraordinarily happy about all
> >>> this bitcasting and shuffling. It seems to me that we have a
> >>> problem in that we're splitting up the high-level task of "lower
> >>> <3 x blah> to <4 x blah>" and doing some of it in the front-end
> >>> and some of it in the backend. Ideally, we should do it all in
> >>> one place, for conceptual simplicity, and to avoid the
> >>> awkwardness of having the optimizer run in a world that's half
> >>> one way and half the other, with awkward little bridges between
> >>> the two halves.
> >>
> >> I think it's hard to speculate that the optimizer passes are not
> >> happy about the bit cast and shuffling. I'm running with
> >> optimizations on and the code is still much better than not having
> >> Clang do this "optimization" for vec3.
> >
> > Sorry for being unclear; I was speculating more about future
> > optimization passes. I don't doubt your patch achieves its purpose
> > today.
> >
> >> I strongly feel that Clang can make the decision to output code
> >> like this if it leads to better code in the end.
> >
> > Ok. What do you think about having clang doing all of the lowering
> > of <3 x blah> to <4 x blah> then? I mean all of the arithmetic,
> > function arguments and return values, and so on? In other words, is
> > there something special about loads and stores of vec3, or are they
> > just one symptom of a broader vec3 problem?
> >
> > Of course, I'm not asking you to do this work right now; I'm asking
> > whether this would be a better overall design.
> 
> Having clang perform this transformation will also reduce the number
> of optimizations a bb vectorizer could possibly do. I could see e.g.
> that a loop unrolled by 4 may be transformed from 4 * <vec3> to 3 *
> <vec4>. This does not work today and will probably not work soon, but
> we should keep it in mind.

The (trunk) vectorizer should do this now: the vec3s could be combined
into vec12s, which should then be legalized to vec4s.
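
To make the arithmetic concrete: four vec3s hold 12 elements, which is
exactly three vec4s with no padding lanes. The following is only a
rough sketch in plain C of that repacking, not vectorizer or
legalization code; the vec3 and vec4 types and the function name are
hypothetical and exist just for illustration.

```c
#include <stddef.h>

/* Hypothetical types for illustration; not taken from Clang or LLVM. */
typedef struct { float v[3]; } vec3;  /* tightly packed, 3 elements */
typedef struct { float v[4]; } vec4;  /* full 4-element vector */

/*
 * Repack four vec3 values (12 floats) into three vec4 values
 * (also 12 floats). Conceptually this is "combine into one vec12,
 * then split into vec4s": element order is preserved and no lane
 * is left undefined.
 */
static void repack_4x3_to_3x4(const vec3 in[4], vec4 out[3]) {
    float flat[12]; /* the conceptual vec12 */

    /* Concatenate the four vec3s. */
    for (size_t i = 0; i < 4; ++i)
        for (size_t k = 0; k < 3; ++k)
            flat[3 * i + k] = in[i].v[k];

    /* Split the 12 elements into three full vec4s. */
    for (size_t j = 0; j < 3; ++j)
        for (size_t k = 0; k < 4; ++k)
            out[j].v[k] = flat[4 * j + k];
}
```

By contrast, widening each vec3 to its own vec4 would use four vec4s
for the same data, with one unused lane in each.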

 -Hal

> 
> (This does not mean I am against the patch, I just wanted to point
> this out)
> 
> Tobi

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory