[Libclc-dev] [PATCH 1/9] RFC: Refactor clcmacro.h to vectorize without hi/lo

Fri Aug 1 21:27:48 PDT 2014

On Fri, Aug 1, 2014 at 4:57 PM, Tom Stellard <tom at stellard.net> wrote:
> On Fri, Aug 01, 2014 at 03:48:10PM -0500, Aaron Watry wrote:
>> On Thu, Jul 31, 2014 at 8:50 PM, Tom Stellard <tom at stellard.net> wrote:
>> > On Wed, Jul 30, 2014 at 05:34:27PM -0500, Aaron Watry wrote:
>> >> LLVM: git 6800393083a4030 / svn r213860 (From 7/24)
>> >> CLANG: git 5ce5cbd836b3 / svn r213853
>> >> libclc: git a63df067faf8a / svn r213762 with the following two patches applied:
>> >>
>> >> r600: Actually use vstore assembly optimizations
>> >> r600: improve float vload/vstore path
>> >>
>> >> With that setup, and mesa 0ddc28b026688d, I get failures in the
>> >> float16 nextafter tests.
>> >>
>> >
>> > OK, I will try this.  I think I may not have had those two libclc
>> > patches applied when I tested.
>> >
>>
>> That would be helpful.  I'm trying to recover at the moment from
>> something that got upgraded and broke all of my systems which are
>> built against upgraded mesa/llvm versions. Until I can resolve that
>> issue, I'm unable to do any CL dev work on Evergreen and CL or GL on
>> SI.
>>
>
> Could this be caused by the recent version bump of LLVM from 3.5 to 3.6?
>

Nope, LLVM seems fine. I've rebuilt it from scratch and nuked my
install prefix before re-installing my entire graphics stack and
force-removing all of the distro-provided
mesa/llvm/xserver-xorg-video-ati/glamor packages.

I've spent most of the evening on it, but eventually tracked the issue
on my radeonsi to the following mesa commit (which matches what I
bisected on my evergreen system earlier):

3b176c441b7ddc5f7d2f891da3f76cf3c1814ce1 is the first bad commit
commit 3b176c441b7ddc5f7d2f891da3f76cf3c1814ce1
Author: Giovanni Campagna <gcampagna at src.gnome.org>
Date:   Wed Jul 23 19:37:31 2014 +0100

    gallium: Add a dumb drm/kms winsys backed swrast provider

    Add a new winsys and target that can be used with a dri2 state tracker
    and loader instead of drisw. This allows to use gbm as a dri2/image
    loader and avoid the extra copy from the backbuffer to the shadow
    frontbuffer.

    The new driver is called "kms_swrast", and is loaded by gbm as a
    fallback, because it is only useful with the gbm platform (as no buffer
    sharing is possible)

    To force select the driver set the environment variable
    GBM_ALWAYS_SOFTWARE

    [Emil Velikov]
     - Rebase on top of gallium megadriver.
     - s/text/test/ in configure.ac (Spotted by Andreas Pokorny).
     - Add scons support for winsys/sw/kms-dri and fix the build.
     - Provide separate DriverAPI, due to different InitScreen hook.

    Signed-off-by: Emil Velikov <emil.l.velikov at gmail.com>

So for now, I can at least run cl programs again on my home system,
which claims that the nextafter test still fails with
llvm/clang/libclc head revisions with the two patches in question
applied (enable vstore bitcode optimizations on r600 and then add
float to the i32 load/store paths)...

So it doesn't look like Matt's alignment changes helped this scenario,
or maybe we're still running into another bug (which the clcmacro.h
refactor papers over).  At least on my system (pitcairn).

--Aaron

> -Tom
>
>> If we can confirm that the llvm patches for alignment issues fixed the
>> use of those 2 patches on their own, then I'm at least happy to drop
>> the clcmacro.h refactoring.  I just can't actually test that myself at
>> the moment.
>>
>> --Aaron
>>
>> > -Tom
>> >
>> >> I'll give a shot at updating llvm/clang when I get home and see if
>> >> that helps things out.  I had noticed earlier today that Matt had
>> >> recently committed a bunch of vector load/store and alignment changes
>> >> (r214055 looks especially applicable), so maybe something in there did
>> >> the trick.
>> >>
>> >> --Aaron
>> >>
>> >> On Wed, Jul 30, 2014 at 3:39 PM, Tom Stellard <tom at stellard.net> wrote:
>> >> > On Sat, Jul 26, 2014 at 03:12:30PM -0500, Aaron Watry wrote:
>> >> >> On Fri, Jul 25, 2014 at 9:40 PM, Tom Stellard <tom at stellard.net> wrote:
>> >> >> > On Tue, Jul 22, 2014 at 08:46:42PM -0500, Aaron Watry wrote:
>> >> >> >> There are odd things happening with 16x vectors in nextafter() and sign().
>> >> >> >>
>> >> >> >> When I changed float16 load/store to use the assembly path in later
>> >> >> >> patches instead of the macro-vectorized version, nextafter and sign
>> >> >> >> stopped working (next 2 commits/patches).
>> >> >> >>
>> >> >> >> As near as I can tell, we're getting correct results for
>> >> >> >> elements 0-7, but then elements 8+ are wrong. The final result
>> >> >> >> seems to be composed of the first 8 elements of the computed
>> >> >> >> result, and then elements 16-23, which are likely uninitialized.
>> >> >> >>
>> >> >> >> I'd like to say the issue is in clang, but I have nothing to back
>> >> >> >> that up at the moment and give that this patch fixes the issue, I’m
>> >> >> >> not sure how much time I want to spend investigating right now.
>> >> >> >>
>> >> >> >> Explicitly splitting all of the vectorize macros the way that this patch
>> >> >> >> does gets everything working again, but I have a feeling that we're
>> >> >> >> papering over a bug somewhere, hence the RFC subject.
>> >> >> >>
>> >> >> >> I’m not sure what’s going on here, and it’s only the nextafter/sign
>> >> >> >> functions that regressed. This patch fixes the test results in piglit.
>> >> >> >>
>> >> >> >> No significant change on number of instructions in bitcode
>> >> >> >> for nextafter float16 (2-3 instructions savings over ~350 lines).
>> >> >> >>
>> >> >> >
>> >> >> > Were you testing this on Evergreen?  The sign() test passes for me on
>> >> >> > SI without this patch.
>> >> >> >
>> >> >>
>> >> >> I don't have my EG card available at the moment, but I *think* that
>> >> >> passed.  All I've got available at the moment is my pitcairn (si). For
>> >> >> me on pitcairn, I get failures in
>> >> >> piglit/generated_tests/cl/builtin/math/builtin-float-nextafter-1.0.generated.cl
>> >> >> for only the float16 type (about half of them pass, half don't).
>> >> >>
>> >> >
>> >> > This test passes for me too.  How current is your install of LLVM, I think there
>> >> > were some shufflevector fixes/improvements recently that may have fixed this test.
>> >> >
>> >> > -Tom
>> >> >
>> >> >> What happens if you test with [1] and [2] applied (enable vstore
>> >> >> optimizations and then add float to those optimizations), but without
>> >> >> the clcmacro refactoring patch [3]?
>> >> >>
>> >> >> [1] http://www.pcc.me.uk/pipermail/libclc-dev/2014-July/000461.html
>> >> >> [2] http://www.pcc.me.uk/pipermail/libclc-dev/2014-July/000462.html
>> >> >> [3] http://www.pcc.me.uk/pipermail/libclc-dev/2014-July/000460.html
>> >> >>
>> >> >> --Aaron
>> >> >>
>> >> >> > -Tom
>> >> >> >
>> >> >> >> Signed-off-by: Aaron Watry <awatry at gmail.com>
>> >> >> >> ---
>> >> >> >>  generic/lib/clcmacro.h | 84 +++++++++++++++++++++++++++++++++++++++++++-------
>> >> >> >>  1 file changed, 73 insertions(+), 11 deletions(-)
>> >> >> >>
>> >> >> >> diff --git a/generic/lib/clcmacro.h b/generic/lib/clcmacro.h
>> >> >> >> index 730073a..64b1770 100644
>> >> >> >> --- a/generic/lib/clcmacro.h
>> >> >> >> +++ b/generic/lib/clcmacro.h
>> >> >> >> @@ -1,44 +1,106 @@
>> >> >> >>  #define _CLC_UNARY_VECTORIZE(DECLSPEC, RET_TYPE, FUNCTION, ARG1_TYPE) \
>> >> >> >>    DECLSPEC RET_TYPE##2 FUNCTION(ARG1_TYPE##2 x) { \
>> >> >> >> -    return (RET_TYPE##2)(FUNCTION(x.x), FUNCTION(x.y)); \
>> >> >> >> +    return (RET_TYPE##2){FUNCTION(x.s0), FUNCTION(x.s1)}; \
>> >> >> >>    } \
>> >> >> >>  \
>> >> >> >>    DECLSPEC RET_TYPE##3 FUNCTION(ARG1_TYPE##3 x) { \
>> >> >> >> -    return (RET_TYPE##3)(FUNCTION(x.x), FUNCTION(x.y), FUNCTION(x.z)); \
>> >> >> >> +    return (RET_TYPE##3){FUNCTION(x.s0), FUNCTION(x.s1), FUNCTION(x.s2)}; \
>> >> >> >>    } \
>> >> >> >>  \
>> >> >> >>    DECLSPEC RET_TYPE##4 FUNCTION(ARG1_TYPE##4 x) { \
>> >> >> >> -    return (RET_TYPE##4)(FUNCTION(x.lo), FUNCTION(x.hi)); \
>> >> >> >> +    return (RET_TYPE##4){ \
>> >> >> >> +      FUNCTION(x.s0), \
>> >> >> >> +      FUNCTION(x.s1), \
>> >> >> >> +      FUNCTION(x.s2), \
>> >> >> >> +      FUNCTION(x.s3), \
>> >> >> >> +    }; \
>> >> >> >>    } \
>> >> >> >>  \
>> >> >> >>    DECLSPEC RET_TYPE##8 FUNCTION(ARG1_TYPE##8 x) { \
>> >> >> >> -    return (RET_TYPE##8)(FUNCTION(x.lo), FUNCTION(x.hi)); \
>> >> >> >> +    return (RET_TYPE##8){ \
>> >> >> >> +      FUNCTION(x.s0), \
>> >> >> >> +      FUNCTION(x.s1), \
>> >> >> >> +      FUNCTION(x.s2), \
>> >> >> >> +      FUNCTION(x.s3), \
>> >> >> >> +      FUNCTION(x.s4), \
>> >> >> >> +      FUNCTION(x.s5), \
>> >> >> >> +      FUNCTION(x.s6), \
>> >> >> >> +      FUNCTION(x.s7), \
>> >> >> >> +    }; \
>> >> >> >>    } \
>> >> >> >>  \
>> >> >> >>    DECLSPEC RET_TYPE##16 FUNCTION(ARG1_TYPE##16 x) { \
>> >> >> >> -    return (RET_TYPE##16)(FUNCTION(x.lo), FUNCTION(x.hi)); \
>> >> >> >> +    return (RET_TYPE##16){ \
>> >> >> >> +      FUNCTION(x.s0), \
>> >> >> >> +      FUNCTION(x.s1), \
>> >> >> >> +      FUNCTION(x.s2), \
>> >> >> >> +      FUNCTION(x.s3), \
>> >> >> >> +      FUNCTION(x.s4), \
>> >> >> >> +      FUNCTION(x.s5), \
>> >> >> >> +      FUNCTION(x.s6), \
>> >> >> >> +      FUNCTION(x.s7), \
>> >> >> >> +      FUNCTION(x.s8), \
>> >> >> >> +      FUNCTION(x.s9), \
>> >> >> >> +      FUNCTION(x.sa), \
>> >> >> >> +      FUNCTION(x.sb), \
>> >> >> >> +      FUNCTION(x.sc), \
>> >> >> >> +      FUNCTION(x.sd), \
>> >> >> >> +      FUNCTION(x.se), \
>> >> >> >> +      FUNCTION(x.sf) \
>> >> >> >> +    }; \
>> >> >> >>    }
>> >> >> >>
>> >> >> >>  #define _CLC_BINARY_VECTORIZE(DECLSPEC, RET_TYPE, FUNCTION, ARG1_TYPE, ARG2_TYPE) \
>> >> >> >>    DECLSPEC RET_TYPE##2 FUNCTION(ARG1_TYPE##2 x, ARG2_TYPE##2 y) { \
>> >> >> >> -    return (RET_TYPE##2)(FUNCTION(x.x, y.x), FUNCTION(x.y, y.y)); \
>> >> >> >> +    return (RET_TYPE##2){FUNCTION(x.s0, y.s0), FUNCTION(x.s1, y.s1)}; \
>> >> >> >>    } \
>> >> >> >>  \
>> >> >> >>    DECLSPEC RET_TYPE##3 FUNCTION(ARG1_TYPE##3 x, ARG2_TYPE##3 y) { \
>> >> >> >> -    return (RET_TYPE##3)(FUNCTION(x.x, y.x), FUNCTION(x.y, y.y), \
>> >> >> >> -                         FUNCTION(x.z, y.z)); \
>> >> >> >> +    return (RET_TYPE##3){FUNCTION(x.s0, y.s0), FUNCTION(x.s1, y.s1), \
>> >> >> >> +                         FUNCTION(x.s2, y.s2)}; \
>> >> >> >>    } \
>> >> >> >>  \
>> >> >> >>    DECLSPEC RET_TYPE##4 FUNCTION(ARG1_TYPE##4 x, ARG2_TYPE##4 y) { \
>> >> >> >> -    return (RET_TYPE##4)(FUNCTION(x.lo, y.lo), FUNCTION(x.hi, y.hi)); \
>> >> >> >> +    return (RET_TYPE##4){ \
>> >> >> >> +      FUNCTION(x.s0, y.s0), \
>> >> >> >> +      FUNCTION(x.s1, y.s1), \
>> >> >> >> +      FUNCTION(x.s2, y.s2), \
>> >> >> >> +      FUNCTION(x.s3, y.s3), \
>> >> >> >> +    }; \
>> >> >> >>    } \
>> >> >> >>  \
>> >> >> >>    DECLSPEC RET_TYPE##8 FUNCTION(ARG1_TYPE##8 x, ARG2_TYPE##8 y) { \
>> >> >> >> -    return (RET_TYPE##8)(FUNCTION(x.lo, y.lo), FUNCTION(x.hi, y.hi)); \
>> >> >> >> +    return (RET_TYPE##8){ \
>> >> >> >> +      FUNCTION(x.s0, y.s0), \
>> >> >> >> +      FUNCTION(x.s1, y.s1), \
>> >> >> >> +      FUNCTION(x.s2, y.s2), \
>> >> >> >> +      FUNCTION(x.s3, y.s3), \
>> >> >> >> +      FUNCTION(x.s4, y.s4), \
>> >> >> >> +      FUNCTION(x.s5, y.s5), \
>> >> >> >> +      FUNCTION(x.s6, y.s6), \
>> >> >> >> +      FUNCTION(x.s7, y.s7), \
>> >> >> >> +    }; \
>> >> >> >>    } \
>> >> >> >>  \
>> >> >> >>    DECLSPEC RET_TYPE##16 FUNCTION(ARG1_TYPE##16 x, ARG2_TYPE##16 y) { \
>> >> >> >> -    return (RET_TYPE##16)(FUNCTION(x.lo, y.lo), FUNCTION(x.hi, y.hi)); \
>> >> >> >> +    return (RET_TYPE##16){ \
>> >> >> >> +      FUNCTION(x.s0, y.s0), \
>> >> >> >> +      FUNCTION(x.s1, y.s1), \
>> >> >> >> +      FUNCTION(x.s2, y.s2), \
>> >> >> >> +      FUNCTION(x.s3, y.s3), \
>> >> >> >> +      FUNCTION(x.s4, y.s4), \
>> >> >> >> +      FUNCTION(x.s5, y.s5), \
>> >> >> >> +      FUNCTION(x.s6, y.s6), \
>> >> >> >> +      FUNCTION(x.s7, y.s7), \
>> >> >> >> +      FUNCTION(x.s8, y.s8), \
>> >> >> >> +      FUNCTION(x.s9, y.s9), \
>> >> >> >> +      FUNCTION(x.sa, y.sa), \
>> >> >> >> +      FUNCTION(x.sb, y.sb), \
>> >> >> >> +      FUNCTION(x.sc, y.sc), \
>> >> >> >> +      FUNCTION(x.sd, y.sd), \
>> >> >> >> +      FUNCTION(x.se, y.se), \
>> >> >> >> +      FUNCTION(x.sf, y.sf) \
>> >> >> >> +    }; \
>> >> >> >>    }
>> >> >> >>
>> >> >> >>  #define _CLC_DEFINE_BINARY_BUILTIN(RET_TYPE, FUNCTION, BUILTIN, ARG1_TYPE, ARG2_TYPE) \
>> >> >> >> --
>> >> >> >> 1.9.1
>> >> >> >>
>> >> >> >>
>> >> >> >> _______________________________________________
>> >> >> >> Libclc-dev mailing list
>> >> >> >> Libclc-dev at pcc.me.uk
>> >> >> >> http://www.pcc.me.uk/cgi-bin/mailman/listinfo/libclc-dev