[llvm-dev] masked-load endpoints optimization

Sanjay Patel via llvm-dev llvm-dev at lists.llvm.org
Mon Mar 14 10:06:38 PDT 2016


I checked in a patch to do this transform for x86-only for now:
http://reviews.llvm.org/D18094 / http://reviews.llvm.org/rL263446

On Fri, Mar 11, 2016 at 9:57 AM, Sanjay Patel <spatel at rotateright.com>
wrote:

> Thanks, Ashutosh.
>
> Yes, either TTI or TLI could be used to limit the transform if we do it in
> CGP rather than the DAG.
>
> The real question I have is whether it is legal to read the extra memory,
> regardless of whether this is a masked load or something else.
>
> Note that the x86 backend already does this, so either my proposal is ok
> for x86, or we're already doing an illegal optimization:
>
> define <4 x i32> @load_bonus_bytes(i32* %addr1, <4 x i32> %v) {
>   %ld1 = load i32, i32* %addr1
>   %addr2 = getelementptr i32, i32* %addr1, i64 3
>   %ld2 = load i32, i32* %addr2
>   %vec1 = insertelement <4 x i32> undef, i32 %ld1, i32 0
>   %vec2 = insertelement <4 x i32> %vec1, i32 %ld2, i32 3
>   ret <4 x i32> %vec2
> }
>
> $ ./llc -o - loadcombine.ll
> ...
>     movups    (%rdi), %xmm0
>     retq
>
> On Thu, Mar 10, 2016 at 10:22 PM, Nema, Ashutosh <Ashutosh.Nema at amd.com>
> wrote:
>
>> This looks interesting; the main motivation appears to be replacing a
>> masked vector load with a general vector load followed by a select.
>>
>> We have observed that masked vector loads are in general expensive in
>> comparison with a regular vector load.
>>
>> But if the first and last elements of a masked vector load are guaranteed
>> to be accessed, then it can be transformed into a regular vector load.
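>>
>> [Editor's sketch of that precondition as code: a hypothetical helper
>> written against the C++ API of the time, not taken from any actual patch:]
>>
>> #include "llvm/IR/Constant.h"
>> #include "llvm/IR/DerivedTypes.h"
>> using namespace llvm;
>>
>> // Return true if a constant mask guarantees that the first and last
>> // lanes are accessed; this is the precondition for widening the masked
>> // load to a regular vector load.
>> static bool endpointsAreAccessed(const Constant *Mask) {
>>   unsigned NumElts = Mask->getType()->getVectorNumElements();
>>   const Constant *First = Mask->getAggregateElement(0u);
>>   const Constant *Last = Mask->getAggregateElement(NumElts - 1);
>>   return First && Last && First->isOneValue() && Last->isOneValue();
>> }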
>>
>> In opt, this can be driven by TTI, which should check the benefit of the
>> transformation.
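>>
>> [Editor's sketch of how that TTI check might look. getMemoryOpCost,
>> getMaskedMemoryOpCost, and getCmpSelInstrCost are the existing TTI hooks;
>> the surrounding helper is illustrative only:]
>>
>> #include "llvm/Analysis/TargetTransformInfo.h"
>> #include "llvm/IR/Instruction.h"
>> using namespace llvm;
>>
>> // Widening pays off when a plain vector load plus the select needed to
>> // merge in the passthru value is cheaper than the masked load.
>> static bool widenIsProfitable(const TargetTransformInfo &TTI, Type *VecTy,
>>                               unsigned Alignment, unsigned AddrSpace) {
>>   int MaskedCost = TTI.getMaskedMemoryOpCost(Instruction::Load, VecTy,
>>                                              Alignment, AddrSpace);
>>   int LoadCost =
>>       TTI.getMemoryOpCost(Instruction::Load, VecTy, Alignment, AddrSpace);
>>   int SelectCost = TTI.getCmpSelInstrCost(Instruction::Select, VecTy);
>>   return LoadCost + SelectCost < MaskedCost;
>> }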
>>
>> Regards,
>>
>> Ashutosh
>>
>> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
>> Sanjay Patel via llvm-dev
>> Sent: Friday, March 11, 2016 3:37 AM
>> To: llvm-dev
>> Subject: [llvm-dev] masked-load endpoints optimization
>>
>> If we're loading the first and last elements of a vector using a masked
>> load [1], can we replace the masked load with a full vector load?
>>
>> "The result of this operation is equivalent to a regular vector load
>> instruction followed by a ‘select’ between the loaded and the passthru
>> values, predicated on the same mask. However, using this intrinsic prevents
>> exceptions on memory access to masked-off lanes."
>>
>> I think the fact that we're loading the endpoints of the vector
>> guarantees that a full vector load can't have any different
>> faulting/exception behavior on x86 and most (?) other targets. We would,
>> however, be reading memory that the program has not explicitly requested.
>>
>> IR example:
>>
>> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
>>   ; load the first and last elements pointed to by %addr
>>   ; and shuffle those into %v
>>   %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4,
>>                                                 <4 x i1> <i1 1, i1 0, i1 0, i1 1>,
>>                                                 <4 x i32> %v)
>>   ret <4 x i32> %res
>> }
>>
>> would become something like:
>>
>> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
>>   %vecload = load <4 x i32>, <4 x i32>* %addr, align 4
>>   %sel = select <4 x i1> <i1 1, i1 0, i1 0, i1 1>,
>>                 <4 x i32> %vecload, <4 x i32> %v
>>   ret <4 x i32> %sel
>> }
>>
>> If this isn't valid as an IR optimization, would it be acceptable as a
>> DAG combine with a target hook to opt in?
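>>
>> [Editor's sketch of one possible shape for that opt-in; the hook name is
>> invented for illustration and does not exist in TargetLowering:]
>>
>> #include "llvm/CodeGen/ValueTypes.h"
>> using namespace llvm;
>>
>> // Hypothetical hook, as it might appear in a target's lowering class:
>> // return true when a full-width load plus a blend is expected to be at
>> // least as cheap as the masked load, allowing the combine to fire.
>> struct MyTargetLowering /* : TargetLowering */ {
>>   virtual bool isMaskedLoadWideningProfitable(EVT VT) const {
>>     return VT.isVector() && VT.getSizeInBits() <= 128; // e.g. SSE width
>>   }
>> };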
>>
>> [1] http://llvm.org/docs/LangRef.html#llvm-masked-load-intrinsics
>>
>
>