[llvm-dev] masked-load endpoints optimization

Fri Mar 11 08:57:10 PST 2016

Thanks, Ashutosh.

Yes, either TTI or TLI could be used to limit the transform if we do it in
CGP rather than the DAG.

The real question I have is whether it is legal to read the extra memory,
regardless of whether this is a masked load or something else.
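
The faulting half of that question can be sketched in C. This is a model, not anything in LLVM: it assumes protection is enforced per page, that a page is 4096 bytes, and that the vector is no larger than a page. The helper name is made up for illustration. It checks that every byte of the vector sits on a page already touched by the first or last element:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: protection is enforced per page; 4096 is an illustrative size. */
#define PAGE_SIZE 4096u

static uintptr_t page_of(uintptr_t addr) { return addr / PAGE_SIZE; }

/* Hypothetical helper. Returns 1 if every byte of a vector of `lanes`
   elements of `elem_size` bytes starting at `base` lies on a page that
   the first or last element already touches, 0 otherwise. */
int extra_bytes_stay_on_touched_pages(uintptr_t base, size_t elem_size,
                                      size_t lanes)
{
    assert(elem_size > 0 && lanes > 0);
    uintptr_t total = (uintptr_t)(elem_size * lanes);
    uintptr_t last = base + total - elem_size; /* address of last element */

    for (uintptr_t off = 0; off < total; ++off) {
        uintptr_t p = page_of(base + off);
        int touched = 0;
        /* Compare against every byte of the first and last elements. */
        for (uintptr_t b = 0; b < elem_size; ++b) {
            if (p == page_of(base + b) || p == page_of(last + b))
                touched = 1;
        }
        if (!touched)
            return 0;
    }
    return 1;
}
```

For a <4 x i32> (elem_size 4, lanes 4) this holds at every base address near a page boundary. It says nothing about whether the extra read is allowed by the memory model, or about tools that track accesses at finer than page granularity.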

Note that the x86 backend already does this, so either my proposal is ok
for x86, or we're already doing an illegal optimization:

define <4 x i32> @load_bonus_bytes(i32* %addr1, <4 x i32> %v) {
  %ld1 = load i32, i32* %addr1
  %addr2 = getelementptr i32, i32* %addr1, i64 3
  %ld2 = load i32, i32* %addr2
  %vec1 = insertelement <4 x i32> undef, i32 %ld1, i32 0
  %vec2 = insertelement <4 x i32> %vec1, i32 %ld2, i32 3
  ret <4 x i32> %vec2
}

$ ./llc -o - loadcombine.ll
...
    movups    (%rdi), %xmm0
    retq
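
A scalar C model of what happens there, as a rough sketch: the struct and function names are invented for illustration, and the undef lanes are modeled with an arbitrary filler value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { int32_t lane[4]; } v4i32;

/* Mirrors the IR above: only lanes 0 and 3 are loaded. Lanes 1 and 2 are
   undef in the IR; `filler` stands in for "any value". */
v4i32 load_endpoints(const int32_t *addr, int32_t filler)
{
    assert(addr != NULL);
    v4i32 v = { { filler, filler, filler, filler } };
    v.lane[0] = addr[0];
    v.lane[3] = addr[3];
    return v;
}

/* Models the single 16-byte unaligned load (movups): reads addr[0..3]. */
v4i32 load_full(const int32_t *addr)
{
    assert(addr != NULL);
    v4i32 v;
    memcpy(v.lane, addr, sizeof v.lane);
    return v;
}

/* Two results are interchangeable if they agree on the defined lanes. */
int same_defined_lanes(v4i32 a, v4i32 b)
{
    return a.lane[0] == b.lane[0] && a.lane[3] == b.lane[3];
}
```

The two versions always agree on lanes 0 and 3, so the result is fine as far as values go. The difference is that load_full also reads addr[1] and addr[2], which the source never asked for.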

On Thu, Mar 10, 2016 at 10:22 PM, Nema, Ashutosh <Ashutosh.Nema at amd.com>
wrote:

> This looks interesting. The main motivation appears to be replacing a masked
> vector load with a general vector load followed by a select.
>
> Masked vector loads are observed to be generally more expensive than a
> plain vector load.
>
> But if the first and last elements of a masked vector load are guaranteed to
> be accessed, then it can be transformed into a full vector load.
>
> In opt, this can be driven by TTI, where the benefit of this transformation
> should be checked.
>
> Regards,
> Ashutosh
>
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> Sanjay Patel via llvm-dev
> Sent: Friday, March 11, 2016 3:37 AM
> To: llvm-dev
> Subject: [llvm-dev] masked-load endpoints optimization
>
> If we're loading the first and last elements of a vector using a masked
> load [1], can we replace the masked load with a full vector load?
>
> "The result of this operation is equivalent to a regular vector load
> instruction followed by a ‘select’ between the loaded and the passthru
> values, predicated on the same mask. However, using this intrinsic prevents
> exceptions on memory access to masked-off lanes."
>
> I think the fact that we're loading the endpoints of the vector guarantees
> that a full vector load can't have any different faulting/exception
> behavior on x86 and most (?) other targets. We would, however, be reading
> memory that the program has not explicitly requested.
>
> IR example:
>
> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
>   ; load the first and last elements pointed to by %addr and shuffle
>   ; those into %v
>   %res = call <4 x i32> @llvm.masked.load.v4i32(<4 x i32>* %addr, i32 4,
>              <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %v)
>   ret <4 x i32> %res
> }
>
> would become something like:
>
>
> define <4 x i32> @maskedload_endpoints(<4 x i32>* %addr, <4 x i32> %v) {
>   %vecload = load <4 x i32>, <4 x i32>* %addr, align 4
>   %sel = select <4 x i1> <i1 1, i1 0, i1 0, i1 1>, <4 x i32> %vecload,
>                 <4 x i32> %v
>   ret <4 x i32> %sel
> }
>
> If this isn't valid as an IR optimization, would it be acceptable as a DAG
> combine with target hook to opt in?
>
>
> [1] http://llvm.org/docs/LangRef.html#llvm-masked-load-intrinsics
>