[PATCH] D145578: [AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation
Dave Green via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Mar 10 02:41:35 PST 2023
dmgreen added a reviewer: vporpo.
dmgreen added a comment.
Hello. I had to remind myself where this came from. It looks like it was introduced in D123638 <https://reviews.llvm.org/D123638>, and there were some comments already about the performance not always being ideal. It apparently helped for some <2 x double> vectorization. I'm not sure there is a perfect answer, but an effective cost of 2 for the throughput of the ld1r would seem to match the hardware better. This doesn't alter isLegalBroadcastLoad, and the tests added in D123638 <https://reviews.llvm.org/D123638> don't seem to change.
The cost of a broadcast is already 1, so the code here doesn't seem like it would do much any more. It could check the cost kind and return 0 for code-size and 1 for throughput; otherwise this bit of code could probably be removed.
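For reference, a rough sketch of the cost-kind check I have in mind, assuming the existing broadcast-of-load special case in AArch64TTIImpl::getShuffleCost (the IsLoad/LT/CostKind names are taken from how I remember the surrounding code, so treat this as an untested sketch rather than the actual patch):

  // Sketch only: keep the LD1R special case, but make it cost-kind aware
  // instead of unconditionally returning 0.
  if (IsLoad && LT.second.isVector() &&
      isLegalBroadcastLoad(Tp->getElementType(),
                           LT.second.getVectorElementCount()))
    return CostKind == TTI::TCK_CodeSize ? 0 : 1;

That would keep the broadcast "free" for size costs while still charging something for throughput.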
================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll:86
define void @slp_profitable_missing_fmf_on_fadd_fsub(ptr %A, ptr %B) {
; CHECK-LABEL: @slp_profitable_missing_fmf_on_fadd_fsub(
; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1
----------------
SjoerdMeijer wrote:
> ABataev wrote:
> > SjoerdMeijer wrote:
> > > SjoerdMeijer wrote:
> > > > Ah, only after uploading this diff did I notice that the function names indicate that this should be profitable... I had missed that.
> > > >
> > > > Hmmm.... I guess that then needs looking into.
> > > Eyeballing this, my first reaction is that I slightly doubt that SLP will be profitable, but I guess that's what I need to find out.
> > Working on fma vectorization support in SLP, hope to spend more time on this later this month.
> It's difficult to see how the SLP variant is ever going to be faster for this example with just a handful of scalar instructions (ignoring the loads/stores) that mostly overlap in dispatch and execution:
>
> [0,3] D======eeeER .. fmul s4, s1, s0
> [0,4] D======eeeER .. fmul s1, s2, s1
> [0,5] D=======eeeER .. fmul s5, s3, s2
> [0,6] D=======eeeER .. fmul s0, s3, s0
> [0,7] D==========eeER.. fsub s2, s4, s5
> [0,8] D==========eeER.. fadd s0, s0, s1
>
> especially if we have to do things like shuffles:
>
> [0,2] D==eeeeeeeeER . . . .. ld1r.2s { v1 }, [x8], #4
> [0,3] D==========eeeeeeER . . .. ldr d2, [x8]
> [0,4] D================eeeER . .. fmul.2s v1, v1, v2
> [0,5] D================eeeER . .. fmul.2s v0, v2, v0[0]
> [0,6] D===================eeER . .. rev64.2s v1, v1
> [0,7] D=====================eeER .. fsub.2s v3, v0, v1
> [0,8] .D====================eeER .. fadd.2s v0, v0, v1
> [0,9] .D======================eeER .. mov.s v3[1], v0[1]
> [0,10] .D========================eeER.. str d3, [x0]
> [0,11] .D========================eeeeER st1.s { v2 }[1], [x1]
>
> Here I am showing some loads/stores, but only to show that they are no longer simple loads/stores but higher-latency instructions, and perhaps more importantly we have gained the REV and the extract. With FMAs things might look a bit better, but it is still difficult to beat the scalar variant. The SLP timeline is a little skewed because of the post-increment and its result being available a lot earlier, but the bottom line is that there is very little parallelism here: we are working on just 2 floats, and there is the overhead of the vectorisation.
>
> I have run some micro-benchmarks, and I've measured that the SLP variant is indeed slower.
>
> @fhahn , @dmgreen : I think it makes sense to also not SLP vectorise this function (and the other 2 below). Do you agree?
I think we will fold the dup into the fmul as opposed to the load now, which seems a little cheaper. https://godbolt.org/z/4aeab8Pas
I'm not sure if that makes it cheaper overall though. I agree that the rev and the mov make the timing tight. From the look of the test, "profitable" here just means that the non-fast version would be slightly more instructions, not that it was known to be profitable; i.e. the test is for fma combining, and this was just a negative test for that issue.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D145578/new/
https://reviews.llvm.org/D145578