[PATCH] D145578: [AArch64] Cost-model vector splat LD1Rs to avoid unprofitable SLP vectorisation
Sjoerd Meijer via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Mar 8 07:16:40 PST 2023
SjoerdMeijer added inline comments.
================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll:86
define void @slp_profitable_missing_fmf_on_fadd_fsub(ptr %A, ptr %B) {
; CHECK-LABEL: @slp_profitable_missing_fmf_on_fadd_fsub(
; CHECK-NEXT: [[GEP_B_1:%.*]] = getelementptr inbounds float, ptr [[B:%.*]], i64 1
----------------
ABataev wrote:
> SjoerdMeijer wrote:
> > SjoerdMeijer wrote:
> > > Ah, only after uploading this diff did I notice that the function names indicate that this should be profitable... I had missed that.
> > >
> > > Hmmm.... I guess that then needs looking into.
> > Eyeballing this, my first reaction is that I slightly doubt that SLP will be profitable, but I guess that's what I need to find out.
> Working on fma vectorization support in SLP, hope to spend more time on this later this month.
It's difficult to see how the SLP variant is ever going to be faster for this example: there are just a handful of scalar instructions (ignoring the loads/stores), and they mostly overlap in dispatch and execution:
[0,3] D======eeeER .. fmul s4, s1, s0
[0,4] D======eeeER .. fmul s1, s2, s1
[0,5] D=======eeeER .. fmul s5, s3, s2
[0,6] D=======eeeER .. fmul s0, s3, s0
[0,7] D==========eeER.. fsub s2, s4, s5
[0,8] D==========eeER.. fadd s0, s0, s1
especially if we have to do things like shuffles:
[0,2] D==eeeeeeeeER . . . .. ld1r.2s { v1 }, [x8], #4
[0,3] D==========eeeeeeER . . .. ldr d2, [x8]
[0,4] D================eeeER . .. fmul.2s v1, v1, v2
[0,5] D================eeeER . .. fmul.2s v0, v2, v0[0]
[0,6] D===================eeER . .. rev64.2s v1, v1
[0,7] D=====================eeER .. fsub.2s v3, v0, v1
[0,8] .D====================eeER .. fadd.2s v0, v0, v1
[0,9] .D======================eeER .. mov.s v3[1], v0[1]
[0,10] .D========================eeER.. str d3, [x0]
[0,11] .D========================eeeeER st1.s { v2 }[1], [x1]
Here I am showing some loads/stores, but only to show that they are no longer simple loads/stores but higher-latency instructions. Perhaps more importantly, we have gained the REV and the lane insert, so while things might look a bit better with FMAs, it's difficult to beat the scalar variant. The SLP timeline is slightly skewed by the post-increment and the result being available a lot earlier, but the bottom line is that there is very little parallelism here: we are only working on 2 floats, and there is the overhead of the vectorisation.
I have run some micro-benchmarks, and I've measured that the SLP variant is indeed slower.
@fhahn , @dmgreen : I think it makes sense to also not SLP vectorise this function (and the other 2 below). Do you agree?
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D145578/new/
https://reviews.llvm.org/D145578