[PATCH] D131028: [AArch64] Fix cost model for FADD vector reduction

Fri Aug 5 14:14:15 PDT 2022

fhahn added a comment.

In D131028#3702132 <https://reviews.llvm.org/D131028#3702132>, @dmgreen wrote:

>> Yeah this is an unfortunate potential impact on the SLP vectorizer :(
>>
>> I doubt the improved costs here should make things *much* worse in practice and we already have the same issue with integer `add` reduction and `mla` IIUC.  Should any negative impact materialize, I think we should work around an SLP issue in the SLPVectorizer directly, rather than through artificially inflating costs in TTI.
>>
>> It might also increase the incentives to properly addressing the issue :)
>>
>> The motivating use case for those improvements is using more accurate costs in other passes, like D131125 <https://reviews.llvm.org/D131125>
>
> Yeah - I worry that this might come up quite a lot. Adding floats together is pretty common, and multiplying them beforehand seems just as prevalent. I have this example, although it's maybe a little odd due to the extra shuffling in the loop: https://godbolt.org/z/3oqT1b58f.

Thanks for sharing the example! In this particular case, with the patch applied we use a vector fmul feeding an fadd reduction, but at first glance this doesn't seem worse and may even be slightly better overall. Here's the diff between the example without and with the patch (generated by `diff base.s patch.s`):

  diff  a.s b.s
  17,21c17,19
  < 	ldp	s0, s1, [x10]
  < 	ldp	s2, s3, [x10, #8]
  < 	ldr	s4, [x10, #16]
  < 	ldp	s6, s18, [x11]
  < 	ldp	s5, s17, [x11, #8]
  ---
  > 	ldr	s1, [x10]
  > 	ldur	q2, [x10, #4]
  > 	ldr	q5, [x11]
  25c23
  < 	fmov	s7, s6
  ---
  > 	mov.16b	v0, v5
  27,34c25,32
  < 	ldr	s6, [x1, x13]
  < 	fmov	s16, s5
  < 	fmul	s5, s7, s1
  < 	fmadd	s5, s18, s2, s5
  < 	fmadd	s5, s16, s3, s5
  < 	fmadd	s5, s17, s4, s5
  < 	fmadd	s5, s6, s0, s5
  < 	str	s5, [x2, x13]
  ---
  > 	ldr	s4, [x1, x13]
  > 	fmul.4s	v3, v5, v2
  > 	faddp.4s	v3, v3, v3
  > 	faddp.2s	s3, v3
  > 	fmadd	s3, s4, s1, s3
  > 	str	s3, [x2, x13]
  > 	trn1.4s	v5, v4, v5
  > 	mov.s	v5[2], v3[0]
  36,37d33
  < 	fmov	s17, s16
  < 	fmov	s18, s7
  43,46c39,41
  < 	str	s6, [x11]
  < 	str	s7, [x11, #4]
  < 	str	s5, [x11, #8]
  < 	str	s16, [x11, #12]
  ---
  > 	stp	s4, s0, [x11]
  > 	add	x12, x11, #12
  > 	str	s3, [x11, #8]
  47a43
  > 	st1.s	{ v0 }[2], [x12]
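
For readers less familiar with the two code shapes in the diff, here is a rough C++ sketch of what they compute. It is not the source behind the godbolt link; the function names and the fixed width of 4 are made up for illustration. The first function mirrors the scalar fmul/fmadd chain, the second mirrors `fmul.4s` followed by the two `faddp` steps.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Scalar shape (hypothetical helper): one multiply, then a serial chain of
// multiply-adds, each depending on the previous accumulator value.
float dot_scalar_chain(const Vec4 &a, const Vec4 &b) {
  float acc = a[0] * b[0];
  for (std::size_t i = 1; i < a.size(); ++i)
    acc = acc + a[i] * b[i];
  return acc;
}

// Vector shape (hypothetical helper): lane-wise multiply, then a pairwise
// reduction. The first pairwise add yields (p0+p1) and (p2+p3); the second
// adds those two partial sums.
float dot_mul_then_reduce(const Vec4 &a, const Vec4 &b) {
  Vec4 prod{};
  for (std::size_t i = 0; i < a.size(); ++i)
    prod[i] = a[i] * b[i];
  float lo = prod[0] + prod[1];
  float hi = prod[2] + prod[3];
  return lo + hi;
}
```

Note that the two functions add the products in a different order, so for general floating-point inputs the results can differ in the last bits; the reduction form is only a valid replacement when reassociation is allowed.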

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131028/new/

https://reviews.llvm.org/D131028