<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/59274>59274</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
CPU2000/171.swim performance regression on aarch64 after D137580
</td>
</tr>
<tr>
<th>Labels</th>
<td>
performance,
vectorization
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
vzakhari
</td>
</tr>
</table>
<pre>
With https://reviews.llvm.org/D137580, Flang started propagating all fast-math flags to LLVM (before the change, Flang passed only `ninf` and `contract`).
The benchmark used to run for about 26 seconds; after the change it takes about 28 seconds on Ampere Altra, a slowdown of about 7.5%.
perf identified the following difference in `_QQmain`:
<table>
<tr>
<td>before</td>
<td>after</td>
</tr>
<tr>
<td>
```
Children Self Samples Command Shared Object Symbol
+ 40.76% 40.55% 42639 swim swim [.] _QQmain
+ 22.55% 22.55% 23718 swim swim [.] calc2_
+ 19.14% 19.13% 20115 swim swim [.] calc1_
+ 17.30% 17.30% 18184 swim swim [.] calc3_
```
</td>
<td>
```
Children Self Samples Command Shared Object Symbol
44.37% 44.15% 50484 swim swim [.] _QQmain
+ 21.32% 21.31% 24375 swim swim [.] calc2_
+ 17.83% 17.82% 20378 swim swim [.] calc1_
16.27% 16.27% 18601 swim swim [.] calc3_
```
</td>
</tr>
<tr>
<td>
```
│310:┌─→ldr d0, [x14]
10365 │ │ add x14, x14, x29
│ │ ldr d1, [x15]
11187 │ │ add x15, x15, x29
│ │ ldr d2, [x16]
19118 │ │ add x16, x16, x29
│ │ fabs d0, d0
420 │ │ subs x13, x13, #0x1
│ │ fabs d1, d1
412 │ │ fabs d2, d2
│ │ fadd d10, d10, d0
537 │ │ fadd d9, d9, d1
│ │ fadd d8, d8, d2
539 │ └──b.ne 310
```
</td>
<td>
```
181 │3e0:┌─→ldr d3, [x17]
3895 │ │ subs x1, x1, #0x2
114 │ │ ldr d4, [x18]
5771 │ │ add x18, x18, x25
61 │ │ ldr d5, [x0]
13664 │ │ add x17, x17, x25
124 │ │ ldr d6, [x0, #10680]
11955 │ │ fabs d3, d3
163 │ │ ldr d7, [x2]
7542 │ │ fabs d4, d4
80 │ │ ldr d16, [x3]
5377 │ │ fabs d5, d5
46 │ │ fabs d6, d6
96 │ │ add x3, x3, x25
39 │ │ fabs d7, d7
205 │ │ fadd d10, d3, d10
135 │ │ fabs d16, d16
208 │ │ fadd d2, d4, d2
85 │ │ fadd d9, d5, d9
141 │ │ add x2, x2, x25
56 │ │ fadd d1, d6, d1
68 │ │ add x0, x0, x25
49 │ │ fadd d8, d7, d8
214 │ │ fadd d0, d16, d0
163 │ └──b.ne 3e0
```
</td>
</tr>
</table>
The difference is caused by `LoopVectorizePass`, which unrolls (interleaves) the loop by 2 but ends up not vectorizing it.
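For illustration only, the transformation seen in the disassembly can be sketched in C. This is not the benchmark source: the function names, the `stride` parameter, and the single-array shape are assumptions made for the sketch (the real loop carries three such reductions at once). The first function has the one-accumulator shape of the "before" loop; the second has the two-accumulator shape of the "after" loop. Splitting a floating-point sum this way changes the order of additions, so it is presumably only permitted once the reassociation flag that comes with `fast` is present.

```c
#include <math.h>
#include <stddef.h>

/* One accumulator per reduction: the shape of the "before" loop
   (one load, one fabs, one fadd per element). */
double sum_abs(const double *a, size_t n, size_t stride) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += fabs(a[i * stride]);
    return acc;
}

/* The same reduction interleaved by 2: two independent accumulators
   that are combined after the loop, as in the "after" loop. The
   additions happen in a different order than in sum_abs. */
double sum_abs_interleaved2(const double *a, size_t n, size_t stride) {
    double acc0 = 0.0, acc1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += fabs(a[i * stride]);
        acc1 += fabs(a[(i + 1) * stride]);
    }
    if (i < n) /* leftover element when n is odd */
        acc0 += fabs(a[i * stride]);
    return acc0 + acc1;
}
```

With three reductions in the loop, interleaving by 2 gives six loads, six `fabs`, and six `fadd` per iteration, which matches the instruction mix in the "after" listing.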
The attached files provide LLVM IR for `_QQmain`:
* [main.ll.gz](https://github.com/llvm/llvm-project/files/10128447/main.ll.gz) - original IR with `fast`
* [main_nofast.ll.gz](https://github.com/llvm/llvm-project/files/10128448/main_nofast.ll.gz) - modified IR with `fast` replaced by `ninf contract` just for this loop; this restores performance to 26 seconds.
The vectorizer behavior may be reproduced with: `clang -cc1 -triple aarch64-unknown-linux-gnu -emit-obj --mrelax-relocations -disable-free -clear-ast-before-backend -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -menable-no-infs -menable-no-nans -fapprox-func -funsafe-math-optimizations -fno-signed-zeros -mreassociate -freciprocal-math -ffp-contract=fast -fno-rounding-math -ffast-math -ffinite-math-only -mconstructor-aliases -funwind-tables=2 -target-cpu generic -target-feature +neon -target-feature +v8.2a -target-abi aapcs -mllvm -treat-scalable-fixed-error-as-warning -debugger-tuning=gdb -v -Ofast -ferror-limit 19 -fopenmp -fno-signed-char -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o main.o -x ir main.ll`
@kiranchandramohan, can you please take a look? Is there something obviously wrong with the generated code?
</pre>