<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/59274>59274</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            CPU2000/171.swim performance regression on aarch64 after D137580
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            performance,
            vectorization
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          vzakhari
      </td>
    </tr>
</table>

<pre>
With https://reviews.llvm.org/D137580, Flang started propagating all fast-math flags to LLVM (before the change, Flang passed only `ninf` and `contract`).

The benchmark used to run for about 26 seconds, and after the change it takes about 28 seconds on Ampere Altra, a slowdown of about 7.5%.

perf identified the following difference in `_QQmain`:
<table>
<tr>
<td>before</td>
<td>after</td>
</tr>
<tr>
<td>

```
  Children      Self       Samples  Command  Shared Object       Symbol
+   40.76%    40.55%         42639  swim     swim                [.] _QQmain
+   22.55%    22.55%         23718  swim     swim                [.] calc2_
+   19.14%    19.13%         20115  swim     swim                [.] calc1_
+   17.30%    17.30%         18184  swim     swim                [.] calc3_
```

</td>
<td>

```
  Children      Self       Samples  Command  Shared Object       Symbol
    44.37%    44.15%         50484  swim     swim                [.] _QQmain
+   21.32%    21.31%         24375  swim     swim                [.] calc2_
+   17.83%    17.82%         20378  swim     swim                [.] calc1_
    16.27%    16.27%         18601  swim     swim                [.] calc3_
```

</td>
</tr>
<tr>
<td>

```
       │310:┌─→ldr   d0, [x14]
 10365 │    │  add   x14, x14, x29
       │    │  ldr   d1, [x15]
 11187 │    │  add   x15, x15, x29
       │    │  ldr   d2, [x16]
 19118 │    │  add   x16, x16, x29
       │    │  fabs  d0, d0
   420 │    │  subs  x13, x13, #0x1
       │    │  fabs  d1, d1
   412 │    │  fabs  d2, d2
       │    │  fadd  d10, d10, d0
   537 │    │  fadd  d9, d9, d1
       │    │  fadd  d8, d8, d2
   539 │    └──b.ne  310
```

</td>
<td>

```

   181 │3e0:┌─→ldr   d3, [x17]
  3895 │    │  subs  x1, x1, #0x2
   114 │    │  ldr   d4, [x18]
  5771 │    │  add   x18, x18, x25
    61 │    │  ldr   d5, [x0]
 13664 │    │  add   x17, x17, x25
   124 │    │  ldr   d6, [x0, #10680]
 11955 │    │  fabs  d3, d3
   163 │    │  ldr   d7, [x2]
  7542 │    │  fabs  d4, d4
    80 │    │  ldr   d16, [x3]
  5377 │    │  fabs  d5, d5
    46 │    │  fabs  d6, d6
    96 │    │  add   x3, x3, x25
    39 │    │  fabs  d7, d7
   205 │    │  fadd  d10, d3, d10
   135 │    │  fabs  d16, d16
   208 │    │  fadd  d2, d4, d2
    85 │    │  fadd  d9, d5, d9
   141 │    │  add   x2, x2, x25
    56 │    │  fadd  d1, d6, d1
    68 │    │  add   x0, x0, x25
    49 │    │  fadd  d8, d7, d8
   214 │    │  fadd  d0, d16, d0
   163 │    └──b.ne  3e0
```

</td>
</tr>
</table>

The difference is caused by `LoopVectorizePass`, which unrolls (interleaves) the loop by 2 but ends up not vectorizing it.
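
As a rough illustration of what unrolling by 2 does to one of the three reductions, here is a minimal Python sketch. It assumes the loop is a strided sum of absolute values, which is what the `ldr`/`fabs`/`fadd` pattern in the disassembly suggests; the function names and parameters are hypothetical and do not come from the benchmark source. The second form adds the elements in a different order, so it is only a legal rewrite of the first when floating-point reassociation is allowed, which would be consistent with the loop changing once the full `fast` flags are passed.

```python
import math


def sum_abs(values, start, stride, count):
    """One accumulator per array: the shape of the 'before' loop."""
    acc = 0.0
    for k in range(count):
        acc += abs(values[start + k * stride])
    return acc


def sum_abs_interleaved2(values, start, stride, count):
    """Two accumulators per array: the shape of the 'after' loop.

    Even-numbered and odd-numbered iterations go to separate partial sums
    that are combined once after the loop, so the additions are reordered.
    """
    acc0 = 0.0
    acc1 = 0.0
    for k in range(count // 2):
        acc0 += abs(values[start + (2 * k) * stride])
        acc1 += abs(values[start + (2 * k + 1) * stride])
    if count % 2:
        # Leftover iteration when the trip count is odd.
        acc0 += abs(values[start + (count - 1) * stride])
    return acc0 + acc1
```

Both functions return the same value in exact arithmetic; with floating-point inputs they may differ in the last bits, so a comparison such as `math.isclose(sum_abs(v, 0, 3, 13), sum_abs_interleaved2(v, 0, 3, 13))` is the appropriate check.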

The attached files provide LLVM IR for `_QQmain`:
* [main.ll.gz](https://github.com/llvm/llvm-project/files/10128447/main.ll.gz) - original IR with `fast`
* [main_nofast.ll.gz](https://github.com/llvm/llvm-project/files/10128448/main_nofast.ll.gz) - modified IR with `fast` replaced by `ninf contract` just for this loop; this restores the run time to about 26 seconds.

The vectorizer behavior may be reproduced with: `clang -cc1 -triple aarch64-unknown-linux-gnu -emit-obj --mrelax-relocations -disable-free -clear-ast-before-backend -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -menable-no-infs -menable-no-nans -fapprox-func -funsafe-math-optimizations -fno-signed-zeros -mreassociate -freciprocal-math -ffp-contract=fast -fno-rounding-math -ffast-math -ffinite-math-only -mconstructor-aliases -funwind-tables=2 -target-cpu generic -target-feature +neon -target-feature +v8.2a -target-abi aapcs -mllvm -treat-scalable-fixed-error-as-warning -debugger-tuning=gdb -v -Ofast -ferror-limit 19 -fopenmp -fno-signed-char -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o main.o -x ir main.ll`

@kiranchandramohan, can you please take a look?  Is there something obviously wrong with the generated code?
</pre>