[llvm] [LoopVectorize] Don't scalarize predicated instruction with optsize (PR #129265)

John Brawn via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 12 06:42:51 PDT 2025


john-brawn-arm wrote:

> Do you have any performance results for benchmarks (aside from the issue described in #66652) when building with -Os? It would be interesting to see if there is any impact. I imagine there might be some improvements too?

Results for llvm-test-suite using -Os before and after this patch (run on my desktop with an Intel i9-13900K, best of three runs):
[size.txt](https://github.com/user-attachments/files/19210670/size.txt)
[exec_time.txt](https://github.com/user-attachments/files/19210671/exec_time.txt)
Some of the large exec_time increases are just measurement error (e.g. LoopInterleavingBenchmarks, where there's no change in the generated code), but some are genuine. Looking at the MemFunctions benchmark, we essentially have:
```
#include <cstddef>

using std::size_t;

void BM_MemCmp() {
  static constexpr size_t kMaxBufSizeBytes = 4096;
  constexpr size_t kNumElements = kMaxBufSizeBytes / 7;

  char p_storage[kNumElements * 7];
  char* p = p_storage;

  // Stride-7 store loop; the trip count is not a multiple of the vector
  // width, which is where tail folding vs. a scalar epilogue matters.
  for (size_t i = 0; i < kNumElements; ++i)
    *(p + i * 7) = 0xff;

  // Keep the stores from being optimized away.
  asm volatile("" : : "r,m"(p) : "memory");
}
```
Tail folding with scalarization is faster here, but larger. Using a scalar epilogue would have been both smaller and faster than that, so if we wanted to vectorize this at -Os the solution would be to use a scalar epilogue. (Actually, I've reduced this example too much and we don't vectorize it at all because it's judged not worth it, but I think the point still stands.) A sketch of the two strategies is below.
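To make the trade-off concrete, here is a minimal sketch (my own illustration, not LLVM's actual codegen) of the two strategies for a loop whose trip count is not a multiple of the vector width. Plain scalar loops stand in for the vector operations, and `VF` is an assumed width of 4:

```
#include <cstddef>

using std::size_t;

constexpr size_t VF = 4; // assumed vector width, for illustration only

// Tail folding: a single loop runs a full vector iteration every time and
// predicates away the lanes past the end. When the predicated stores are
// scalarized, each lane becomes a compare-and-branch.
void tail_folded(char *p, size_t n) {
  for (size_t base = 0; base < n; base += VF)
    for (size_t lane = 0; lane < VF; ++lane)
      if (base + lane < n)              // per-lane predicate
        *(p + (base + lane) * 7) = 0xff;
}

// Scalar epilogue: an unpredicated vector body handles the bulk, and a
// short scalar loop finishes the remainder.
void scalar_epilogue(char *p, size_t n) {
  size_t i = 0;
  for (; i + VF <= n; i += VF)          // vector body, no masking needed
    for (size_t lane = 0; lane < VF; ++lane)
      *(p + (i + lane) * 7) = 0xff;
  for (; i < n; ++i)                    // scalar remainder
    *(p + i * 7) = 0xff;
}
```

The per-lane compare-and-branch in the tail-folded form is where the extra size comes from, whereas the scalar-epilogue form keeps the vector body unpredicated and pays only for a short remainder loop.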

https://github.com/llvm/llvm-project/pull/129265
