[llvm] [LoopVectorize] Don't scalarize predicated instruction with optsize (PR #129265)
John Brawn via llvm-commits
llvm-commits at lists.llvm.org
Wed Mar 12 06:42:51 PDT 2025
john-brawn-arm wrote:
> Do you have any performance results for benchmarks (aside from the issue described in #66652) when building with -Os? It would be interesting to see if there is any impact. I imagine there might be some improvements too?
Results for the llvm-test-suite using -Os, before and after this patch (run on my desktop with an Intel i9-13900K, best of three runs):
[size.txt](https://github.com/user-attachments/files/19210670/size.txt)
[exec_time.txt](https://github.com/user-attachments/files/19210671/exec_time.txt)
Some of the large exec_time increases are just measurement error (e.g. LoopInterleavingBenchmarks, where there's no change in the generated code), but some are genuine. Looking at the MemFunctions benchmark, we essentially have
```
typedef unsigned long size_t;
void BM_MemCmp() {
  static constexpr size_t kMaxBufSizeBytes = 4096;
  constexpr const size_t kNumElements = kMaxBufSizeBytes / 7;
  char p_storage[kNumElements * 7];
  char* p = p_storage;
  // Trip count is 585, not a multiple of any vector width, so vectorizing
  // needs either tail folding or a scalar epilogue.
  for (int i = 0; i < kNumElements; ++i)
    *(p + i * 7) = 0xff;
  // Keep the stores from being optimized away.
  asm volatile("" : : "r,m"(p) : "memory");
}
```
Tail folding and scalarizing makes this faster but larger, whereas using a scalar epilogue would have been both smaller and faster than that. So if we wanted to vectorize this at -Os, the solution would be to use a scalar epilogue. (Actually, I've reduced this example too much and we don't vectorize it at all because it's judged not worth it, but I think the point still stands.)
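For illustration, here's a minimal sketch of the scalar-epilogue shape in source form (not LLVM's actual output), assuming a hypothetical vectorization factor of 4; the function name, `VF`, and the loop structure here are mine, not the benchmark's:
```
#include <cstddef>

void store_with_scalar_epilogue(char *p, std::size_t n) {
  constexpr std::size_t VF = 4; // assumed vectorization factor
  std::size_t i = 0;
  // Vector body: covers VF iterations at a time. No lane needs predication,
  // so the compiler can emit plain vector stores here.
  for (; i + VF <= n; i += VF)
    for (std::size_t lane = 0; lane < VF; ++lane) // stands in for one vector op
      *(p + (i + lane) * 7) = 0xff;
  // Scalar epilogue: at most VF-1 leftover iterations, no masking needed.
  for (; i < n; ++i)
    *(p + i * 7) = 0xff;
}
```
With tail folding there is no remainder loop: the vector body also executes the final partial iteration under per-lane predication, and it's the scalarization of those predicated stores that produces the larger code this patch avoids at optsize.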
https://github.com/llvm/llvm-project/pull/129265