<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/135575>135575</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[LoopVectorize] Non-beneficial vectorization involving lookup table
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
vlasakm
</td>
</tr>
</table>
<pre>
Consider the following input IR:
```llvm
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2"
target triple = "x86_64-unknown-linux-gnu"
@L = internal unnamed_addr constant [256 x i16] [i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1, i16 0, i16 1], align 16
define noundef i64 @f(ptr noundef align 8 %arg, i64 noundef %arg1) {
bb:
%i = icmp eq i64 %arg1, 0
br i1 %i, label %bb16, label %bb2
bb2: ; preds = %bb2, %bb
%i3 = phi i64 [ 0, %bb ], [ %i14, %bb2 ]
%i4 = phi i64 [ 0, %bb ], [ %i13, %bb2 ]
%i5 = getelementptr inbounds nuw i8, ptr %arg, i64 %i3
%i6 = load i8, ptr %i5, align 1
%i7 = zext i8 %i6 to i64
%i8 = getelementptr inbounds nuw i16, ptr @L, i64 %i7
%i9 = load i16, ptr %i8, align 2
%i10 = lshr i16 %i9, 1
%i11 = and i16 %i10, 1
%i12 = zext nneg i16 %i11 to i64
%i13 = add i64 %i4, %i12
%i14 = add nuw nsw i64 %i3, 1
%i15 = icmp samesign ult i64 %i14, %arg1
br i1 %i15, label %bb2, label %bb16
bb16: ; preds = %bb2, %bb
%i17 = phi i64 [ 0, %bb ], [ %i13, %bb2 ]
ret i64 %i17
}
```
Whose original program would be something like:
```c
size_t f(const unsigned char* str, size_t len) {
size_t count = 0;
for (size_t i = 0; i < len; i++) {
unsigned char c = str[i];
total += (L[c] & 0x0002) != 0;
}
return count;
}
```
(it's a simplified adaptation of Java code which tries to find the number of uppercase letters in Latin 1 string, with uninteresting details removed).
In O3 pipeline with a sufficient target, this gets vectorized ([Godbolt link](https://godbolt.org/z/r8d56bE1s)), and since https://github.com/llvm/llvm-project/pull/108190, there is also vectorized epilogue.
With loop vectorization turned off, 8 iterations get unrolled ([Godbolt link](https://godbolt.org/z/x5WPfG6qa)), and this version runs ~1.85x faster:
```
Benchmark 1: ./base
Time (mean ± σ): 7.481 s ± 0.288 s [User: 7.479 s, System: 0.001 s]
Range (min … max): 7.336 s … 8.143 s 10 runs
Benchmark 2: ./unrolled
Time (mean ± σ): 5.346 s ± 0.194 s [User: 5.343 s, System: 0.002 s]
Range (min … max): 5.249 s … 5.856 s 10 runs
Benchmark 3: ./vectorized
Time (mean ± σ): 9.890 s ± 0.346 s [User: 9.888 s, System: 0.001 s]
Range (min … max): 9.618 s … 10.373 s 10 runs
Summary
./unrolled ran
1.40 ± 0.07 times faster than ./base
1.85 ± 0.09 times faster than ./vectorized
```
(warnings about outliers removed for brevity).
Steps to reproduce:
```bash
opt -S -O3 -mcpu=sapphirerapids base.ll -o vectorized.ll
opt -S -O3 -mcpu=sapphirerapids -vectorize-loops=false base.ll -o unrolled.ll
for f in *.ll; do llc -mcpu=sapphirerapids --relocation-model=pic $f -o ${f%.ll}.s; done
for f in *.s; do cc bench.c $f -o ${f%.s}; done
hyperfine -w 2 ./base ./unrolled ./vectorized
```
```c
// bench.c
#include <stddef.h>
extern size_t f(const char *str, size_t len);
const char str[] = \
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n"
"A very long string that is just same line copied multiple times\n";
int main(int argc, char **argv) {
size_t len = sizeof(str);
for (size_t i = 0; i < 10000000; i++) {
f(str, len);
}
}
```
Hardware:
```
Model name: INTEL(R) XEON(R) GOLD 5512U
```
The vectorized version is faster than the base case, but still slower than what we can get _without_ vectorization in this case.
</pre>
<img width="1" height="1" alt="" src="http://email.email.llvm.org/o/eJzsWk1z2zjS_jXwpUssEPwQefBBsuN585Y32Uoym72lILIpYgwCHACU7Bz2t28BFCnJdibrSu1lSyrFARuNB_0AjSa6bW6t2CrEa5KtSXZ7xQfXanO9k9zyh-5qo-un6xutrKjRgGsRGi2l3gu1BaH6wcH7TyRZERq-OR2_Uu46QleOmy06qLnjkj_pwQFJboEwhouOJCtc9GxJSbJKWPjhH-PzR9_K0_BjIeZGzAqSrGJWLJqCHloqiPJpeJ4uPgex8OJFz6aO4vAvYWMfI4wdTXVG9BInMx-L_FueLgb1oPReLaRQw-Niq4ZxiP-m9D4oC-XQKC5hUIp3WH_jdW2g0so6rhyQbM2yHB5BxDnJbv2ziHOghN14EcRT4yK5SC6Si-QiuUgukovkf1SS3fo2l2KrIM7Hu1SNjVAISg-qxgZEngJJaUNY0Tszi8cxBRCWcbMNiHk6947SmLASyHJN6GqzGS-n4LvEeFWruh7wz3GCSf8GaNDaGBBx0PUyyTco_dNmE-fPBIcboG8kK3jThyRr6A3W9nDPDGjsZmzNxiaht2_FaGm2HhcwKMFhBb3U68bp3MVC3wSSvgUk-QFIFkC26FBih8r5_RBq49fcghr2IAo_0ovPtyXQmGHyACM1r88HiOzEGWbtZdD-jo8ORHEY77SHnVWKn9o17lqYJ6X3p2YtZ5TyxKwTfY9_tIvN6jEd9W1rgjMHCK94ND2OgwpX9awR02cq7EhPKdweFePnLOPREXhdz8ZPuy3iE7vSWc1TV3Z_sgXnc2fHc2B5hz7_g0G6WX_2pnA2zo9FnL04By8OyuFghGTsv3Aw4uWvOrXBE7LeE8jy9jR9HRl8bbVF0EZshc_seqO3hnew14OsYYNgdYeu9YmwFA_4MgmuCF1Z8R2_OfBRLCSDMKiQbtdQtdwQtgLrjLfwoChRHYOXX5KDvNKDGrNnSpKpr9HeT4uDipi7Q_MmYPkHwtbhewrrP2emQBWGe2uytfALlZzqOu243-D1uDHFPcnWVchkWQ70kVLKwgQsfmbjuLK-ZdANRo1Mxv5XV52wQjjClhY4WNH1UjQCa-A17x13QivQDfw_33GodI2wb0XV-rQdrT84jVA1uBZBDd0Gjdcd-h5NxS2CROfQWBAK7rkTCmLPV6gQr_bCtTCokMijdX5ba3RcSAsGO73DmrAyGm18r-BjAr3oUfpXVhjKwQ5NIyqBysFYTfCwrhXWhygLO6ycNuI71n4BSbb-TdcbLR1IoR6C1xatc731jsTuCLvbjv2R9gH17jthd6aos3zzLraEleF7E4KMFapCeDZYuHbYRJXuCLsL5Zjxv0Vv9B9YOcLu-kFKwu5iWsQlHW1FgyAscGn1qb3YC6m3Ax7of_V8pdb9rDNujN9frEE3jUcrQDg0oSesAAzKaCl_gf5j9vXvzW_5n_ycfljiHRrrbTCDsvCvOCqyR2i4dWheHkxCV2tUVdtx8wCxD1IRYXcbbjG46hfRobexQ66A3DCyjoHc3JEi8ZNOIW0ZpUUMdlIAGrGiABt8Plv_bsPEXmtZgvWmfn6yDjsvpBGlMdgpGn3iajtOKBSQd4wUlKxy6Pjjcb5llCS5n23uhiKK02ScMKaB90jzSI1N1KaVfwO9LErS_IxeXKYv6Hmt5DV67G30soil5Tm9LCqy_C_pJRO9o6e-gWAZFSU9IzgSPidYRoXf1V_evzLK4-KMYEyjZPna_n0euo6bpwB9unlguDqE0jhK6WQ4jegSnOjQHvwdXMvVuUf7EUV2MqL8wYizlXwZmPfcKKG2FvhGDw704KTwAfUQIMMLaWNwJ9zTMVh-dtiH2GywN7oeqldelRtuW0JXunew-AyLjwksuqofSHJred-3wqDhvagteEqRlLA4jU-RlP_R4MU8ZOGjlyXJbcOlxVPUabVHTE-o8e8LwlZekqyh1iBl9cMpFgalrkLUW3S6RkmS215UQFjaeHzCUrJcN4RlHm95G9kRVOHz6Q4dUFWw8T4fvQpi_Uv0iNA-9WhCHrXYA5u94NyRfrrTpzeYMRhPJgRBIlQlhxr9LcO6usYmaknybhyMjw6NghcXn3DJIGz12n3ncB2gqxPV8SYSrhj-ypHdBI2SMLbywf4JpFbbw_vbO7DzL64_BuvCnRbCi7nSvb87dIN0oawfnJ5kN2oq4F_gLnAXuAvcBe4Cd4G7wF3gLnAXuAvcr8BNuYxQDjouFGGFb3KzrXziM-VBhK242e6OZVlCy2NSNBZjxXfUPoMKOdOcJZU_qfvGdPy8Wvz1VGk5g948y8DKQ032tcrs_3FT77l5JX0ndPU3n-uC4h3-ddH__Ycv7-4JKz55k_757uOHqf3bx_tbyLKY_f5y5i8tnpYjp1qfOC9guHbM5KHiFj21zeDAOiElWKn3k9re7-nea6lQlvy2F67Vg_v2rJop1FhZ9GjRVX2d1GVS8iu8jpdpmhVLltOr9po1cV5VFfK8ZLxhZYYVrzhPsywrM1bnV-KaUZbRNE5YkqSMRlmVZkXVsCYuM9oknKQUOy5kJOWui7TZXglrB7yOkyxbZlfhVys2_HUeYwr3EHq9o2W3V-Y61HQ3w9aSlEphnT3COOFk-LO-e637f0zL5zPqD1otNqiwEZXg8gXvnZa78GsNrR-GHhzfSLwajLx-c4k52GoJuzuQ2V2zfwcAAP__qrON7g">