<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/63980">63980</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[SLPVectorizer] Performance degradation with xor + fshl on X86
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
annamthomas
</td>
</tr>
</table>
<pre>
With the SLP vectorizer enabled, a hot loop containing 6 xors + 2 fshl calls is vectorized with a VF of 2, which reduces it to 3 vector xors + 1 vector fshl.
The SLP cost model assigns this tree a cost of -8.
This is the loop in question:
```
%iv = phi i64 [ %add323, %vectorized_slp_bb ], [ 16, %bb2 ]
%add288 = add nsw i64 %iv, -3
%getelementptr289 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add288
%load290 = load i32, ptr addrspace(1) %getelementptr289, align 4, !tbaa !28, !noundef !3
%add291 = add nsw i64 %iv, -8
%getelementptr292 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add291
%load293 = load i32, ptr addrspace(1) %getelementptr292, align 4, !tbaa !28, !noundef !3
%xor294 = xor i32 %load293, %load290
%add295 = add nsw i64 %iv, -14
%getelementptr296 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add295
%load297 = load i32, ptr addrspace(1) %getelementptr296, align 4, !tbaa !28, !noundef !3
%xor298 = xor i32 %xor294, %load297
%add299 = add nsw i64 %iv, -16
%getelementptr300 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add299
%load301 = load i32, ptr addrspace(1) %getelementptr300, align 4, !tbaa !28, !noundef !3
%xor302 = xor i32 %xor298, %load301
%call303 = call i32 @llvm.fshl.i32(i32 %xor302, i32 %xor302, i32 1) #5
%getelementptr304 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %iv
store i32 %call303, ptr addrspace(1) %getelementptr304, align 4, !tbaa !28
%add305 = add nuw nsw i64 %iv, 1
%add306 = add nsw i64 %iv, -2
%getelementptr307 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add306
%load308 = load i32, ptr addrspace(1) %getelementptr307, align 4, !tbaa !28, !noundef !3
%add309 = add nsw i64 %iv, -7
%getelementptr310 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add309
%load311 = load i32, ptr addrspace(1) %getelementptr310, align 4, !tbaa !28, !noundef !3
%xor312 = xor i32 %load311, %load308
%add313 = add nsw i64 %iv, -13
%getelementptr314 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add313
%load315 = load i32, ptr addrspace(1) %getelementptr314, align 4, !tbaa !28, !noundef !3
%xor316 = xor i32 %xor312, %load315
%add317 = add nsw i64 %iv, -15
%getelementptr318 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add317
%load319 = load i32, ptr addrspace(1) %getelementptr318, align 4, !tbaa !28, !noundef !3
%xor320 = xor i32 %xor316, %load319
%call321 = call i32 @llvm.fshl.i32(i32 %xor320, i32 %xor320, i32 1) #5
%getelementptr322 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add305
store i32 %call321, ptr addrspace(1) %getelementptr322, align 4, !tbaa !28
%add323 = add nuw nsw i64 %iv, 2
%icmp324 = icmp ugt i64 %add305, 78
br i1 %icmp324, label %6, label %vectorized_slp_bb
```
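For readers who prefer source-level code, here is a rough C sketch of what the scalar IR above appears to compute. The function names and the `w` parameter (standing in for %getelementptr4) are hypothetical; the bounds are read off the phi (start at 16, step 2) and the exit compare against 78. The second function models the VF=2 form by doing all loads for both lanes before either store, which is legal here because the nearest load (iv-2) is below the stored indices (iv, iv+1).
```c
#include <stdint.h>

/* Rotate left by one bit; models llvm.fshl.i32(x, x, 1). */
static uint32_t rotl1(uint32_t x) {
    return (x << 1) | (x >> 31);
}

/* Hypothetical name: scalar form, two elements per iteration as in the IR. */
void expand_scalar(uint32_t *w) {
    for (int i = 16; i <= 78; i += 2) {
        w[i]     = rotl1(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]);
        w[i + 1] = rotl1(w[i - 2] ^ w[i - 7] ^ w[i - 13] ^ w[i - 15]);
    }
}

/* Hypothetical name: models the VF=2 form. Both lanes are fully computed
 * from loads before either lane is stored. */
void expand_pairs(uint32_t *w) {
    for (int i = 16; i <= 78; i += 2) {
        uint32_t lane[2];
        for (int k = 0; k < 2; k++) {
            lane[k] = w[i + k - 3] ^ w[i + k - 8] ^ w[i + k - 14] ^ w[i + k - 16];
        }
        for (int k = 0; k < 2; k++) {
            w[i + k] = rotl1(lane[k]);
        }
    }
}
```
Both functions expect a buffer of at least 80 elements with the first 16 initialized, and should produce identical results.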
When we vectorize it, we get:
```
vectorized_slp_bb: ; preds = %vectorized_slp_bb, %bb2
%iv = phi i64 [ %add323, %vectorized_slp_bb ], [ 16, %bb2 ]
%add288 = add nsw i64 %iv, -3
%getelementptr289 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add288
%add291 = add nsw i64 %iv, -8
%getelementptr292 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add291
%add295 = add nsw i64 %iv, -14
%getelementptr296 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add295
%add299 = add nsw i64 %iv, -16
%getelementptr300 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add299
%getelementptr304 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %iv
%add305 = add nuw nsw i64 %iv, 1
%25 = load <2 x i32>, ptr addrspace(1) %getelementptr289, align 4, !tbaa !28
%26 = load <2 x i32>, ptr addrspace(1) %getelementptr292, align 4, !tbaa !28
%27 = xor <2 x i32> %26, %25
%28 = load <2 x i32>, ptr addrspace(1) %getelementptr296, align 4, !tbaa !28
%29 = xor <2 x i32> %27, %28
%30 = load <2 x i32>, ptr addrspace(1) %getelementptr300, align 4, !tbaa !28
%31 = xor <2 x i32> %29, %30
%32 = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %31, <2 x i32> %31, <2 x i32> <i32 1, i32 1>)
store <2 x i32> %32, ptr addrspace(1) %getelementptr304, align 4, !tbaa !28
%add323 = add nuw nsw i64 %iv, 2
%icmp324 = icmp ugt i64 %add305, 78
br i1 %icmp324, label %6, label %vectorized_slp_bb
```
We see about a 40% degradation on a benchmark that exercises this hot loop.
The assembly for this loop shows that we use 3 scalar 64-bit xorq instructions instead of vpxor, and that the fshl is lowered using xmm registers:
```
movq -12(%rax,%rcx,4), %rdx
xorq 8(%rax,%rcx,4), %rdx
xorq -36(%rax,%rcx,4), %rdx
xorq -44(%rax,%rcx,4), %rdx
vmovq %rdx, %xmm0
vpsrld $31, %xmm0, %xmm1
vpaddd %xmm0, %xmm0, %xmm0
vpor %xmm1, %xmm0, %xmm0
vmovq %xmm0, 20(%rax,%rcx,4)
addq $2, %rcx
cmpq $78, %rcx
```
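To make the instruction sequence above easier to follow, here is a small C model of one iteration as emitted. This is only an illustrative sketch: the function name and parameters are hypothetical, and the index mapping assumes the store at 20(%rax,%rcx,4) corresponds to element iv, so the loads sit at iv-8, iv-3, iv-14 and iv-16. The two i32 lanes travel through a 64-bit general-purpose register for the xors, and the rotate by 1 is done per lane as (x >> 31) | (x + x).
```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: models one iteration of the emitted loop,
 * producing w[i] and w[i + 1]. */
void expand_step_as_emitted(uint32_t *w, int i) {
    uint64_t acc, t;
    uint32_t lane[2];

    memcpy(&acc, &w[i - 8], sizeof acc);            /* movq  (8-byte load)  */
    memcpy(&t, &w[i - 3], sizeof t);   acc ^= t;    /* xorq                 */
    memcpy(&t, &w[i - 14], sizeof t);  acc ^= t;    /* xorq                 */
    memcpy(&t, &w[i - 16], sizeof t);  acc ^= t;    /* xorq                 */

    memcpy(lane, &acc, sizeof acc);                 /* vmovq %rdx, %xmm0    */
    for (int k = 0; k < 2; k++) {
        uint32_t hi = lane[k] >> 31;                /* vpsrld $31           */
        uint32_t dbl = lane[k] + lane[k];           /* vpaddd (x + x)       */
        lane[k] = dbl | hi;                         /* vpor                 */
    }
    memcpy(&w[i], lane, sizeof lane);               /* vmovq (8-byte store) */
}
```
Calling this for i = 16, 18, ..., 78 on an 80-element buffer should give the same values as the scalar rotate-of-xor form.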
While looking at the X86 cost model for arithmetic instructions, I do not see any entry for XOR on v2i32. Should we actually vectorize this loop?
I will attach the IR reproducer. With -slp-threshold=2, we vectorize only this tree and still see the 40% degradation.
</pre>