<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/63205">63205</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[X86] Masking a scalar load using AVX-512
</td>
</tr>
<tr>
<th>Labels</th>
<td>
backend:X86,
performance
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
dcaballe
</td>
</tr>
</table>
<pre>
Hello,
I’m hitting a corner case with vector masking and AVX-512. I would like to generate efficient code for a potentially out-of-bounds scalar (or single-element vector) load, broadcast the loaded value to all the lanes of a zmm register, and feed it into an FMA instruction.
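In scalar terms, this is the computation I have in mind (a minimal C sketch of my own; the function and variable names are just illustrative):
```
/* Scalar reference of the intended semantics (illustrative names).
   The key constraint: *ptr may be out of bounds, so it must only be
   dereferenced when the mask is set. */
void ref(const float *ptr, int mask, const float a[16],
         const float c[16], float out[16]) {
  float s = mask ? *ptr : 0.0f;   /* guarded, potentially OOB load */
  for (int i = 0; i < 16; ++i)
    out[i] = a[i] * s + c[i];     /* broadcast s into a 16-wide FMA */
}
```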
This is what I’ve tried so far:
### Single-element vector load:
I tried loading the scalar with a masked vector load into a single-element vector. Unfortunately, the backend scalarizes the masked load and predicates it with control flow. That doesn’t look too bad in this example, but it leads to multiple jumps in my real-world scenario, penalizing performance.
```
define <16 x float> @single_element_vload(ptr %ptr, <1 x i1> %mask, <16 x float> %a, <16 x float> %c) {
  %vsingle = call <1 x float> @llvm.masked.load.v1f32.p0(ptr %ptr, i32 16, <1 x i1> %mask, <1 x float> zeroinitializer)
  %single = extractelement <1 x float> %vsingle, i32 0
  %insert = insertelement <16 x float> undef, float %single, i32 0
  %bcast = shufflevector <16 x float> %insert, <16 x float> undef, <16 x i32> zeroinitializer
  %fma = call <16 x float> @llvm.fmuladd.v16f32(<16 x float> %a, <16 x float> %bcast, <16 x float> %c)
  ret <16 x float> %fma
}
declare <1 x float> @llvm.masked.load.v1f32.p0(ptr, i32, <1 x i1>, <1 x float>)
declare <16 x float> @llvm.fmuladd.v16f32(<16 x float>, <16 x float>, <16 x float>)
```
```
single_element_vload:                   # @single_element_vload
        vxorps  %xmm2, %xmm2, %xmm2
        testb   $1, %sil
        je      .LBB3_2
        vmovss  (%rdi), %xmm2           # xmm2 = mem[0],zero,zero,zero
.LBB3_2:                                # %else
        vbroadcastss    %xmm2, %zmm2
        vfmadd213ps     %zmm1, %zmm0, %zmm2 # zmm2 = (zmm0 * zmm2) + zmm1
        vmovaps %zmm2, %zmm0
        retq
```
### Full vector load:
Here, I tried to load the scalar using a 16-element masked vector load whose mask can only enable the first lane:
```
define <16 x float> @full_element_load_bcast_mask(ptr %ptr, i1 %mask, <16 x float> %a, <16 x float> %c) {
  %vmask = insertelement <16 x i1> zeroinitializer, i1 %mask, i32 0
  %vsingle = call <16 x float> @llvm.masked.load.v16f32.p0(ptr %ptr, i32 16, <16 x i1> %vmask, <16 x float> zeroinitializer)
  %single = extractelement <16 x float> %vsingle, i32 0
  %insert = insertelement <16 x float> undef, float %single, i32 0
  %bcast = shufflevector <16 x float> %insert, <16 x float> undef, <16 x i32> zeroinitializer
  %fma = call <16 x float> @llvm.fmuladd.v16f32(<16 x float> %a, <16 x float> %bcast, <16 x float> %c)
  ret <16 x float> %fma
}
declare <16 x float> @llvm.masked.load.v16f32.p0(ptr, i32, <16 x i1>, <16 x float>)
declare <16 x float> @llvm.fmuladd.v16f32(<16 x float>, <16 x float>, <16 x float>)
```
```
full_element_load_bcast_mask:           # @full_element_load_bcast_mask
        andl    $1, %esi
        kmovw   %esi, %k1
        vmovups (%rdi), %zmm2 {%k1} {z}
        vbroadcastss    %xmm2, %zmm2
        vfmadd213ps     %zmm1, %zmm2, %zmm0 # zmm0 = (zmm2 * zmm0) + zmm1
        retq
```
The assembly for this case looks much better, but I'm curious whether it could be optimized further. The `vbroadcastss` and `vfmadd*ps` instructions support masking, which, if I understand the documentation correctly, would let us fold the `vmovups` instruction into a masked `vbroadcastss`:
```
test_broadcast:
        andl    $1, %esi
        kmovw   %esi, %k1
        vbroadcastss    (%rdi), %zmm2 {%k1} {z}
        vfmadd213ps     %zmm1, %zmm2, %zmm0 # zmm0 = (zmm2 * zmm0) + zmm1
        retq
```
and even fold the masked `vbroadcastss` into a masked `vfmadd*ps` using an embedded broadcast operand:
```
test_fma:
        andl    $1, %esi
        kmovw   %esi, %k1
        vfmadd213ps     (%rdi){1to16}, %zmm0, %zmm1 {%k1} {z}
        retq
```
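For reference, the closest I can get to that last sequence from C intrinsics is something like the sketch below (my own, illustrative). Note that the scalar load is unconditional, which is exactly what I need to avoid in the out-of-bounds case, so folding the load under the mask seems to be something only the backend can do safely:
```
#include <immintrin.h>

/* Intrinsics sketch (illustrative). _mm512_maskz_broadcastss_ps only takes a
   register source, so the scalar load below executes unconditionally and is
   unsafe when ptr may be out of bounds. */
__m512 bcast_fma(const float *ptr, __mmask16 k, __m512 a, __m512 c) {
  __m128 s = _mm_load_ss(ptr);                  /* unconditional scalar load */
  __m512 b = _mm512_maskz_broadcastss_ps(k, s); /* masked broadcast */
  return _mm512_fmadd_ps(a, b, c);              /* zmm FMA */
}
```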
I'm probably missing something, but I wanted to check whether this makes sense, or whether you have any other suggestions to improve performance for this case.
Thanks in advance.
Diego
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzsWV9vozoW_zTOy1EjMIUmD3lomqlupd2nvbu6b5WBQ_DE2KxtaJtPv7IhBPKn7azmau5cDYoSYuzj49_54x_HzBi-lYgrEq9JvJmxxpZKr_KMpUwInKUqf1v9hkIoQh9IsCHBfff9RL5QsgjIcllBya3lcgsMMqUlasiYQXjhtoQWM6s0VMzsfA-Zw_1__riJQzqHJ3hRjchB8B2CVbBFiZpZBCwKnnGUFjKVIxRKA4NaWZSWMyHeQDX2RhU3qWpkbsBkTDANSoPhcisQUGDlRveTC8VyJ9-WKCHViuUZMxa4dY1MCPcABJNoQBXAYF9VoHHLjUXtNS4Q80N3KCoGXBqrm8xyJedjUH4vuQFu4KVkFo4ItQhWc8zBKCiYJtH9eBChUfeBf11Vfxjy1Euyql9WiYf1N6azgcMa89PFs0ODX-IUqDn8WxZK20Yyi-KN0AcvN2XZDmXey-d7NL55LNihU2vMecYsGodRp0WmpNVKQCHUyxyeLOQKjRwAsSCU2oFVClKWA5dgHXL4yqpaIKSNt45Alhune9UIy92Dr01VG9e9egONTHSOZjKUTHPnoVCjZILvnQ416kLpiskMJ0YiSdB__N8cCy4RSPQQJvDqNGaWRF-A3AYdSs89Ss-ttwRd1FYDoXFttZvRDYRX4KEfRGMH_6F9KpDG7NqDjNAlkLt1pxIAuMa2txKJNpA5P-2nGqsoRFvNO4vPnXrzNvSP53VwpimPKITJRzqPJ9ijVlxyF3V8j9opmTZcWC4nio70xFerWWYPLnym8bCqg0LBRBKXBrX1krrbsaApaNGDv4dG5lg4aX_vv9GXY8sA-WUQU5_eHIambIpCYB-xlxyvQ_miW54Z_9iHR7S3QTf7oMaPvYm-THBwmXoSO6fx7YOnqBrBchc4SfeMLr4pdD3a78X1USeNlxzZK9qnprvNISllgmn8DhH_LeHuBjJjbKmbQe-xJn8Ofh9jd5KxLzae5Op-04QLl9tpz5O7HzBYyl3tq9K18Z70WlUdkOe3kyEWjU39FLdh38lwMe3zFbvf-T_W6-j5REBbqdb4OReExjrnDoPjbFdW4x85T6-wIvE6IPGG0AcXvyc_fq7DxJfROUOKxigMnqg5sCin7ClE-zNc2qJieU7DqDa9YNcpPPYPRkOHqfeHZRG6cH2A0Hvf6DdLugYv4ww_VptBi9EEk34a7X_f862Bkj02QlwkYt33b6h9Ev4ELQuTm8Nmds7QTinhpylK0Qgx8eFnn4-euxA_TQfhdyYoTtQ7m3WfcKIHHkLBnBd1Svxc993OO8XufNO9SNYuJsxp7k4-SdeScQJvr9rwuzC2U-P_omy_KNsvyvYBZfskXfqG6L8Y-u9l75-CvE3_vrt_TRhKT9veHTDZ45nMBUzJGBo-7bOrVPvSuZ571vXaXSAVjScVZ6Ssoyh3627U3cbd7weX-NPo0pjYjOlSMKJL9ECXgut06QMa9HuJwIzBKhVvvgrnqzS-5iKU2hmomqyEFK1F7Ys2T4TeVZA1mqvGAC_6Ab7KlyKo2vKK7zGHotG2RD0HkgRjdEgS-JqSa_YgEHpf-9ZRyc2AaepaaTuUFV9KnpU-agp48tlXG-vkOB6Wq6xx3sLcWMiU1pjZrsrV1x_RQmOgUKIb4CbvTH4yMXDpK2k9fzvX_UjjpnC694LnoevQ6_t76pmj_b8e-xfxQGdDbFF6u1yF_YJdps7jS9GqRu3kDWO53LoUc91iLsl__Jb0V7k-41M_VMH3rk85-49V8fr1UeyMgvBuHVoVJi7gLr38-iVeiM4fvKQPorRL-7VWKXM7RcWNf-c1qkJbuju_N8ALk7Z7R85KzHbD_lCxHRowKA2C0q75TTVQshaByTdQbqcA02y3aLr0bxXwqtaqRZ8XRoX-6S51cjbD5M4fHrC8HR0KbDhu1eTl23_P8lWUL6Mlm-EqTBYJvaNRkszKVcKKIsD0Ng7iRZjlbJlEMbulcZCGGCwiNuMrGtAoSIJlEMRLmszvEJNlssA4T5fLoKDkNsCKcTH3FEzp7Ywb0-AqiWgQzwRLURh_HkdpfwRDovs_FgmhLusSSkcLdm3xZqZXTtZN2myNo3bcWHOUbrkV_oDPyYg38M_DYdyhROFLFl2doj-cmzVarEpra-M2K_pI6OOW27JJ55mqCH10svufm1qrr5hZQh_9Mgyhj34l_wsAAP__flD4MQ">