[PATCH] D121410: Have cpu-specific variants set 'tune-cpu' as an optimization hint
Andy Kaylor via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Thu Mar 10 14:53:07 PST 2022
andrew.w.kaylor added a comment.
This example illustrates the problem this patch intends to fix: https://godbolt.org/z/j445sxPMc
For Intel microarchitectures before Skylake, the LLVM cost model treats vector fsqrt as slow, so when fast-math is enabled we use an approximation rather than the vsqrtps instruction when vectorizing a call to sqrtf(). If the code is compiled with -march=skylake or -mtune=skylake, we choose vsqrtps. With any earlier base target, however, we choose the approximation even inside a cpu_specific(skylake) implementation in the source code.
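For example,
#include <math.h>

float x[8], y[8]; /* assumed declarations; the Godbolt link has the full source */

__attribute__((cpu_specific(skylake))) void foo(void) {
  for (int i = 0; i < 8; ++i)
    x[i] = sqrtf(y[i]);
}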
compiles to
foo.b:
        vmovaps ymm0, ymmword ptr [rip + y]
        vrsqrtps ymm1, ymm0
        vmulps ymm2, ymm0, ymm1
        vbroadcastss ymm3, dword ptr [rip + .LCPI2_0] # ymm3 = [-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0]
        vfmadd231ps ymm3, ymm2, ymm1 # ymm3 = (ymm2 * ymm1) + ymm3
        vbroadcastss ymm1, dword ptr [rip + .LCPI2_1] # ymm1 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
        vmulps ymm1, ymm2, ymm1
        vmulps ymm1, ymm1, ymm3
        vbroadcastss ymm2, dword ptr [rip + .LCPI2_2] # ymm2 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
        vandps ymm0, ymm0, ymm2
        vbroadcastss ymm2, dword ptr [rip + .LCPI2_3] # ymm2 = [1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38]
        vcmpleps ymm0, ymm2, ymm0
        vandps ymm0, ymm0, ymm1
        vmovaps ymmword ptr [rip + x], ymm0
        vzeroupper
        ret
but it should compile to
foo.b:
        vsqrtps ymm0, ymmword ptr [rip + y]
        vmovaps ymmword ptr [rip + x], ymm0
        vzeroupper
        ret
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D121410/new/
https://reviews.llvm.org/D121410