[llvm-dev] Data structure improvement for the SLP vectorizer
Keno Fischer via llvm-dev
llvm-dev at lists.llvm.org
Wed Mar 15 06:57:28 PDT 2017
Maybe it would be illustrative to give an IR example of the case I'm
interested in. Consider:
define void @"julia_transform_bvn_derivs_hessian!"(double* %data,
double* %data2, double* %data3, double* %out) {
%element11 = getelementptr inbounds double, double* %data, i32 1
%load10 = load double, double* %data
%load11 = load double, double* %element11
%element21 = getelementptr inbounds double, double* %data2, i32 1
%element22 = getelementptr inbounds double, double* %data2, i32 2
%element23 = getelementptr inbounds double, double* %data2, i32 3
%load20 = load double, double* %data2
%load21 = load double, double* %element21
%load22 = load double, double* %element22
%load23 = load double, double* %element23
%element31 = getelementptr inbounds double, double* %data3, i32 1
%element32 = getelementptr inbounds double, double* %data3, i32 2
%element33 = getelementptr inbounds double, double* %data3, i32 3
%load30 = load double, double* %data3
%load31 = load double, double* %element31
%load32 = load double, double* %element32
%load33 = load double, double* %element33
%mul1 = fmul fast double %load20, %load10
%mul2 = fmul fast double %load21, %load11
%mul3 = fmul fast double %load22, %load10
%mul4 = fmul fast double %load23, %load11
%add1 = fadd fast double %load30, %mul1
%add2 = fadd fast double %load31, %mul2
%add3 = fadd fast double %load32, %mul3
%add4 = fadd fast double %load33, %mul4
%out0 = getelementptr inbounds double, double* %out, i32 0
%out1 = getelementptr inbounds double, double* %out, i32 1
%out2 = getelementptr inbounds double, double* %out, i32 2
%out3 = getelementptr inbounds double, double* %out, i32 3
store double %add1, double* %out0
store double %add2, double* %out1
store double %add3, double* %out2
store double %add4, double* %out3
ret void
}
Currently the SLP vectorizer generates (for avx2 target):
{
%element11 = getelementptr inbounds double, double* %data, i64 1
%load10 = load double, double* %data, align 8
%load11 = load double, double* %element11, align 8
%1 = bitcast double* %data2 to <4 x double>*
%2 = load <4 x double>, <4 x double>* %1, align 8
%3 = bitcast double* %data3 to <4 x double>*
%4 = load <4 x double>, <4 x double>* %3, align 8
%5 = insertelement <4 x double> undef, double %load10, i32 0
%6 = insertelement <4 x double> %5, double %load11, i32 1
%7 = insertelement <4 x double> %6, double %load10, i32 2
%8 = insertelement <4 x double> %7, double %load11, i32 3
%9 = fmul fast <4 x double> %2, %8
%10 = fadd fast <4 x double> %4, %9
%11 = bitcast double* %out to <4 x double>*
store <4 x double> %10, <4 x double>* %11, align 8
ret void
}
I want it to generate:
{
%cast = bitcast double* %data to <2 x double>*
%load = load <2 x double>, <2 x double>* %cast, align 8
%shuffle = shufflevector <2 x double> %load, <2 x double> undef,
<4 x i32> <i32 0, i32 1, i32 0, i32 1>
%1 = bitcast double* %data2 to <4 x double>*
%2 = load <4 x double>, <4 x double>* %1, align 8
%3 = bitcast double* %data3 to <4 x double>*
%4 = load <4 x double>, <4 x double>* %3, align 8
%5 = fmul fast <4 x double> %2, %shuffle
%6 = fadd fast <4 x double> %4, %5
%7 = bitcast double* %out to <4 x double>*
store <4 x double> %6, <4 x double>* %7, align 8
ret void
}
More generally, though, rather than the loads there could be a whole
subtree of vectorizable operations. This becomes an especially large
problem at larger vector widths, where the penalty for missing the
vectorization opportunity is correspondingly larger.
What I was suggesting is that, to handle this case, the load could be a
two-element bundle with a (0, 1, 0, 1) shuffle mask on the edge into the
fmul. I hope that gives an illustrative example of what I care about and
of why I think it can be accomplished with the same solution.
Please let me know if that didn't answer your questions.