[llvm-dev] Data structure improvement for the SLP vectorizer

Shahid, Asghar-ahmad via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 15 22:11:53 PDT 2017


Comments inline.

> -----Original Message-----
> From: Keno Fischer [mailto:keno at juliacomputing.com]
> Sent: Wednesday, March 15, 2017 7:27 PM
> To: Shahid, Asghar-ahmad <Asghar-ahmad.Shahid at amd.com>
> Cc: llvm-dev at lists.llvm.org; Michael Kuperstein <mkuper at google.com>;
> Matthias Braun <matze at braunis.de>
> Subject: Re: Data structure improvement for the SLP vectorizer
> 
> Maybe it would be illustrative to give an IR example of the case I'm interested in.
> Consider
> 
> define void @"julia_transform_bvn_derivs_hessian!"(double* %data,
> double* %data2, double* %data3, double* %out) {
>     %element11 = getelementptr inbounds double, double* %data, i32 1
>     %load10 = load double, double* %data
>     %load11 = load double, double* %element11
> 
>     %element21 = getelementptr inbounds double, double* %data2, i32 1
>     %element22 = getelementptr inbounds double, double* %data2, i32 2
>     %element23 = getelementptr inbounds double, double* %data2, i32 3
>     %load20 = load double, double* %data2
>     %load21 = load double, double* %element21
>     %load22 = load double, double* %element22
>     %load23 = load double, double* %element23
> 
>     %element31 = getelementptr inbounds double, double* %data3, i32 1
>     %element32 = getelementptr inbounds double, double* %data3, i32 2
>     %element33 = getelementptr inbounds double, double* %data3, i32 3
>     %load30 = load double, double* %data3
>     %load31 = load double, double* %element31
>     %load32 = load double, double* %element32
>     %load33 = load double, double* %element33
> 
>     %mul1 = fmul fast double %load20, %load10
>     %mul2 = fmul fast double %load21, %load11
>     %mul3 = fmul fast double %load22, %load10
>     %mul4 = fmul fast double %load23, %load11
> 
>     %add1 = fadd fast double %load30, %mul1
>     %add2 = fadd fast double %load31, %mul2
>     %add3 = fadd fast double %load32, %mul3
>     %add4 = fadd fast double %load33, %mul4
> 
>     %out0 = getelementptr inbounds double, double* %out, i32 0
>     %out1 = getelementptr inbounds double, double* %out, i32 1
>     %out2 = getelementptr inbounds double, double* %out, i32 2
>     %out3 = getelementptr inbounds double, double* %out, i32 3
> 
>     store double %add1, double* %out0
>     store double %add2, double* %out1
>     store double %add3, double* %out2
>     store double %add4, double* %out3
> 
>     ret void
> }
> 
> Currently, the SLP vectorizer generates (for an AVX2 target):
> 
> {
>   %element11 = getelementptr inbounds double, double* %data, i64 1
>   %load10 = load double, double* %data, align 8
>   %load11 = load double, double* %element11, align 8
>   %1 = bitcast double* %data2 to <4 x double>*
>   %2 = load <4 x double>, <4 x double>* %1, align 8
>   %3 = bitcast double* %data3 to <4 x double>*
>   %4 = load <4 x double>, <4 x double>* %3, align 8
>   %5 = insertelement <4 x double> undef, double %load10, i32 0
>   %6 = insertelement <4 x double> %5, double %load11, i32 1
>   %7 = insertelement <4 x double> %6, double %load10, i32 2
>   %8 = insertelement <4 x double> %7, double %load11, i32 3
>   %9 = fmul fast <4 x double> %2, %8
>   %10 = fadd fast <4 x double> %4, %9
>   %11 = bitcast double* %out to <4 x double>*
>   store <4 x double> %10, <4 x double>* %11, align 8
>   ret void
> }

This is because the broadcast elements are currently being handled as scalars to be "gathered" (hence the insertelement sequence above).

> 
> I want it to generate
> 
> {
>   %cast = bitcast double* %data to <2 x double>*
>   %load = load <2 x double>, <2 x double>* %cast, align 8
>   %shuffle = shufflevector <2 x double> %load, <2 x double> undef,
>                            <4 x i32> <i32 0, i32 1, i32 0, i32 1>
>   %1 = bitcast double* %data2 to <4 x double>*
>   %2 = load <4 x double>, <4 x double>* %1, align 8
>   %3 = bitcast double* %data3 to <4 x double>*
>   %4 = load <4 x double>, <4 x double>* %3, align 8
>   %5 = fmul fast <4 x double> %2, %shuffle
>   %6 = fadd fast <4 x double> %4, %5
>   %7 = bitcast double* %out to <4 x double>*
>   store <4 x double> %6, <4 x double>* %7, align 8
>   ret void
> }

Here, %load should be a 4-element load instead of a 2-element one; you can still do the required
broadcast with the above shuffle. Why this is important: with our scheme of having a DAG with
masks on the edges to a single tree node, generating a vector load of lesser length than the
chosen vector factor will be problematic. If we wanted to do so, we would require another tree
entry with a lesser number of scalar loads.
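For illustration, a minimal sketch of that 4-element variant (assuming it is actually safe to
read four doubles starting at %data; the register names here are mine, not compiler output):

  %cast = bitcast double* %data to <4 x double>*
  %load = load <4 x double>, <4 x double>* %cast, align 8
  %shuffle = shufflevector <4 x double> %load, <4 x double> undef,
                           <4 x i32> <i32 0, i32 1, i32 0, i32 1>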

> 
> More generally, though, rather than the loads, there could be a whole
> subtree of vectorizable operations.
> This becomes an especially large problem at larger vector widths, as the
> penalty for missing the vectorization opportunity grows with the width.
> 
> What I was suggesting is that, to handle this case, the load could be a
> two-element bundle with a (0, 1, 0, 1) shuffle mask on the edge into the fmul.
> Hope that gives an illustrative example of what I care about and why I think it
> can be accomplished with the same solution.
> Please let me know if that didn't answer your questions.

Yes, that sums up the problem. IMO, this can be achieved by checking for the broadcast case:
essentially, looking for partial consecutiveness among the memory accesses and putting a proper
mask for it on the edge from the USE node to the DEF node.
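To make that concrete, here is how the desired IR above would map onto such a DAG. This is only
a sketch of the idea: the DEF/USE labels are illustrative rather than existing SLP-vectorizer
terminology, and %wide stands for the 4-wide load of %data2.

  ; DEF node: a 2-wide bundle of consecutive loads from %data
  %cast = bitcast double* %data to <2 x double>*
  %load = load <2 x double>, <2 x double>* %cast, align 8
  ; mask (0, 1, 0, 1) on the edge from the USE node to the DEF node,
  ; materialized as a shuffle when the tree is emitted
  %shuffle = shufflevector <2 x double> %load, <2 x double> undef,
                           <4 x i32> <i32 0, i32 1, i32 0, i32 1>
  ; USE node: the 4-wide fmul bundle that consumes the masked edge
  %mul = fmul fast <4 x double> %wide, %shuffle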

Regards,
Shahid


