[llvm-dev] Matrix Multiplication not Vectorized using double pointers

Sun Oct 20 16:23:44 PDT 2019

Compile with optimization remarks on ("-Rpass-analysis=.*", or 
"-Rpass-missed=.*") to get some optimization remarks for why the 
optimizer is or is not handling certain constructs. (Warning, this is a 
lot, and it will often be hard to decode if you don't already know how 
the relevant optimization passes work.)

There are several problems here, but zooming out, there are just way too 
many pointers here that could all potentially alias. Proving statically 
that none of these pointers alias is easier said than done, especially 
when variable-sized arrays are involved. Even if you could handle that 
within a function (not trivial), the next problem after that is crossing 
function boundaries (if there's code between allocation and use), and 
that gets messy really quick.

If you have 2D arrays, then if at all possible, allocating them 
contiguously and use a single base pointer to access all elements of a 
given array. That's a form the loop analysis passes know how to deal 
with and it's going to go better, although you will probably still need 
some annotations to promise the compiler there's no aliasing. (Easier 
said than done; you can flag the result pointer as __restrict, but let's 
just say that this path is fraught with peril.)

After writing the whole thing using single base pointers, manual 2D 
array access and, I can get Clang 9.0.0 to vectorize the loop (for x86 
at -march=skylake which has cheap enough gathers), but it's still a very 
direct translation of the original code, which is not what you actually 
want.

Efficient vectorization of matrix multiplies is a well-studied problem, 
but that doesn't make it easy. Turning the basic three nested loops into 
a form that achieves high performance requires _many_ different 
transform steps. (Among other things, almost certainly turning the three 
nested loops into at least five, to process the data in small blocks, 
getting rid of the gathers and turning them into in-register 
transpositions.)

Getting state-of-the-art loop optimizations into LLVM has been an 
ongoing project for many years (-> "Polly", though that seems to have 
gotten quiet recently?). I believe LLVM+Polly would do a lot better on 
this particular example, but it's not enabled (or compiled in) for 
normal compiler releases.

-Fabian

On 10/20/2019 12:31 AM, hameeza ahmed via llvm-dev wrote:
> Hello,
> My matrix multiplication code has variables allocated via double 
> pointers on heap. The code is not getting vectorized. Following is the code.
> 
>   int **buffer_A = (int **)malloc(vecsize * sizeof(int *));
>   int **buffer_B = (int **)malloc(vecsize * sizeof(int *));
> for(p = 0; p < vecsize; p++)
>      {
>           buffer_A[p] = (int *)malloc(vecsize * sizeof(int));
>       }
> for(p = 0; p < vecsize; p++)
>      {
>           buffer_B[p] = (int *)malloc(vecsize * sizeof(int));
>       }
>      long i, j, k;
>   int **result = (int **)malloc(vecsize * sizeof(int *));
>      for(p = 0; p < vecsize; p++)
>      {
>           result[p] = (int *)malloc(vecsize * sizeof(int));
>       }
>      for(i = 0; i < vecsize; i++)
>      {
>          for(j = 0; j < vecsize; j++)
>          {
>           result[i][j]=0;
>              for(k = 0; k < vecsize; k++)
>              {
>                  result[i][j] += buffer_A[i][k] * buffer_B[k][j];
>              }
>          }
>      }
> what is the issue? how to resolve it?
> 
> Thank You
> 
> Regards
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>