<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    Hi Jyotirmoy Bhattacharya,<br>

    <br>

    I've vectorized the outer loop of your example with RV, the Region

    Vectorizer (<a class="moz-txt-link-freetext" href="https://github.com/cdl-saarland/rv">https://github.com/cdl-saarland/rv</a>). I've attached the

    full IR. This is the code I got for the innermost loop:<br>

    <br>

    <style type="text/css">pre.cjk { font-family: "NSimSun",monospace; }p { margin-bottom: 0.1in; line-height: 120%; }</style>

    <pre class="western">for.body4.rv:                                     ; preds = %for.body4.rv, %for.body4.lr.ph.rv

  %indvars.iv14 = phi i64 [ %0, %for.body4.lr.ph.rv ], [ %indvars.iv.next21, %for.body4.rv ]

  %u1.03315 = phi <4 x double> [ zeroinitializer, %for.body4.lr.ph.rv ], [ %u0.03216, %for.body4.rv ]

  %u0.03216 = phi <4 x double> [ zeroinitializer, %for.body4.lr.ph.rv ], [ %add_SIMD, %for.body4.rv ]

  %mul5_SIMD = fmul <4 x double> %mul_SIMD, %u0.03216

  %sub_SIMD = fsub <4 x double> %mul5_SIMD, %u1.03315

  %arrayidx717 = getelementptr inbounds double, double* %coeffs, i64 %indvars.iv14

  %scal_load18 = load double, double* %arrayidx717, align 8

  %.splatinsert19 = insertelement <4 x double> undef, double %scal_load18, i32 0

  %.splat20 = shufflevector <4 x double> %.splatinsert19, <4 x double> undef, <4 x i32> zeroinitializer

  %add_SIMD = fadd <4 x double> %sub_SIMD, %.splat20

  %indvars.iv.next21 = add nsw i64 %indvars.iv14, -1

  %cmp222 = icmp sgt i64 %indvars.iv14, 0

  br i1 %cmp222, label %for.body4.rv, label %for.cond.cleanup3.loopexit.rv

}</pre>

    <br>
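    For readers less used to LLVM IR, here is a rough plain-C model of what
    one vector iteration of the outer loop computes. It is only a sketch:
    the name cheby_eval_4lanes and the LANES constant are made up for
    illustration, and I am assuming that %mul_SIMD (defined outside the
    block shown above) holds 2*x for the four lanes.<br>
    <br>

```c
#define LANES 4 /* matches the <4 x double> vector width in the IR */

/* Hypothetical scalar model of one outer-loop vector iteration:
   evaluates the polynomial at xs[0..LANES-1] in lock step. */
void cheby_eval_4lanes(const double *coeffs, int n,
                       const double *xs, double *ys)
{
  double mul[LANES];
  double u0[LANES] = {0}, u1[LANES] = {0}, u2[LANES] = {0};

  /* Assumed to correspond to %mul_SIMD: 2*x, hoisted out of the inner loop. */
  for (int l = 0; l < LANES; l++)
    mul[l] = 2 * xs[l];

  for (int k = n; k >= 0; k--) {
    /* One scalar load per inner iteration, broadcast to all lanes
       (%scal_load18 followed by the insertelement/shufflevector splat). */
    double c = coeffs[k];
    for (int l = 0; l < LANES; l++) {
      u2[l] = u1[l];
      u1[l] = u0[l];                       /* the two vector phis */
      u0[l] = mul[l] * u1[l] - u2[l] + c;  /* fmul, fsub, fadd */
    }
  }

  for (int l = 0; l < LANES; l++)
    ys[l] = 0.5 * (coeffs[0] + u0[l] - u2[l]);
}
```

    The inner loop over l is what becomes a single vector instruction per
    operation; the loop over k stays scalar and uniform across lanes.<br>
    <br>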

    To reproduce this, get the release_38 branch of RV from GitHub and do

    as follows:<br>

    <br>

    File cheby.c:<br>

    <tt>void cheby_eval(double *coeffs,int n,double *xs,double *ys,int

      m)</tt><br>

    <tt>{

    </tt>

    <pre wrap="">  #pragma omp simd

  for (int i=0;i<m;i++){

    double x = xs[i];

    double u0=0,u1=0,u2=0;

    for (int k=n;k>=0;k--){

      u2 = u1;

      u1 = u0;

      u0 = 2*x*u1-u2+coeffs[k];

    }

    ys[i] = 0.5*(coeffs[0]+u0-u2);

  }

}

<EOF>

</pre>

    1. Compile to IR without any of LLVM's vectorizers:<br>

    <tt>clang -O3 -fno-vectorize -fno-slp-vectorize cheby.c -c

      -emit-llvm -S -o cheby.ll</tt><br>

    <br>

    2. Run the IR through RV's command-line vectorizer:<br>

    <tt>./bin/rvTool -loopvec -w 4 -i cheby.ll -k cheby_eval -o

      cheby.rv.ll</tt><br>

    <br>
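    If you want to sanity-check the vectorized output, one option is to
    compare ys against a direct evaluation of the Chebyshev sum. The sketch
    below is not part of RV; cheby_reference and max_abs_error are
    hypothetical helper names, and it assumes n >= 0 and that cheby_eval
    computes the sum of coeffs[k]*T_k(x) for k = 0..n.<br>
    <br>

```c
/* Hypothetical reference: sum of coeffs[k] * T_k(x) for k = 0..n,
   using T_0 = 1, T_1 = x, T_{k+1} = 2*x*T_k - T_{k-1}. Assumes n >= 0. */
double cheby_reference(const double *coeffs, int n, double x)
{
  double t_prev = 1.0; /* T_0 */
  double t_cur = x;    /* T_1 */
  double sum = coeffs[0];

  if (n >= 1)
    sum += coeffs[1] * x;
  for (int k = 2; k <= n; k++) {
    double t_next = 2.0 * x * t_cur - t_prev;
    sum += coeffs[k] * t_next;
    t_prev = t_cur;
    t_cur = t_next;
  }
  return sum;
}

/* Largest absolute difference between ys[i] (e.g. the output of the
   vectorized cheby_eval) and the reference value at xs[i]. */
double max_abs_error(const double *coeffs, int n,
                     const double *xs, const double *ys, int m)
{
  double worst = 0.0;
  for (int i = 0; i < m; i++) {
    double d = ys[i] - cheby_reference(coeffs, n, xs[i]);
    if (d < 0)
      d = -d;
    if (d > worst)
      worst = d;
  }
  return worst;
}
```

    After calling cheby_eval(coeffs, n, xs, ys, m) from the compiled
    cheby.rv.ll, max_abs_error(coeffs, n, xs, ys, m) should be within
    rounding error of zero. Small differences are expected, since the two
    methods round differently.<br>
    <br>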

    I'd like to add your code to our test suite on GitHub if that is OK

    with you.<br>

    <br>

    Please get in touch with me if you have any other outer loops that

    could/should be vectorized.<br>

    <br>

    Regards,<br>

    Simon<br>

    <br>

    <div class="moz-cite-prefix">On 05/10/2017 04:09 PM, via llvm-dev

      wrote:

    </div>

    <blockquote type="cite"

      cite="mid:mailman.25285.1494425357.1282.llvm-dev@lists.llvm.org">

      <pre wrap="">

I have the following C++ code that evaluates a Chebyshev polynomial using

Clenshaw's algorithm

void cheby_eval(double *coeffs,int n,double *xs,double *ys,int m)

{

  #pragma omp simd

  for (int i=0;i<m;i++){

    double x = xs[i];

    double u0=0,u1=0,u2=0;

    for (int k=n;k>=0;k--){

      u2 = u1;

      u1 = u0;

      u0 = 2*x*u1-u2+coeffs[k];

    }

    ys[i] = 0.5*(coeffs[0]+u0-u2);

  }

}

I'm hoping for an autovectorization of the outer loop so that the inner

loop operates on vectors.

When compiled with

clang++ -O3 -march=haswell -Rpass-analysis=loop-vectorize -S chebyshev.cc

using clang++ 3.8.1-23, no vectorization happens and I get the message

chebyshev.cc:19:18: remark: loop not vectorized: cannot identify array

bounds

      [-Rpass-analysis=loop-vectorize]

    ys[i] = 0.5*(coeffs[0]+u0-u2);

                 ^

chebyshev.cc:21:1: remark: loop not vectorized: value that could not be

      identified as reduction is used outside the loop

      [-Rpass-analysis=loop-vectorize]

On the same code icc vectorizes the outer loop as expected.

I was wondering if there are small ways in which I can change my code to

help LLVM's autovectorizer to succeed. I would also appreciate any pointers

to documentation or LLVM source that can help me better understand how

autovectorization of outer loops works.

Regards,

Jyotirmoy Bhattacharya

PS. The interesting part of icc's assembler output is

..B1.4:                         # Preds ..B1.8 ..B1.3

        xorl      %r15d, %r15d                                  #14.5

        xorl      %ebx, %ebx                                    #14.21

        testq     %rsi, %rsi                                    #14.21

        vmovupd   (%rdx,%r9,8), %ymm3                           #12.16

        vxorpd    %ymm5, %ymm5, %ymm5                           #13.14

        vmovdqa   %ymm1, %ymm4                                  #13.19

        vmovdqa   %ymm1, %ymm2                                  #13.24

        jl        ..B1.8        # Prob 2%                       #14.21

..B1.5:                         # Preds ..B1.4

        vaddpd    %ymm3, %ymm3, %ymm3                           #17.14

..B1.6:                         # Preds ..B1.6 ..B1.5

        vmovapd   %ymm4, %ymm2                                  #20.3

        incq      %r15                                          #14.5

        vmovapd   %ymm5, %ymm4                                  #20.3

        vfmsub213pd %ymm2, %ymm3, %ymm5                         #17.19

        vbroadcastsd (%r11,%rbx,8), %ymm6                       #17.22

        decq      %rbx

        vaddpd    %ymm5, %ymm6, %ymm5                           #17.22

        cmpq      %r10, %r15                                    #14.5

        jb        ..B1.6        # Prob 82%                      #14.5

..B1.8:                         # Preds ..B1.6 ..B1.4

        vbroadcastsd (%rdi), %ymm3                              #19.18

        vaddpd    %ymm3, %ymm5, %ymm4                           #19.28

        vsubpd    %ymm2, %ymm4, %ymm2                           #19.31

        vmulpd    %ymm2, %ymm0, %ymm5                           #19.31

        vmovupd   %ymm5, (%rcx,%r9,8)                           #19.5

        addq      $4, %r9                                       #11.3

        cmpq      %r8, %r9                                      #11.3

        jb        ..B1.4        # Prob 82%                      #11


</pre>

    </blockquote>

    <br>

    <pre class="moz-signature" cols="72">-- 

Simon Moll

Researcher / PhD Student

Compiler Design Lab (Prof. Hack)

Saarland University, Computer Science

Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : <a class="moz-txt-link-abbreviated" href="mailto:moll@cs.uni-saarland.de">moll@cs.uni-saarland.de</a>

Fax. +49 (0)681 302-3065  : <a class="moz-txt-link-freetext" href="http://compilers.cs.uni-saarland.de/people/moll">http://compilers.cs.uni-saarland.de/people/moll</a></pre>

  </body>

</html>