[LLVMbugs] [Bug 23645] New: pow calls are not vectorised on Windows

Sun May 24 13:06:51 PDT 2015

https://llvm.org/bugs/show_bug.cgi?id=23645

            Bug ID: 23645
           Summary: pow calls are not vectorised on Windows
           Product: new-bugs
           Version: 3.6
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: normal
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: nick at indigorenderer.com
                CC: llvmbugs at cs.uiuc.edu
    Classification: Unclassified

The Auto-vectorization documentation appears to claim that pow() calls will be
vectorised:
http://llvm.org/docs/Vectorizers.html#vectorization-of-function-calls
However, the pow calls I'm making in a loop aren't being vectorised: the loop
is unrolled by four, but each pow remains a separate scalar call.

Platform: Windows 8 64-bit, host compiler VS2012, LLVM 3.6. I'm JIT-compiling
LLVM code.

Optimised IR:
-----------------------------------------------------------------
; Function Attrs: nounwind
define internal void @work_function([268435456 x float]* noalias nocapture
align 32, [268435456 x float]* noalias nocapture readonly align 32, float
(float)* nocapture readnone, i64, i64) #0 {
entry:
  %backedge.overflow = icmp eq i64 %4, 0
  br i1 %backedge.overflow, label %loop, label %overflow.checked

overflow.checked:                                 ; preds = %entry
  %n.vec = and i64 %4, -4
  %cmp.zero = icmp eq i64 %n.vec, 0
  br i1 %cmp.zero, label %middle.block, label %vector.body

vector.body:                                      ; preds = %overflow.checked, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %overflow.checked ]
  %induction14 = or i64 %index, 1
  %induction25 = or i64 %index, 2
  %induction36 = or i64 %index, 3
  %5 = getelementptr inbounds [268435456 x float]* %1, i64 0, i64 %index
  %6 = getelementptr inbounds [268435456 x float]* %1, i64 0, i64 %induction14
  %7 = getelementptr inbounds [268435456 x float]* %1, i64 0, i64 %induction25
  %8 = getelementptr inbounds [268435456 x float]* %1, i64 0, i64 %induction36
  %9 = load float* %5, align 16
  %10 = load float* %6, align 4
  %11 = load float* %7, align 8
  %12 = load float* %8, align 4
  %13 = call float @llvm.pow.f32(float %9, float 0x40019999A0000000)
  %14 = call float @llvm.pow.f32(float %10, float 0x40019999A0000000)
  %15 = call float @llvm.pow.f32(float %11, float 0x40019999A0000000)
  %16 = call float @llvm.pow.f32(float %12, float 0x40019999A0000000)
  %17 = getelementptr inbounds [268435456 x float]* %0, i64 0, i64 %index
  %18 = getelementptr inbounds [268435456 x float]* %0, i64 0, i64 %induction14
  %19 = getelementptr inbounds [268435456 x float]* %0, i64 0, i64 %induction25
  %20 = getelementptr inbounds [268435456 x float]* %0, i64 0, i64 %induction36
  store float %13, float* %17, align 16
  store float %14, float* %18, align 4
  store float %15, float* %19, align 8
  store float %16, float* %20, align 4
  %index.next = add i64 %index, 4
  %21 = icmp eq i64 %index.next, %n.vec
  br i1 %21, label %middle.block, label %vector.body, !llvm.loop !0

middle.block:                                     ; preds = %vector.body, %overflow.checked
  %resume.val = phi i64 [ 0, %overflow.checked ], [ %n.vec, %vector.body ]
  %cmp.n = icmp eq i64 %resume.val, %4
  br i1 %cmp.n, label %afterloop, label %loop

loop:                                             ; preds = %entry, %middle.block, %loop
  %loop_index_var = phi i64 [ %next_var, %loop ], [ 0, %entry ], [ %resume.val, %middle.block ]
  %22 = getelementptr inbounds [268435456 x float]* %1, i64 0, i64 %loop_index_var
  %23 = load float* %22, align 4
  %24 = tail call float @llvm.pow.f32(float %23, float 0x40019999A0000000) #0
  %25 = getelementptr inbounds [268435456 x float]* %0, i64 0, i64 %loop_index_var
  store float %24, float* %25, align 4
  %next_var = add i64 %loop_index_var, 1
  %loopcond = icmp eq i64 %next_var, %4
  br i1 %loopcond, label %afterloop, label %loop, !llvm.loop !3

afterloop:                                        ; preds = %loop, %middle.block
  ret void
}
----------------------------------------------------------------
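For readers who prefer source code to IR, here is a rough C++ sketch of what
the IR above computes. It is a reconstruction, not the original program: the
function and helper names are made up for illustration. The exponent comes from
the IR constant 0x40019999A0000000, which is 2.2f (LLVM prints float constants
as the hex encoding of the equivalent double). The four-wide main loop mirrors
vector.body, where the four pow calls are still scalar, and the tail loop
mirrors the scalar %loop block.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret the 64-bit pattern that LLVM prints for a
// float constant as the double it encodes.
inline double decode_ir_constant(std::uint64_t bits) {
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Hypothetical source-level equivalent of @work_function:
// out[i] = pow(in[i], 2.2f) for i in [0, n).
inline void work_function(float* out, const float* in, std::size_t n) {
    // Mirrors the noalias attributes on both pointer arguments in the IR.
    assert(out != in);

    // %n.vec = and i64 %4, -4  (n rounded down to a multiple of 4)
    const std::size_t n_vec = n & ~static_cast<std::size_t>(3);

    std::size_t i = 0;
    for (; i < n_vec; i += 4) {
        // As in vector.body: unrolled by four, but four separate scalar calls.
        out[i + 0] = std::pow(in[i + 0], 2.2f);
        out[i + 1] = std::pow(in[i + 1], 2.2f);
        out[i + 2] = std::pow(in[i + 2], 2.2f);
        out[i + 3] = std::pow(in[i + 3], 2.2f);
    }
    // Remainder iterations, as in the scalar %loop block.
    for (; i < n; ++i) {
        out[i] = std::pow(in[i], 2.2f);
    }
}
```

Calling work_function(out, in, 7) would run one pass of the four-wide body and
then three remainder iterations.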

Autovectorisation is enabled.
A similar loop with calls to e.g. the sqrt intrinsic generates the expected
vectorised sqrt instructions.

d0k says on IRC: "calling the vc++ implementation would be the right way, but
that's not implemented"

VC++ has a vectorised pow implementation: __vdecl_powf4, which calls
__sse2_powf4, see also https://msdn.microsoft.com/en-us/library/dt5dakze.aspx
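To illustrate the shape of the transformation being asked for, here is a
minimal C++ sketch in which the four scalar calls per iteration become a single
four-lane call. pow4 and Float4 are hypothetical stand-ins written for this
example; pow4 is implemented lane by lane and is not the real __vdecl_powf4,
whose exact signature and calling convention are not assumed here. The exponent
2.2f is the value encoded by the IR constant 0x40019999A0000000.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using Float4 = std::array<float, 4>;  // hypothetical 4-lane value type

// Hypothetical stand-in for a four-lane pow routine. A real vector library
// routine would compute all four lanes with SIMD instructions.
inline Float4 pow4(const Float4& x, const Float4& y) {
    Float4 r{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        r[lane] = std::pow(x[lane], y[lane]);
    }
    return r;
}

// Same computation as the reported loop, but with one four-lane call per
// iteration of the main body instead of four scalar calls.
inline void work_function_pow4(float* out, const float* in, std::size_t n) {
    const Float4 exponent = {2.2f, 2.2f, 2.2f, 2.2f};
    const std::size_t n_vec = n - (n % 4);  // n rounded down to a multiple of 4

    std::size_t i = 0;
    for (; i < n_vec; i += 4) {
        const Float4 x = {in[i], in[i + 1], in[i + 2], in[i + 3]};
        const Float4 r = pow4(x, exponent);
        for (std::size_t lane = 0; lane < 4; ++lane) {
            out[i + lane] = r[lane];
        }
    }
    // Scalar remainder loop for the last n % 4 elements.
    for (; i < n; ++i) {
        out[i] = std::pow(in[i], 2.2f);
    }
}
```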
