[llvm-bugs] [Bug 31333] New: alloca in local memory not promoted to registers

via llvm-bugs llvm-bugs at lists.llvm.org
Fri Dec 9 15:23:59 PST 2016


https://llvm.org/bugs/show_bug.cgi?id=31333

            Bug ID: 31333
           Summary: alloca in local memory not promoted to registers
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Backend: PTX
          Assignee: unassignedbugs at nondot.org
          Reporter: andrew.b.adams at gmail.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

Created attachment 17747
  --> https://llvm.org/bugs/attachment.cgi?id=17747&action=edit
ll that reproduces

The attached .ll uses a single alloca of 64 floats to hold an 8x8 block of a
matrix multiply. Every access to this array is at a constant offset, yet the
generated PTX includes a ton of traffic to and from .local:

...
    st.local.f32     [%rd5+184], %f143;
    ld.local.f32     %f144, [%rd5+188];
    fma.rn.f32     %f145, %f139, %f93, %f144;
    st.local.f32     [%rd5+188], %f145;
    ld.local.f32     %f146, [%rd5+208];
    fma.rn.f32     %f147, %f126, %f108, %f146;
    st.local.f32     [%rd5+208], %f147;
    ld.local.f32     %f148, [%rd5+212];
    fma.rn.f32     %f149, %f129, %f108, %f148;
    st.local.f32     [%rd5+212], %f149;
    ld.local.f32     %f150, [%rd5+240];
    fma.rn.f32     %f151, %f126, %f113, %f150;
    st.local.f32     [%rd5+240], %f151;
    ld.local.f32     %f152, [%rd5+244];
    fma.rn.f32     %f153, %f129, %f113, %f152;
    st.local.f32     [%rd5+244], %f153;
    ld.local.f32     %f154, [%rd5+216];
    fma.rn.f32     %f155, %f136, %f108, %f154;
    st.local.f32     [%rd5+216], %f155;
    ld.local.f32     %f156, [%rd5+220];
...

If you instead use 64 scalar allocas of one float each, all of this traffic
goes away, and the kernel gets 10 times faster!
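
To make the two shapes concrete, here is a minimal sketch in LLVM IR, shrunk to
two accumulators. It is not taken from the attachment: the function names
@array_form and @scalar_form and the values are hypothetical, and I have not
checked that a reduction this small shows the same .local traffic. It uses the
typed-pointer syntax of LLVM at the time of this report; newer releases write
every pointer type as `ptr`.

```llvm
; Hypothetical illustration only, not the attached kernel.
declare float @llvm.fma.f32(float, float, float)

; Shape 1: one array alloca, every access at a constant index.
define float @array_form(float %a, float %b) {
entry:
  %acc = alloca [2 x float], align 4
  %p0 = getelementptr inbounds [2 x float], [2 x float]* %acc, i64 0, i64 0
  %p1 = getelementptr inbounds [2 x float], [2 x float]* %acc, i64 0, i64 1
  store float 0.000000e+00, float* %p0, align 4
  store float 0.000000e+00, float* %p1, align 4
  ; Read-modify-write of element 1, as in the ld/fma/st triples in the PTX.
  %old = load float, float* %p1, align 4
  %new = call float @llvm.fma.f32(float %a, float %b, float %old)
  store float %new, float* %p1, align 4
  %res = load float, float* %p1, align 4
  ret float %res
}

; Shape 2: one scalar alloca per element, same computation.
define float @scalar_form(float %a, float %b) {
entry:
  %acc0 = alloca float, align 4
  %acc1 = alloca float, align 4
  store float 0.000000e+00, float* %acc0, align 4
  store float 0.000000e+00, float* %acc1, align 4
  %old = load float, float* %acc1, align 4
  %new = call float @llvm.fma.f32(float %a, float %b, float %old)
  store float %new, float* %acc1, align 4
  %res = load float, float* %acc1, align 4
  ret float %res
}
```

Both functions compute the same value; the only difference is whether the
accumulators live in one array alloca or in separate scalar allocas.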

The kernel uses vector types, but the problem persists without them too.

Interestingly, if I reduce the size of the alloca to 4 floats (computing a 2x2
block of the output matrix instead of 8x8), then the loads from .local go away,
but the stores remain. So it's keeping two copies of the values: one in
registers and one in .local. I've included this simpler kernel too.

I'm compiling the kernel with:

llc kernel.ll -filetype=asm -march=nvptx -O3 -o -
