[llvm-bugs] [Bug 27078] New: [ppc] slow data reorganization in VSX register (through memory)
via llvm-bugs
llvm-bugs at lists.llvm.org
Fri Mar 25 16:23:22 PDT 2016
https://llvm.org/bugs/show_bug.cgi?id=27078
Bug ID: 27078
Summary: [ppc] slow data reorganization in VSX register (through memory)
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: Backend: PowerPC
Assignee: unassignedbugs at nondot.org
Reporter: carrot at google.com
CC: llvm-bugs at lists.llvm.org
Classification: Unclassified
Compile the following code with options:
-mvsx -mcpu=power8 -g0 -O2
typedef float Vector3_f[3];

void foo(Vector3_f* blurred_row, int width, float* pixel, float pixel_diff_avg)
{
  for (int j = 0; j < width; ++j, pixel += 3) {
    float* blurred_pixel = blurred_row[j];
    float pixel_diff[3];
    pixel_diff[0] = blurred_pixel[0] - pixel[0];
    pixel_diff[1] = blurred_pixel[1] - pixel[1];
    pixel_diff[2] = blurred_pixel[2] - pixel[2];
    pixel_diff[0] -= pixel_diff_avg;
    pixel_diff[1] -= pixel_diff_avg;
    pixel_diff[2] -= pixel_diff_avg;
    pixel[0] += pixel_diff[0];
    pixel[1] += pixel_diff[1];
    pixel[2] += pixel_diff[2];
  }
}
LLVM tries to vectorize the loop body and generates:
...
stxsspx 36, 0, 25 // A1
xscvspdpn 8, 8
addi 25, 31, 308
xxsldwi 39, 2, 2, 3
xscvspdpn 10, 10
ld 28, 88(31) # 8-byte Folded Reload
xxsldwi 38, 0, 0, 1
xscvspdpn 12, 12
stxsspx 7, 0, 28 // A2
xxsldwi 2, 2, 2, 2
xscvspdpn 4, 39
ld 28, 80(31) # 8-byte Folded Reload
xxsldwi 3, 3, 3, 1
xscvspdpn 13, 38
xxsldwi 42, 0, 0, 3
stxsspx 9, 0, 28 // A3
stxsspx 11, 0, 26 // A4
xscvspdpn 0, 0
lxvd2x 7, 0, 26 // A5
ld 24, 72(31) # 8-byte Folded Reload
xscvspdpn 2, 2
xscvspdpn 6, 42
stxsspx 32, 0, 24 // B1
xscvspdpn 3, 3
addi 24, 31, 304
ld 28, 64(31) # 8-byte Folded Reload
xxswapd 11, 7
stxsspx 35, 0, 28 // B2
stxsspx 37, 0, 23 // B3
addi 28, 31, 224
stxsspx 33, 0, 22 // B4
ori 2, 2, 0
lxvd2x 9, 0, 22 // B5
stxsspx 41, 0, 21 // C1
stxsspx 8, 0, 20 // C2
stxsspx 10, 0, 19 // C3
stxsspx 12, 0, 18 // C4
xxswapd 10, 40
lxvd2x 8, 0, 18 // C5
stxsspx 43, 0, 17 // D1
stxsspx 13, 0, 16 // D2
stxsspx 4, 0, 15 // D3
stxsspx 5, 0, 14 // D4
xxswapd 12, 9
lxvd2x 4, 0, 14 // D5
stxsspx 0, 0, 6 // E1
stxsspx 6, 0, 3 // E2
stxsspx 2, 0, 11 // E3
stxsspx 3, 0, 28 // E4
xxswapd 6, 8
lxvd2x 2, 0, 28 // E5
...
A[1..4] arrange four floating-point values in memory, and A5 loads them into a
vector register. Similarly, B[1..5], C[1..5], D[1..5], and E[1..5] reorganize
other values into vector registers. The problem is that A4 is very close to A5,
which triggers the very slow store forwarding on POWER8. In the perf results,
almost all of the time is consumed by these loads.
I expect that shuffling these values directly in registers would be much faster.
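To make the two strategies concrete, here is a minimal C sketch of what each
A1..A5 group does versus the in-register form I would expect. It is only an
illustration, not the compiler's actual lowering: the names v4f,
pack_through_memory and pack_in_registers are hypothetical, and the
vector_size attribute is a GCC/Clang extension. Whether the second form really
stays in registers depends on the backend.

```c
#include <string.h>

/* Hypothetical 4 x float vector type (GCC/Clang vector extension). */
typedef float v4f __attribute__((vector_size(16)));

/* Through memory: four scalar stores into adjacent slots (like A1..A4),
   followed by one 16-byte load of the same bytes (like A5). The load
   immediately follows the last store to the same location. */
static v4f pack_through_memory(float a, float b, float c, float d)
{
    float slot[4];
    v4f v;
    slot[0] = a;
    slot[1] = b;
    slot[2] = c;
    slot[3] = d;
    memcpy(&v, slot, sizeof v);
    return v;
}

/* In registers: build the vector lane by lane, with no temporary array
   in the source. The compiler is free to use permutes/shuffles here. */
static v4f pack_in_registers(float a, float b, float c, float d)
{
    v4f v = { a, b, c, d };
    return v;
}
```

Both functions return the same lanes; the difference is only in how the
values get there.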