[PATCH] D118563: [AARCH64][NEON] Reuse extended vdup value in low version of long operations when doing tryCombineLongOpWithDup
Sunho Kim via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sun Jan 30 05:31:47 PST 2022
sunho created this revision.
Herald added subscribers: hiraditya, kristof.beyls.
sunho requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.
Fixes https://github.com/llvm/llvm-project/issues/53261
tryCombineLongOpWithDup is a combine in the SelectionDAG combine stage that extends a vdup vector to double its size so that the patterns for the high versions of long operations (such as umull2 and sabal2) can be used. The problem occurs when the original vdup vector is also used by the low version of a long operation: the high version is patched to use the extended vdup vector, but the low version is left intact. This generates redundant vdup instructions, because the low version does not reuse the extended vdup vector where it could. The following code excerpts from the GitHub issue demonstrate the redundant vdup instructions.
uint8x16_t compare_ok(uint8x16_t x, uint8x16_t y, uint8x16_t X, uint8x16_t Y) {
  auto xx = vmull_u8(vget_low_u8(x), vget_low_u8(X));
  xx = vmlsl_u8(xx, vget_low_u8(y), vget_low_u8(Y));
  auto XX = vmull_u8(vget_high_u8(x), vget_high_u8(X));
  XX = vmlsl_u8(XX, vget_high_u8(y), vget_high_u8(Y));
  auto top_byte = vuzpq_u8(vreinterpretq_u8_u16(xx), vreinterpretq_u8_u16(XX)).val[1];
  return vshrq_n_u8(top_byte, 7);
}
// optimal usage of registers
umull v4.8h, v0.8b, v2.8b
umull2 v0.8h, v0.16b, v2.16b
umlsl v4.8h, v1.8b, v3.8b
umlsl2 v0.8h, v1.16b, v3.16b
uzp2 v0.16b, v4.16b, v0.16b
ushr v0.16b, v0.16b, #7
uint8x16_t compare_fail(uint8x16_t x, uint8x16_t y) {
  return compare_ok(x, y, vdupq_n_u8(33), vdupq_n_u8(119));
}
// same constant is allocated both to 8-byte and 16-byte registers
movi v2.8b, #33
movi v5.8b, #119
movi v3.16b, #33
movi v4.16b, #119
umull v2.8h, v0.8b, v2.8b
umull2 v0.8h, v0.16b, v3.16b
umlsl v2.8h, v1.8b, v5.8b
umlsl2 v0.8h, v1.16b, v4.16b
uzp2 v0.16b, v2.16b, v0.16b
ushr v0.16b, v0.16b, #7
uint8x16_t compare_fail(uint8x16_t x, uint8x16_t y, uint8x8_t coeffs) {
  return compare_ok(x, y, vdupq_lane_u8(coeffs, 0), vdupq_lane_u8(coeffs, 1));
}
// same lane value is duplicated into both 8-byte and 16-byte registers
dup v3.8b, v2.b[0]
dup v4.16b, v2.b[0]
dup v5.8b, v2.b[1]
dup v2.16b, v2.b[1]
umull v3.8h, v0.8b, v3.8b
umull2 v0.8h, v0.16b, v4.16b
umlsl v3.8h, v1.8b, v5.8b
umlsl2 v0.8h, v1.16b, v2.16b
uzp2 v0.16b, v3.16b, v0.16b
ushr v0.16b, v0.16b, #7
This patch fixes the issue by iterating through the users of the original vdup vector and patching the low versions of long operations to use extract_low(new extended vdup vector).
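The rewrite described above can be sketched on a toy graph. This is not the code in the patch and does not use LLVM's SelectionDAG API: Node, Graph, combineLongOpsWithDup and the string opcodes ("dup64", "dup128", "extract_low", "extract_high") are hypothetical names chosen only to illustrate the idea of walking the users of the narrow dup and redirecting both the high and the low long operations to one widened dup.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a DAG node (hypothetical; not LLVM's SDNode).
struct Node {
    std::string op;                // e.g. "dup64", "umull", "extract_high"
    std::vector<Node*> operands;
};

// Toy graph that owns its nodes and can list the users of a node.
struct Graph {
    std::vector<std::unique_ptr<Node>> nodes;

    Node* make(std::string op, std::vector<Node*> operands = {}) {
        nodes.push_back(std::make_unique<Node>(Node{std::move(op), std::move(operands)}));
        return nodes.back().get();
    }

    // Every node that has `n` as at least one operand.
    std::vector<Node*> usersOf(const Node* n) const {
        std::vector<Node*> users;
        for (const auto& candidate : nodes) {
            for (const Node* operand : candidate->operands) {
                if (operand == n) { users.push_back(candidate.get()); break; }
            }
        }
        return users;
    }
};

// Assumption: only these two opcodes count as long operations in the toy model.
bool isLongOp(const Node* n) { return n->op == "umull" || n->op == "umlsl"; }

// A long op with an extract_high operand models the "high version" (umull2, umlsl2).
bool hasExtractHighOperand(const Node* n) {
    for (const Node* operand : n->operands)
        if (operand->op == "extract_high") return true;
    return false;
}

// Widens narrowDup once and redirects every long-op user to the wide value:
// high versions take extract_high(wide), low versions take extract_low(wide).
// Returns the wide dup, or nullptr when no high long op uses narrowDup.
Node* combineLongOpsWithDup(Graph& g, Node* narrowDup) {
    assert(narrowDup->op == "dup64" && "expected a 64-bit dup");
    const std::vector<Node*> users = g.usersOf(narrowDup);

    bool anyHigh = false;
    for (const Node* user : users)
        if (isLongOp(user) && hasExtractHighOperand(user)) anyHigh = true;
    if (!anyHigh) return nullptr;  // nothing to gain from widening

    Node* wideDup = g.make("dup128", narrowDup->operands);
    Node* high = g.make("extract_high", {wideDup});
    Node* low = g.make("extract_low", {wideDup});

    for (Node* user : users) {
        if (!isLongOp(user)) continue;
        Node* replacement = hasExtractHighOperand(user) ? high : low;
        for (Node*& operand : user->operands)
            if (operand == narrowDup) operand = replacement;
    }
    return wideDup;
}
```

In this model, once both the low and the high multiply have been redirected, the narrow dup has no users left, which corresponds to the redundant 8-byte dup or movi disappearing from the generated code.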
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D118563
Files:
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
llvm/test/CodeGen/AArch64/aarch64-combine-long-op-dup-noduplicate.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D118563.404367.patch
Type: text/x-patch
Size: 10377 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20220130/2319dd10/attachment.bin>