[PATCH] D45098: [AArch64] fix PR32384: bump the number of stores per memset/memcpy/memmov
Sebastian Pop via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Apr 3 12:53:21 PDT 2018
sebpop added a subscriber: SirishP.
sebpop added a comment.
I just helped the compiler by adding restrict, and I see pretty good code generated for this example:
#include <string.h>

void fun(char * restrict in, char * restrict out) {
    memcpy(out, in, 100);
}
LLVM produces:
ldp q0, q1, [x0, #64]
stp q0, q1, [x1, #64]
ldp q0, q1, [x0, #32]
stp q0, q1, [x1, #32]
ldp q0, q1, [x0]
ldr w8, [x0, #96]
str w8, [x1, #96]
stp q0, q1, [x1]
ret
And here is the testcase I was looking at before, which produces the mix of ldr/str:
#include <string.h>

void fun(char *in, char *out) {
    memcpy(out, in, 100);
}
Without restrict, the MI scheduler is unable to move ldr past str:
ldr w8, [x0, #96]
str w8, [x1, #96]
ldr q0, [x0, #80]
str q0, [x1, #80]
ldr q0, [x0, #64]
str q0, [x1, #64]
ldr q0, [x0, #48]
str q0, [x1, #48]
ldr q0, [x0, #32]
str q0, [x1, #32]
ldr q0, [x0, #16]
str q0, [x1, #16]
ldr q0, [x0]
str q0, [x1]
ret
For this to work, the code that expands memcpy in getMemcpyLoadsAndStores()
needs to be amended to produce more than one ldr/str at a time.
The target should be able to specify the number of consecutive loads and stores to produce.
For generic AArch64 that should be 2, so that we can produce an ldp; stp sequence.
For Exynos processors it should be a much higher number, like 8, as it is better to have all the loads and all the stores scheduled together.
Sirish is working on a patch for that.
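To make the grouping idea concrete, here is a rough sketch in plain C of the order in which loads and stores could be emitted for a given group size. It is not the code in getMemcpyLoadsAndStores() and not the patch Sirish is writing: the names MemOp, widest_piece, plan_copy and the "gang" parameter are all made up for illustration, and it walks offsets in ascending order for simplicity.

```c
#include <stddef.h>

/* Hypothetical record of one emitted memory operation. */
typedef struct {
    char kind;     /* 'L' for a load, 'S' for a store */
    size_t offset; /* byte offset from the base pointer */
    size_t width;  /* bytes moved: 16 (q register), 8, 4, 2 or 1 */
} MemOp;

/* Widest power-of-two piece, capped at 16 bytes, that fits in `remaining`.
   Callers must pass remaining >= 1. */
static size_t widest_piece(size_t remaining) {
    size_t w = 16;
    while (w > remaining)
        w /= 2;
    return w;
}

/* Plan a copy of `size` bytes, emitting up to `gang` loads followed by the
   matching stores, and repeating until all bytes are covered.
   `gang` stands in for the target-specified number of consecutive loads
   and stores (an assumed knob, e.g. 2 for ldp/stp pairing).
   Writes at most `cap` entries to `ops` and returns the number written. */
size_t plan_copy(size_t size, size_t gang, MemOp *ops, size_t cap) {
    size_t n = 0, offset = 0;
    if (gang == 0)
        gang = 1;
    while (offset < size) {
        size_t first = n, count = 0;
        /* Emit the group of loads first. */
        while (offset < size && count < gang) {
            size_t w = widest_piece(size - offset);
            if (n == cap)
                return n;
            ops[n].kind = 'L';
            ops[n].offset = offset;
            ops[n].width = w;
            n++;
            offset += w;
            count++;
        }
        /* Then emit one store per load, in the same order. */
        for (size_t i = 0; i < count; i++) {
            if (n == cap)
                return n;
            ops[n] = ops[first + i];
            ops[n].kind = 'S';
            n++;
        }
    }
    return n;
}
```

With gang set to 1 this reproduces the alternating ldr/str order from the second listing. With 2, each pair of adjacent loads (and then stores) is a candidate for ldp/stp. With 8, all seven loads of the 100-byte copy come before any store.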
https://reviews.llvm.org/D45098