[PATCH] D45098: [AArch64] fix PR32384: bump the number of stores per memset/memcpy/memmov

Tue Apr 3 12:53:21 PDT 2018

sebpop added a subscriber: SirishP.
sebpop added a comment.

I just helped the compiler by adding restrict, and I see pretty good code generated for this example:

  void fun(char * restrict in, char * restrict out) {
    memcpy(out, in, 100);
  }

LLVM produces:

  	ldp	q0, q1, [x0, #64]
  	stp	q0, q1, [x1, #64]
  	ldp	q0, q1, [x0, #32]
  	stp	q0, q1, [x1, #32]
  	ldp	q0, q1, [x0]
  	ldr	w8, [x0, #96]
  	str	w8, [x1, #96]
  	stp	q0, q1, [x1]
  	ret

And here is the testcase I was looking at before, which produces the interleaved ldr/str sequence:

  void fun(char *in, char *out) {
    memcpy(out, in, 100);
  }

Without restrict, the MI scheduler is unable to move the ldr instructions past the str instructions:

  	ldr	w8, [x0, #96]
  	str	w8, [x1, #96]
  	ldr	q0, [x0, #80]
  	str	q0, [x1, #80]
  	ldr	q0, [x0, #64]
  	str	q0, [x1, #64]
  	ldr	q0, [x0, #48]
  	str	q0, [x1, #48]
  	ldr	q0, [x0, #32]
  	str	q0, [x1, #32]
  	ldr	q0, [x0, #16]
  	str	q0, [x1, #16]
  	ldr	q0, [x0]
  	str	q0, [x1]
  	ret
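
As a rough illustration of why a load cannot in general be hoisted above an earlier store when the two pointers might overlap, here is a minimal sketch in C. It models only the expanded load/store sequence, not memcpy itself (calling memcpy on overlapping buffers is undefined). The function names and the 4-byte chunk size are made up for the example and do not come from LLVM:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: copy 8 bytes as load, store, load, store,
   the same shape as the non-restrict output above. */
void copy8_interleaved(const unsigned char *in, unsigned char *out)
{
    uint32_t a, b;
    assert(in != NULL && out != NULL);
    memcpy(&a, in, 4);        /* load  [in]      */
    memcpy(out, &a, 4);       /* store [out]     */
    memcpy(&b, in + 4, 4);    /* load  [in + 4]  */
    memcpy(out + 4, &b, 4);   /* store [out + 4] */
}

/* Hypothetical model: the same copy with both loads moved
   ahead of both stores. */
void copy8_batched(const unsigned char *in, unsigned char *out)
{
    uint32_t a, b;
    assert(in != NULL && out != NULL);
    memcpy(&a, in, 4);        /* load  [in]      */
    memcpy(&b, in + 4, 4);    /* load  [in + 4]  */
    memcpy(out, &a, 4);       /* store [out]     */
    memcpy(out + 4, &b, 4);   /* store [out + 4] */
}
```

If out is in + 4, the first store in the interleaved version overwrites the bytes that the second load then reads, so the two orderings leave different bytes in memory. With restrict the compiler may assume this never happens, which is presumably what makes the reordering legal in the first example.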

To get the same result without restrict, the code that expands memcpy in getMemcpyLoadsAndStores()
needs to be amended to emit several loads followed by their stores, rather than one ldr/str pair at a time.
The target should be able to specify the number of consecutive loads and stores to emit.
For generic AArch64 that number should be 2, so that we can produce an ldp; stp sequence.
For Exynos processors it should be much higher, such as 8, since it is better there to have all the loads and all the stores scheduled together.
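
To make the proposed grouping concrete, here is a small sketch in C of the emission order for a given batch size. It is a toy model only, not the code in getMemcpyLoadsAndStores(); the function emit_order and its parameters are hypothetical names for this example. Each 'L' stands for one load and each 'S' for one store:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model, not LLVM code: write into `out` the order in which
   `nops` load/store pairs would be emitted when up to `batch` loads are
   grouped ahead of their matching stores. Returns the string length. */
size_t emit_order(size_t nops, size_t batch, char *out, size_t cap)
{
    size_t pos = 0;
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    if (batch == 0)
        return 0;                       /* nothing sensible to emit */
    for (size_t done = 0; done < nops; done += batch) {
        /* The last group may be shorter than `batch`. */
        size_t n = (nops - done < batch) ? nops - done : batch;
        if (pos + 2 * n + 1 > cap)
            break;                      /* stop rather than overflow */
        memset(out + pos, 'L', n);      /* n consecutive loads  */
        pos += n;
        memset(out + pos, 'S', n);      /* then their n stores  */
        pos += n;
    }
    out[pos] = '\0';
    return pos;
}
```

With six pairs, a batch of 1 gives the current interleaved order (LSLSLSLSLSLS), a batch of 2 gives groups that could each become an ldp; stp pair (LLSSLLSSLLSS), and a batch of 8 puts all the loads before all the stores (LLLLLLSSSSSS).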

Sirish is working on a patch for that.

https://reviews.llvm.org/D45098