[PATCH] [AArch64] Inline memcpy() as a sequence of ldp-stp with 64-bit registers

Wed Nov 12 09:16:51 PST 2014

>> In this case it could ignore latency of ldp when it's followed by stp with same operands.
> I don't think that's right. We're not magically going to make the stp less quick, we can just issue them back to back in the same cycle.

Well, I didn't assume there is any magic. What I meant is that when the scheduler looks for the next instruction after `stp`, the `ldp` should be the best match among all predecessors.

> Potentially a ScheduleHazardRecognizer might be the right thing here?

From its description, I'd say that it does the opposite: it allows postponing the execution of an instruction until the next cycle.

Dave's advice to look at clustering in the scheduler, applied through DAG mutations, almost worked. The only issue is that some "free" instructions can still be inserted between `ldp` and `stp` (previously these were the instructions that compute addresses), and I'm not sure this can be solved using clustering. The next thing to try is a custom scheduling strategy; it might be an option, but I have only just started trying to add one.
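To make the clustering idea concrete, here is a toy bottom-up list scheduler. It is only a sketch under my own assumptions and is not LLVM's scheduler or the code in this patch: `Node`, `ClusterPred`, `Priority`, `scheduleBottomUp` and `makeCopyDag` are all hypothetical names. After a node is picked, its clustered predecessor is preferred whenever it is ready, which is the "`ldp` is the best match after `stp`" behaviour described above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only; every name here is hypothetical.
struct Node {
  std::string Name;
  std::vector<int> Preds; // nodes that must execute before this one
  int ClusterPred = -1;   // predecessor we would like directly before us
  int Priority = 0;       // generic heuristic score, higher is picked first
};

// Bottom-up list scheduling: the last instruction is picked first.
// Returns the node names in final program order.
std::vector<std::string> scheduleBottomUp(const std::vector<Node> &Nodes,
                                          bool UseClusters) {
  const int N = static_cast<int>(Nodes.size());
  std::vector<int> UnscheduledSuccs(N, 0);
  for (const Node &Nd : Nodes)
    for (int P : Nd.Preds)
      ++UnscheduledSuccs[P];

  std::vector<bool> Done(N, false);
  std::vector<int> Reversed; // nodes in pick order, i.e. bottom to top
  int Last = -1;
  for (int Step = 0; Step < N; ++Step) {
    int Best = -1;
    // Cluster hint: take the clustered predecessor of the node we just
    // picked, provided all of its successors are already scheduled.
    if (UseClusters && Last >= 0) {
      int C = Nodes[Last].ClusterPred;
      if (C >= 0 && !Done[C] && UnscheduledSuccs[C] == 0)
        Best = C;
    }
    // Otherwise fall back to the ready node with the highest priority.
    if (Best < 0)
      for (int I = 0; I < N; ++I)
        if (!Done[I] && UnscheduledSuccs[I] == 0 &&
            (Best < 0 || Nodes[I].Priority > Nodes[Best].Priority))
          Best = I;
    assert(Best >= 0 && "dependence graph must be acyclic");
    Done[Best] = true;
    for (int P : Nodes[Best].Preds)
      --UnscheduledSuccs[P];
    Reversed.push_back(Best);
    Last = Best;
  }

  std::vector<std::string> Order;
  for (auto It = Reversed.rbegin(); It != Reversed.rend(); ++It)
    Order.push_back(Nodes[*It].Name);
  return Order;
}

// A made-up 32-byte copy: two independent ldp/stp pairs. Stores get the
// higher priority, so a plain bottom-up pick sinks them below the loads.
std::vector<Node> makeCopyDag() {
  std::vector<Node> Nodes(4);
  Nodes[0].Name = "ldp0"; Nodes[0].Priority = 1;
  Nodes[1].Name = "stp0"; Nodes[1].Priority = 2;
  Nodes[1].Preds.push_back(0); Nodes[1].ClusterPred = 0;
  Nodes[2].Name = "ldp1"; Nodes[2].Priority = 1;
  Nodes[3].Name = "stp1"; Nodes[3].Priority = 2;
  Nodes[3].Preds.push_back(2); Nodes[3].ClusterPred = 2;
  return Nodes;
}
```

With the hint enabled, `scheduleBottomUp(makeCopyDag(), true)` yields `ldp1, stp1, ldp0, stp0`; with it disabled the same graph comes out as `ldp1, ldp0, stp1, stp0`. The hint is only honoured when the clustered predecessor is ready, so this toy says nothing about the remaining problem of "free" instructions landing between the pair.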

http://reviews.llvm.org/D6054