[PATCH] D39976: [AArch64] Consider the cost model when folding loads and stores

Fri Nov 17 10:49:06 PST 2017

evandro added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp:696
+    // ... the new instruction is not more complex than both old ones.
+    if (AUops > 1 || BUops > 1)
+      return NewUops <= (AUops + BUops);
----------------
junbuml wrote:
> Can you add more comment why you do this? Is this target independent ? 
I'm not a hardware designer, but AFAIK many targets hiccup when an instruction is decoded into more than one uop.  How bad the hiccup is, if at all, does depend on the target though.  Some decrease the decode bandwidth, typically by inserting a bubble whose size depends on the design.  This heuristic is an attempt at mitigating the new instruction inducing such a bubble.

If the new instruction has a shorter latency than the two old ones combined, then it's chosen.  One might wonder if it's still a good choice if it induces a bubble, but I could not devise a satisfying heuristic.

If the latency of the new instruction is the same as the combined latency of both old ones, then the potential of inducing a bubble is considered.  If either of the old instructions had multiple uops, then even if the new one has them too it's probably no worse than before.  However, if neither of the old instructions resulted in multiple uops, the new one is chosen only if it results in fewer uops than before.  One might argue that, if bubbles when decoding into multiple uops are the norm among targets, it'd be better to choose the new instruction only if it doesn't potentially induce bubbles itself.

If the new instruction has a longer latency than the two old ones combined, then it's discarded.  Again, if it mitigates decode bubbles it might still be profitable, but the conditions seem hard to weigh in general.
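
Put together, the three cases above amount to roughly the following standalone sketch.  It is not the code in the patch: InstrCost and shouldFold are hypothetical names made up for illustration, the latencies and uop counts are assumed to have been looked up from the scheduling model already, and the last branch is just my reading of "fewer uops than before" as fewer than the two old instructions combined.

```cpp
// Hypothetical summary of one instruction's cost; not an LLVM type.
struct InstrCost {
  unsigned Latency; // latency in cycles, assumed taken from the sched model
  unsigned Uops;    // number of micro-ops the instruction decodes into
};

// Returns true if replacing A and B with New looks profitable under the
// heuristic described in the comment above.
bool shouldFold(const InstrCost &A, const InstrCost &B, const InstrCost &New) {
  const unsigned OldLatency = A.Latency + B.Latency;

  // A shorter latency wins outright; a longer one loses outright.
  if (New.Latency < OldLatency)
    return true;
  if (New.Latency > OldLatency)
    return false;

  // Same latency: weigh the risk of a decode bubble.
  const unsigned OldUops = A.Uops + B.Uops;

  // If either old instruction already had multiple uops, accept the new one
  // as long as it is not more complex than both old ones together.
  if (A.Uops > 1 || B.Uops > 1)
    return New.Uops <= OldUops;

  // Neither old instruction had multiple uops: require strictly fewer uops
  // (assumption: "fewer than before" means fewer than the old total).
  return New.Uops < OldUops;
}
```

For instance, with two single-uop instructions of latency 4 each, a folded instruction of latency 8 is taken only if it is a single uop, while a folded instruction of latency 6 is taken regardless of its uop count.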

https://reviews.llvm.org/D39976