[LLVMdev] Macro-op fusion experiment

Sun Apr 17 09:59:31 PDT 2011

Hi Jakob,

As far as I know, an x86 'mov' instruction always uses an ALU resource.
According to Agner Fog's documents (http://www.agner.org/optimize/), it can
execute on port 0, 1 or 5 on recent architectures, so it's not very likely
to be limited by execution resources. But it still occupies an instruction
slot throughout the entire pipeline, which costs power and can keep actual
arithmetic instructions from being scheduled optimally. It also has a
latency of 1 cycle, so anything that depends on the result has to wait for
the 'mov' plus the arithmetic instruction (2 cycles for mov + add), whereas
a single non-destructive instruction would deliver the result after 1 cycle.

My immediate concern is getting a reasonable estimate for how often this
macro-op fusion could be performed. This could then be used to evaluate
whether it's worth the added decoder complexity.
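
To make the kind of estimate I have in mind concrete, here is a minimal
sketch in Python. It is not an existing tool: the function names, the
register list and the set of "fusable" opcodes are all assumptions made for
illustration. It scans a plain list of Intel-syntax instructions and counts
the places where a register-to-register 'mov' is immediately followed by a
two-operand ALU instruction that overwrites the same destination, as in the
mov/add example quoted below.

```python
import re

# Assumption: only 32-bit general-purpose registers are considered.
REGS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}

# Assumption: hypothetical set of two-operand ALU opcodes worth fusing.
FUSABLE_OPS = {"add", "sub", "and", "or", "xor"}

# Matches "mnemonic dst, src" where both operands are bare tokens,
# so memory operands such as [ebx] never match.
_INSN = re.compile(r"^\s*(\w+)\s+(\w+)\s*,\s*(\w+)\s*$")


def parse(line):
    """Return (mnemonic, dst, src) for a reg-reg instruction, else None."""
    m = _INSN.match(line.lower())
    if not m:
        return None
    mnemonic, dst, src = m.groups()
    if dst not in REGS or src not in REGS:
        return None
    return mnemonic, dst, src


def count_fusion_candidates(lines):
    """Count adjacent 'mov rA, rB' / 'op rA, rC' pairs in the listing."""
    insns = [parse(line) for line in lines]
    candidates = 0
    i = 0
    while i + 1 < len(insns):
        first, second = insns[i], insns[i + 1]
        if (first and second
                and first[0] == "mov"
                and second[0] in FUSABLE_OPS
                and second[1] == first[1]):
            candidates += 1
            i += 2  # both instructions are consumed by the fused op
        else:
            i += 1
    return candidates
```

Dividing the count by the total number of instructions in the listing gives
a rough static fusion rate. Note that this only sees pairs that are
adjacent in the listing and ignores how often each pair actually executes,
so it is at best a first approximation.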

Cheers,
Nicolas

On Fri, Apr 8, 2011 at 7:27 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:

>
> On Apr 8, 2011, at 9:56 AM, NAKAMURA Takumi wrote:
>
> >>>                 8B C3 mov eax, ebx
> >>>                 03 C1 add eax, ecx
> >>> becomes
> >>>                 8B C3 03 C1 add eax, ebx, ecx
> >
> > In my understanding, the twoaddr pass tends to emit such a sequence.
>
> Yes, it always does, and the coalescer tries very hard to eliminate the
> copy.
>
> > I don't have a Sandy Bridge, so I have not measured.
> > Prior processors (Intel and AMD) might spend one ALU op to execute the
> > "mov", and the mov-add pair must then have a dependency.
>
> I think you will find it is more complicated than that. A 'mov' usually
> doesn't need an ALU resource.
>
> You should read about the 'reservation station' style register renaming.
>
> http://en.wikipedia.org/wiki/Register_renaming
> http://www.intel.com/Assets/PDF/manual/248966.pdf
>
> /jakob