[llvm] r244601 - [X86] Allow merging of immediates within a basic block for code size savings

Thu Aug 13 12:07:21 PDT 2015

Hi Sanjay / Sean,

Just a few comments:

I think 24448 will be addressed by just enabling my patch for O2. With some small tweaks, it should be easy to have it catch the cases in 24447, also (I’ve played around with merging address computations in a local workspace).

My thoughts on the cost of the extra mov-imm are that it’s a fairly insignificant overhead. The resource usage is pretty minimal and shouldn’t have an impact on performance. If you’re concerned with front-end execution/resources like decoding, take into consideration that the reduced size of the instructions should more than make up for any overhead.

If the H/W is fetching 16B of instructions per cycle, for example, we’re fetching around 1.5 instructions per clock with store-imm, while we can fetch almost 3 per clock with store-reg.

So in 1 clock, we can either fetch/decode this:

store <imm>, <mem> (huge 10 bytes)

or this:

xor <reg>, <reg>
store <reg>, <mem>
store <reg>, <mem>

With mov-imm instead of xor, it gets a little bigger, but still a win. The first mov-imm+store will decode in the same clock as the first store-imm, and the subsequent ones will have higher throughput.
Most modern architectures (larger cores) will try to keep the uops in cache, anyway, but we can still hit issues where uops are cached according to IP alignment, as is the case with Intel’s DSB, for example. Larger instructions will result in uop cache lines going to waste.
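
As a rough sanity check of the fetch arithmetic above, here is a small sketch. Only the 16B fetch width and the 10-byte store-imm size come from the text; the other encoded sizes (6 bytes for store-reg, 2 for xor, 5 for mov-imm) are assumptions picked for illustration, and real encodings vary with the addressing mode.

```python
FETCH_BYTES = 16   # bytes fetched per clock, from the example above

# Encoded instruction sizes in bytes. Only STORE_IMM is taken from the text;
# the rest are ASSUMED values for illustration.
STORE_IMM = 10
STORE_REG = 6      # assumed
XOR_REG = 2        # assumed
MOV_IMM = 5        # assumed


def fetch_rate(insn_bytes):
    """Instructions of a given size that fit in one fetch window."""
    return FETCH_BYTES / insn_bytes


def sequence_bytes(n_stores, merged, setup_bytes=MOV_IMM):
    """Total code bytes for n_stores stores of the same immediate.

    merged=True  -> one setup instruction plus n_stores store-reg
    merged=False -> n_stores store-imm
    """
    if merged:
        return setup_bytes + n_stores * STORE_REG
    return n_stores * STORE_IMM


if __name__ == "__main__":
    print("store-imm per clock:", fetch_rate(STORE_IMM))
    print("store-reg per clock:", round(fetch_rate(STORE_REG), 2))
    for n in (1, 2, 3):
        print(n, "stores:",
              "imm =", sequence_bytes(n, merged=False),
              "xor+reg =", sequence_bytes(n, merged=True, setup_bytes=XOR_REG),
              "mov+reg =", sequence_bytes(n, merged=True))
```

Under these assumed sizes, xor plus two store-reg fits in a single 16B window, and the mov-imm variant is smaller than the store-imm sequence from the second store onward.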

Regarding the Intel Perf Rule cited below, I believe that was specific to Pentium4 and how immediates were packed into the trace cache. Pentium4 is a “who cares” today, along with the trace cache (thankfully ☺ ).

Regarding 24449, the optimization would be nice to do, as long as we’re careful we don’t create additional hazards with the larger memory instructions. Specifically, we don’t want to start splitting cache lines.

Thanks,
Zia.

From: Sanjay Patel [mailto:spatel at rotateright.com]
Sent: Thursday, August 13, 2015 11:26 AM
To: Sean Silva
Cc: Kuperstein, Michael M; llvm-commits at lists.llvm.org; Quentin Colombet; Ansari, Zia; Nadav Rotem; Hal Finkel
Subject: Re: [llvm] r244601 - [X86] Allow merging of immediates within a basic block for code size savings

Filed as:
https://llvm.org/bugs/show_bug.cgi?id=24447
https://llvm.org/bugs/show_bug.cgi?id=24448
https://llvm.org/bugs/show_bug.cgi?id=24449
The last one looks like the easiest one to solve and probably offers the most upside given that you're seeing mostly zeros being stored.

On Thu, Aug 13, 2015 at 9:21 AM, Sanjay Patel <spatel at rotateright.com> wrote:

On Wed, Aug 12, 2015 at 6:33 PM, Sean Silva <chisophugis at gmail.com> wrote:

For reference, `mov [mem],imm` is decoded into 2 micro-ops (see "Table 1. Typical Instruction Mappings" in [SOG]) whereas `mov [mem],reg` is only 1 micro-op, so it is *preferable* to use a reg since it amortizes the cost of the `mov-imm` micro-op across the stores.

Wow, I never noticed that line in the table. So whatever we do may have to be specialized further by micro-arch...
But the Intel Perf guide has this gem at Rule 39:
"Try to schedule μops that have no immediate immediately before or after μops with 32-bit immediates."

 ...so maybe it's a no-brainer for everyone after all. :)
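
The micro-op amortization argument quoted above (2 micro-ops for `mov [mem],imm`, 1 for `mov [mem],reg`) can be tallied with a short sketch. The per-instruction counts are the ones cited from the SOG table; treating the register-materializing `mov-imm` as a single micro-op is an assumption, and the function names are hypothetical.

```python
UOPS_STORE_IMM = 2   # mov [mem],imm  (as quoted from the SOG table)
UOPS_STORE_REG = 1   # mov [mem],reg  (as quoted from the SOG table)
UOPS_MOV_IMM = 1     # ASSUMED cost of loading the immediate into a register


def uops_unmerged(n_stores):
    """Micro-ops when every store carries its own immediate."""
    return n_stores * UOPS_STORE_IMM


def uops_merged(n_stores):
    """Micro-ops when the immediate is loaded once and reused."""
    return UOPS_MOV_IMM + n_stores * UOPS_STORE_REG


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "stores:", uops_unmerged(n), "vs", uops_merged(n))
```

With these counts a single store breaks even, and every additional store of the same immediate saves one micro-op.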
