[llvm] r244601 - [X86] Allow merging of immediates within a basic block for code size savings

Sean Silva via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 13 14:06:14 PDT 2015


On Thu, Aug 13, 2015 at 12:07 PM, Ansari, Zia <zia.ansari at intel.com> wrote:

> Hi Sanjay / Sean,
>
>
>
> Just a few comments:
>
>
>
> I think 24448 will be addressed by just enabling my patch for O2. With
> some small tweaks, it should be easy to have it catch the cases in 24447,
> also (I’ve played around with merging address computations in a local
> workspace).
>
>
>
> My thoughts on the cost of the extra mov-imm are that it’s a fairly
> insignificant overhead. The resource usage is pretty minimal and shouldn’t
> have an impact on performance. If you’re concerned with front-end
> execution/resources like decoding, take into consideration that the reduced
> size of the instructions should more than make up for any overhead.
>
>
>
> If the H/W is fetching 16B of instructions per cycle,
>

Interesting ... on Jaguar, 32 bytes can be pulled from L1 per cycle. There
is also no uop cache, and the processor can decode at most 2 instructions
per cycle. This is a notable micro-architectural difference...

-- Sean Silva


> for example, we’re fetching around 1.5 instructions per clock with
> store-imm, while we can fetch almost 3 per clock with store-reg,
>
>
>
> So in 1 clock, we can either fetch/decode this:
>
>
>
> store <imm>, <mem> (huge 10 bytes)
>
>
>
> or this:
>
>
>
> xor <reg>, <reg>
>
> store <reg>, <mem>
>
> store <reg>, <mem>
>
>
>
> With mov-imm instead of xor, it gets a little bigger, but still a win. The
> first mov-imm+store will decode in the same clock as the first store-imm,
> and the subsequent ones will have higher throughput.
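>
> To put rough numbers on that (byte counts are illustrative only and
> assume a 32-bit immediate with a disp32 addressing mode; exact sizes
> depend on the operand size and addressing form):
>
> store <imm>, <mem>      ~10 bytes each (opcode + ModRM + disp32 + imm32)
>
> xor <reg>, <reg>        ~2 bytes
>
> store <reg>, <mem>      ~6 bytes each (opcode + ModRM + disp32)
>
> So two immediate stores are ~20 bytes and spill out of a 16B fetch
> window, while the xor plus two register stores fit in ~14 bytes.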
>
> Most modern architectures (larger cores) will try to keep the uops in a
> uop cache anyway, but we can still hit issues where uops are cached
> according to IP alignment, as is the case with Intel’s DSB, for example.
> Larger instructions will result in uop cache lines going to waste.
>
>
>
> Regarding the Intel Perf Rule cited below, I believe that was specific to
> Pentium4 and how immediates were packed into the trace cache. Pentium4 is a
> “who cares” today, along with the trace cache (thankfully :) ).
>
>
>
> Regarding 24449, the optimization would be nice to do, as long as we’re
> careful we don’t create additional hazards with the larger memory
> instructions. Specifically, we don’t want to start splitting cache lines.
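>
> As an illustration (assuming 64-byte cache lines and a 64-byte-aligned
> base in %rdi; the offsets are made up):
>
> movups %xmm0, 0x38(%rdi)   # bytes 0x38..0x47 straddle the boundary at 0x40
>
> movups %xmm0, 0x40(%rdi)   # same width, but contained in a single line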
>
>
>
> Thanks,
>
> Zia.
>
>
>
>
>
> *From:* Sanjay Patel [mailto:spatel at rotateright.com]
> *Sent:* Thursday, August 13, 2015 11:26 AM
> *To:* Sean Silva
> *Cc:* Kuperstein, Michael M; llvm-commits at lists.llvm.org; Quentin
> Colombet; Ansari, Zia; Nadav Rotem; Hal Finkel
> *Subject:* Re: [llvm] r244601 - [X86] Allow merging of immediates within
> a basic block for code size savings
>
>
>
> Filed as:
> https://llvm.org/bugs/show_bug.cgi?id=24447
> https://llvm.org/bugs/show_bug.cgi?id=24448
> https://llvm.org/bugs/show_bug.cgi?id=24449
>
> The last one looks like the easiest one to solve and probably offers the
> most upside given that you're seeing mostly zeros being stored.
>
>
>
>
>
> On Thu, Aug 13, 2015 at 9:21 AM, Sanjay Patel <spatel at rotateright.com>
> wrote:
>
>
>
>
>
> On Wed, Aug 12, 2015 at 6:33 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>
>
> For reference, `mov [mem],imm` is decoded into 2 micro-ops (see "Table 1.
> Typical Instruction Mappings" in [SOG]) whereas `mov [mem],reg` is only 1
> micro-op, so it is *preferable* to use a reg since it amortizes the cost of
> the `mov-imm` micro-op across the stores.
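>
> As a rough count using that table's numbers (and assuming the reg-imm
> mov is itself a single micro-op): three `mov [mem],imm` stores would be
> 3 x 2 = 6 micro-ops, while one `mov reg,imm` plus three `mov [mem],reg`
> stores would be 1 + 3 x 1 = 4 micro-ops, so the immediate-materialization
> micro-op is paid once rather than per store.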
>
>
>
> Wow, I never noticed that line in the table. So whatever we do may have to
> be specialized further by micro-arch...
>
> But the Intel Perf guide has this gem at Rule 39:
> "Try to schedule μops that have no immediate immediately before or after
> μops with 32-bit immediates."
>
>
>
>  ...so maybe it's a no-brainer for everyone after all. :)
>
>
>
>
>