<div dir="ltr">I'm not trying to insist on a separate pass vs. embedding this somewhere else. Don't we already have stall and hazard fixing passes?<br><div><br></div><div>Other than the specific place, I agree with the overall approach.</div><div><br></div><div>I also want to push for, wherever possible, choosing an encoding which will be good across a wide range of processors. While it is nice when we can build binaries for the *exact* right microarchitecture, we should minimize the degree to which this changes code generation. If for no other reason, microarchitecture-specific code generation makes things really brittle and hard to predict: for example, random and inexplicable performance swings due to irrelevant variations in which loops are sitting on which cache lines. Consistency of encoding and emission is really useful for making the performance of programs more predictable and consistent over time.</div></div><br><div class="gmail_quote">On Tue, May 12, 2015 at 12:16 PM Smith, Kevin B &lt;<a href="mailto:kevin.b.smith@intel.com">kevin.b.smith@intel.com</a>&gt; wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thanks for the support, Chandler.  I am starting to work on this.<br>

<br>

My initial thoughts are:<br>

<br>

1 - A very late pass through the MachineInstrs that would be inserted as part of X86PassConfig::addPreEmitPass.<br>

<br>

2 - Initially look for 8 bit and 16 bit operations that would be better expanded into 32 bit operations.<br>

      - There could be some different reasons to do this<br>

         a - Specifically for the case in PR23155 where false dependence potentially slows execution.<br>

         b - Just in general for cases where partial registers may cost something (Intel X86 prior to Haswell)<br>

         c - Cases where code size could be saved by using an equivalent 32 bit instruction, such as 16 bit instructions that<br>

               would encode shorter as 32 bit. We want to do this very late to allow for folding memory operations into the 16 and 8 bit<br>

              operations, rather than relying on heuristics to try to predict this.<br>
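<br>
As a hedged illustration of item (c), here is a minimal Python sketch (not LLVM code) of why a 16 bit register-to-register move encodes longer than its 32 bit equivalent in 32 bit or 64 bit mode: the 16 bit form needs the 0x66 operand-size prefix. The helper name encode_mov_rr and the register table are hypothetical and exist only for this example.<br>

```python
# Toy encoder for "MOV r/m, r" (opcode 0x89) between registers.
# Assumption: 32-bit or 64-bit mode, where the default operand size is 32 bits.
# encode_mov_rr and REG are illustrative names, not part of any real tool.

OPERAND_SIZE_PREFIX = 0x66  # selects 16-bit operands in 32/64-bit mode

# Register numbers shared by the 16-bit and 32-bit views (ax/eax, cx/ecx, ...).
REG = {"ax": 0, "cx": 1, "dx": 2, "bx": 3}


def encode_mov_rr(src, dst, bits):
    """Return the bytes for movw/movl src, dst (AT&T operand order)."""
    # ModRM: mod=11 (register direct), reg=source, rm=destination.
    modrm = 0xC0 | (REG[src] * 8) | REG[dst]
    body = bytes([0x89, modrm])
    if bits == 16:
        return bytes([OPERAND_SIZE_PREFIX]) + body
    return body


movw = encode_mov_rr("ax", "bx", 16)  # movw %ax, %bx
movl = encode_mov_rr("ax", "bx", 32)  # movl %eax, %ebx
print(movw.hex(), len(movw))
print(movl.hex(), len(movl))
```

The two encodings differ only in the prefix byte, so widening is a one-byte saving per instruction whenever the upper 16 bits of the destination are not needed.<br>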

<br>

If you have any comments or disagreements with that direction please let me know.<br>

<br>

Kevin B. Smith<br>

<br>

-----Original Message-----<br>

From: Chandler Carruth [mailto:<a href="mailto:chandlerc@gmail.com" target="_blank">chandlerc@gmail.com</a>]<br>

Sent: Tuesday, May 12, 2015 10:49 AM<br>

To: Smith, Kevin B; <a href="mailto:qcolombet@apple.com" target="_blank">qcolombet@apple.com</a>; <a href="mailto:chisophugis@gmail.com" target="_blank">chisophugis@gmail.com</a>; <a href="mailto:llvm-dev@redking.me.uk" target="_blank">llvm-dev@redking.me.uk</a>; Demikhovsky, Elena; <a href="mailto:spatel@rotateright.com" target="_blank">spatel@rotateright.com</a><br>

Cc: Kuperstein, Michael M; <a href="mailto:ahmed.bougacha@gmail.com" target="_blank">ahmed.bougacha@gmail.com</a>; <a href="mailto:llvm-commits@cs.uiuc.edu" target="_blank">llvm-commits@cs.uiuc.edu</a><br>

Subject: Re: [PATCH] PR 23155 - Improvement to X86 16 bit operation promotion for better performance.<br>

<br>

In <a href="http://reviews.llvm.org/D9209#162887" target="_blank">http://reviews.llvm.org/D9209#162887</a>, @kbsmith1 wrote:<br>

<br>

> From Chandler's comments in 22473:<br>

>  We need to add a pass that replaces movb (and movw) with movzbl (and movzwl) when the destination is a register and the high bytes aren't used. Then we need to benchmark bzip2 to ensure that this recovers all of the performance that forcing the use of cmpl did, and probably some other sanity benchmarking. Then we can swap the cmpl formation for the movzbl formation.<br>

><br>

> I agree that this would be a good solution.  If you, Chandler, and Eric all like that direction, I am willing to work on it.  I also have access to the SPEC benchmarks, both 2000 and 2006, so I can benchmark bzip2 specifically, since that is something the community considers important.<br>
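<br>
The movb to movzbl replacement described in the quoted comment can be sketched on a toy instruction list. This is only a model under stated assumptions, not the proposed pass and not LLVM's MachineInstr API: the Instr class, the high_bits_live flag (standing in for a real liveness query), and the small register table are all hypothetical.<br>

```python
from dataclasses import dataclass, replace


# Hypothetical toy instruction, not LLVM's MachineInstr.
@dataclass(frozen=True)
class Instr:
    opcode: str           # e.g. "movb", "movw", "movzbl"
    src: str              # source operand (register or memory, AT&T syntax)
    dst: str              # destination operand
    high_bits_live: bool  # assumed input: are dst's upper bits read later?


# Narrow move mapped to its zero-extending 32-bit counterpart.
WIDEN = {"movb": "movzbl", "movw": "movzwl"}

# Low 8/16-bit register mapped to its 32-bit super-register (small subset;
# high-byte registers such as %ah are deliberately left out).
SUPER_REG = {"%al": "%eax", "%bl": "%ebx", "%cl": "%ecx", "%dl": "%edx",
             "%ax": "%eax", "%bx": "%ebx", "%cx": "%ecx", "%dx": "%edx"}


def widen_moves(instrs):
    """Rewrite narrow moves into registers whose upper bits are dead."""
    out = []
    for ins in instrs:
        can_widen = (ins.opcode in WIDEN
                     and ins.dst in SUPER_REG      # destination is a register
                     and not ins.high_bits_live)   # upper bits are not used
        if can_widen:
            ins = replace(ins, opcode=WIDEN[ins.opcode],
                          dst=SUPER_REG[ins.dst])
        out.append(ins)
    return out


prog = [
    Instr("movb", "(%rdi)", "%al", False),  # widened
    Instr("movb", "%cl", "(%rsi)", False),  # store to memory: untouched
    Instr("movw", "(%rdi)", "%dx", True),   # upper bits live: untouched
]
result = widen_moves(prog)
```

Writing the full 32 bit register removes the merge with the old upper bits, which is the false dependence discussed in this thread; the stores and the live-upper-bits case are left alone because widening them would change behavior.<br>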

<br>

<br>

I would be *very* interested in this, and would love it if you could work on it. I suspect you're in a much better position to implement, document, and evaluate the results. We really need to kill the 'cmpl' hack that is currently used.<br>

<br>

<br>

REPOSITORY<br>

  rL LLVM<br>

<br>

<a href="http://reviews.llvm.org/D9209" target="_blank">http://reviews.llvm.org/D9209</a><br>

<br>


</blockquote></div>