[PATCH] Avoid generating SHLD/SHRD for architectures that are known to have poor latency for these instructions.
nrotem at apple.com
Fri Nov 15 10:35:38 PST 2013
On Nov 14, 2013, at 9:43 PM, Katya Romanova <Katya_Romanova at playstation.sony.com> wrote:
> kromanova added you to the CC list for the revision "Avoid generating SHLD/SHRD for architectures that are known to have poor latency for these instructions.".
> SHLD/SHRD are VectorPath (microcode) instructions known to have poor latency on certain architectures.
> While generating SHLD/SHRD instructions is acceptable when optimizing for size, code optimized for speed on these platforms should instead use alternative sequences of instructions composed of add, adc, shr, and lea, which are DirectPath instructions. These alternative sequences not only have lower latency but also increase decode bandwidth by allowing a third DirectPath instruction to be decoded simultaneously.
> For example, for the expression:
> return x >> 7 | y << 57;
> the generated instruction is:
> shrd $7, %rax, %rdx
> but we should actually prefer:
> shl $57, %rax
> shr $7, %rdx
> or %rax, %rdx
> which are all DirectPath instructions.
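The equivalence claimed above can be checked with a small model. This is only a sketch: the function names are hypothetical, the register assignment (x in %rdx, y in %rax) is assumed from the assembly above, and the reference function models a 64-bit double-precision right shift bit by bit rather than executing the real instruction.

```cpp
#include <cstdint>

// Reference model (hypothetical helper): treat src:dst as a 128-bit value,
// shift it right by c, and keep the low 64 bits. Requires c < 64.
std::uint64_t double_shift_right_reference(std::uint64_t dst, std::uint64_t src,
                                           unsigned c) {
    std::uint64_t result = 0;
    for (unsigned i = 0; i < 64; ++i) {
        unsigned from = i + c;  // bit index within the 128-bit value src:dst
        std::uint64_t bit = (from < 64) ? (dst >> from) & 1u
                                        : (src >> (from - 64)) & 1u;
        result |= bit << i;
    }
    return result;
}

// The three-instruction alternative, one statement per instruction.
// Assumes x lives in %rdx and y in %rax, as in the assembly above.
std::uint64_t shift_or_sequence(std::uint64_t x, std::uint64_t y) {
    std::uint64_t rdx = x, rax = y;
    rax <<= 57;  // shl $57, %rax
    rdx >>= 7;   // shr $7, %rdx
    rdx |= rax;  // or  %rax, %rdx
    return rdx;
}
```

Both functions should agree for any pair of inputs when the reference is called with c = 7, which is what makes the replacement safe from a correctness standpoint; the performance claim is a separate question that only measurements can answer.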
> AMD processor families K7, K8, K10, K12, K15, and K16 are known to have SHLD/SHRD instructions with very poor latency. The optimization guides for these processors recommend using an alternative sequence of instructions.
> I couldn't find an optimization guide for AMD's K14 processor family on the Web, but actual performance measurements showed a 30% speedup on Bobcat (family K14). I'd like to get confirmation from the community's AMD experts that family K14 processors have poor-latency SHLD/SHRD instructions.
> Experiments on Ivy Bridge showed a 15% improvement when an alternative sequence of instructions was generated (thanks to Dmitry Babokin from Intel for running the performance measurements for me). I would also like to hear from Intel experts. If you know which Intel processors should have the "poor latency for SHLD/SHRD instructions" flag set, please let me know.
> Here are the references to AMD's processor optimization guides:
> K7 families: http://www.bartol.udel.edu/mri/sam/Athlon_code_optimization_guide.pdf
> Athlon, Athlon-tbird, Athlon-4, Athlon-xp, Athlon-mp
> K8 families: http://developer.amd.com/wordpress/media/2012/10/25112.pdf
> Athlon64, Opteron, AMD 64 FX, AMD k8-sse, AMD Athlon64-sse3, AMD Opteron-sse3
> K10 and K12 families:
> -> Software Optimization Guide for AMD Family 10h and 12h Processors
> K14 family: AMD btver1 (Bobcat)
> -> Couldn't find an optimization guide for AMD Family 14h, but I think the SHLD documentation is applicable to Bobcat as well.
> K15 family: bdver1 (Bulldozer), bdver2 (Piledriver)
> -> search for "Software Optimization Guide for AMD Family 15h Processors"
> K16 family: btver2 (Jaguar)
> -> http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/
> Description of the changes:
> Introduced a new feature, FeatureSlowSHLD, which should be set for architectures that are
> known to have SHLD/SHRD instructions with very poor latency.
> Enabled this feature for all AMD family K8-K16 architectures.
> Don't fold (x << c) | (y >> (64 - c)) into SHLD/SHRD if these instructions
> have high latencies and we are not optimizing for size.
> Set IsSHLDSlow to false by default.
> When autodetecting subtarget features, set IsSHLDSlow to true for AMD processors.
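The decision logic described in the list of changes can be sketched as a tiny stand-alone model. Everything here except the name IsSHLDSlow is an assumption made for illustration: SubtargetModel, autodetect, and mayFoldToDoubleShift are hypothetical names and not LLVM's actual classes or functions, and matching on the CPUID vendor string "AuthenticAMD" is only one plausible way to express "AMD processors".

```cpp
#include <string>

// Hypothetical stand-in for the subtarget state; not LLVM's real class.
struct SubtargetModel {
    bool IsSHLDSlow = false;  // false by default, as in the description
};

// Sketch of feature autodetection: mark SHLD/SHRD as slow on AMD processors.
// Using the CPUID vendor string here is an assumption for illustration.
SubtargetModel autodetect(const std::string& cpuidVendor) {
    SubtargetModel st;
    if (cpuidVendor == "AuthenticAMD")
        st.IsSHLDSlow = true;
    return st;
}

// The fold into SHLD/SHRD is allowed when optimizing for size, or when the
// target is not known to have slow SHLD/SHRD.
bool mayFoldToDoubleShift(const SubtargetModel& st, bool optForSize) {
    return optForSize || !st.IsSHLDSlow;
}
```

With this model, an autodetected AMD target folds to a double shift only when optimizing for size, while any other vendor keeps folding unconditionally.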