[PATCH] Avoid generating SHLD/SHRD for architectures that are known to have poor latency for these instructions.

Katya Romanova Katya_Romanova at playstation.sony.com
Thu Nov 14 21:43:42 PST 2013

kromanova added you to the CC list for the revision "Avoid generating SHLD/SHRD for architectures that are known to have poor latency for these instructions.".

SHLD/SHRD are VectorPath (microcoded) instructions known to have poor latency on certain architectures.
While generating SHLD/SHRD is acceptable when optimizing for size, code optimized for speed on these platforms should instead use alternative instruction sequences composed of add, adc, shr, and lea, which are DirectPath instructions. These alternatives not only have lower latency but also increase decode bandwidth by allowing a third DirectPath instruction to be decoded simultaneously.

For example, for the following code:

    return x >> 7 | y << 57;

The generated instruction sequence is:
    shrd $7, %rax, %rdx

We should actually prefer:
    shl $57, %rax
    shr $7, %rdx
    or %rax, %rdx

which are all DirectPath instructions.
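As a source-level illustration of the expression above, here is a minimal C sketch. The function names are hypothetical and are not part of the patch. It only shows the bit pattern a compiler may recognize as a 64-bit double shift: the low 7 bits of y end up in the high 7 bits of the result.

```c
#include <stdint.h>

/* The expression from the example: x is shifted right by 7 and the
   vacated high bits are filled with the low 7 bits of y. */
uint64_t funnel_right_7(uint64_t x, uint64_t y) {
    return (x >> 7) | (y << 57);
}

/* Hypothetical generalization for illustration only.
   Valid only for 0 < c < 64: shifting a 64-bit value by 64 is
   undefined behavior in C. */
uint64_t funnel_right(uint64_t x, uint64_t y, unsigned c) {
    return (x >> c) | (y << (64 - c));
}
```

Whether a compiler emits a single double-shift instruction or the three-instruction sequence for this code depends on the target and optimization settings.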

AMD processor families K7, K8, K10, K12, K15, and K16 are known to have very poor SHLD/SHRD latency. The optimization guides for these processors recommend using an alternative sequence of instructions.

I couldn't find an optimization guide for AMD family K14 processors on the Web, but actual performance measurements showed a 30% speedup on Bobcat (family K14). I'd like confirmation from the community's AMD experts that family K14 processors have poor-latency SHLD/SHRD instructions.

Experiments on Ivy Bridge showed a 15% improvement when the alternative sequence of instructions was generated (thanks to Dmitry Babokin from Intel for running the performance measurements for me). I would also like to hear from Intel experts: if you know which Intel processors should be flagged as having poor latency for SHLD/SHRD instructions, please let me know.

Here are the references to AMD's processor optimization guides:
K7 families: http://www.bartol.udel.edu/mri/sam/Athlon_code_optimization_guide.pdf
Athlon, Athlon-tbird, Athlon-4, Athlon-xp, Athlon-mp

K8 families: http://developer.amd.com/wordpress/media/2012/10/25112.pdf
Athlon64, Opteron, AMD 64 FX, AMD k8-sse, AMD Athlon64-sse3, AMD Opteron-sse3

K10 and K12:
-> Software Optimization Guide for AMD Family 10h and 12h Processors

AMD btver1 (Bobcat)
-> Couldn't find an optimization guide for AMD Family 14, but I think the SHLD documentation is applicable to Bobcat as well.

bdver1 (Bulldozer), bdver2 (Piledriver)
-> search for "Software Optimization Guide for AMD Family 15h Processors"

btver2 (Jaguar)
-> http://developer.amd.com/resources/documentation-articles/developer-guides-manuals/

Description of the changes:

Introduced a new feature, FeatureSlowSHLD, that should be set for architectures
known to have SHLD/SHRD instructions with very poor latency.
Enabled this feature for all AMD family K8-K16 architectures.

Don't fold (or (x << c) | (y >> (64 - c))) into SHLD/SHRD if these instructions
have high latency and we are not optimizing for size.
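The folding rule above can be sketched as a small predicate. This is a simplified illustration, not the code from the patch: the function names and the flat boolean parameters are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical helper: do the two shift amounts form a 64-bit
   double-shift pattern, i.e. (x << c) | (y >> (64 - c))? */
bool is_double_shift_pattern(unsigned shl_amt, unsigned shr_amt) {
    return shl_amt > 0 && shr_amt > 0 && shl_amt + shr_amt == 64;
}

/* Hypothetical decision function: fold into SHLD/SHRD unless the
   target has slow SHLD/SHRD and we are optimizing for speed. */
bool should_fold_to_shld(unsigned shl_amt, unsigned shr_amt,
                         bool is_shld_slow, bool opt_for_size) {
    if (!is_double_shift_pattern(shl_amt, shr_amt))
        return false;  /* not a double-shift candidate at all */
    if (is_shld_slow && !opt_for_size)
        return false;  /* keep the separate shl/shr/or sequence */
    return true;
}
```

With these assumptions, a slow-SHLD target still gets the single instruction when optimizing for size, since SHLD/SHRD is the more compact encoding.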

Set IsSHLDSlow to false by default.
When autodetecting subtarget features, set IsSHLDSlow to true for AMD processors.
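A minimal sketch of that default-plus-autodetection behavior, assuming the vendor is identified by the CPUID vendor string ("AuthenticAMD" for AMD). The struct and function names are hypothetical and do not reflect how the patch itself is written.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical container for the one flag discussed here. */
struct subtarget_flags {
    bool is_shld_slow;
};

/* Hypothetical autodetection: the flag defaults to false and is set
   to true only when the vendor string identifies an AMD processor. */
struct subtarget_flags autodetect_flags(const char *vendor) {
    struct subtarget_flags flags = { false };  /* default: SHLD is not slow */
    if (vendor != NULL && strcmp(vendor, "AuthenticAMD") == 0)
        flags.is_shld_slow = true;
    return flags;
}
```

Any other vendor string, such as "GenuineIntel", leaves the flag at its default in this sketch.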


-------------- next part --------------
A non-text attachment was scrubbed...
Name: D2177.1.patch
Type: text/x-patch
Size: 17250 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20131114/2e6fad5d/attachment.bin>
