[PATCH] Avoid generating SHLD/SHRD for architectures that are known to have poor latency for these instructions.
nrotem at apple.com
Fri Nov 15 15:39:25 PST 2013
It sounds like modern processors don’t implement SHLD/SHRD efficiently. Do we know of any x86 processors that DO benefit from this peephole? We may be able to delete this code altogether.
On Nov 15, 2013, at 3:34 PM, Eric Christopher <echristo at gmail.com> wrote:
> Awesome. This should probably be turned on at least for Ivy Bridge
> where we have numbers.
> Nadav: ?
> On Fri, Nov 15, 2013 at 3:08 PM, Katya Romanova
> <Katya_Romanova at playstation.sony.com> wrote:
>> Hi Eric,
>> I didn't have access to a machine with one of Intel's latest processors, so I asked a friend, Dmitry Babokin, who works on the ISPC compiler in Moscow, to run this performance test on one of Intel's latest architectures. I generated two assembly files with the LLVM compiler (with and without SHLD) for the following test:
>> int64_t s128(uint64_t a, uint64_t b, int shift) {
>>   return (a << shift) | (b >> (64-shift));
>> }
>> uint64_t s128i(uint64_t a, uint64_t b) {
>>   return s128(a, b, 7);
>> }
>> Dmitry called the s128i function 100 million times. The test with the shld instruction took 2.18 sec to finish. The test using the alternative sequence of instructions took 1.89 sec, which is 13.3% less time. All the experiments were done on the Ivy Bridge architecture.
>> Dmitry also confirmed that on Ivy Bridge, Intel's compiler 13.0 generates code *without* shld instructions.
>> It would be nice to get a full list of Intel architectures where the shld instruction has very high latency.
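For anyone who wants to repeat the measurement on other hardware, here is a rough sketch of the kind of timing loop described above. It is not the harness Dmitry actually used. The helper name time_s128i, the use of clock(), the noinline attribute (a GCC/Clang extension), and the accumulator that keeps the calls from being optimized away are all assumptions made for illustration. The return type is written as uint64_t throughout for simplicity.

```c
#include <stdint.h>
#include <time.h>

/* Same computation as the test case in the thread: the upper 64 bits of
   the 128-bit value (a:b) shifted left by `shift`. Only valid for
   0 < shift < 64, because shifting a 64-bit value by 64 is undefined in C. */
static uint64_t s128(uint64_t a, uint64_t b, int shift) {
  return (a << shift) | (b >> (64 - shift));
}

/* noinline is assumed here so the compiler keeps a real call in the loop. */
__attribute__((noinline)) uint64_t s128i(uint64_t a, uint64_t b) {
  return s128(a, b, 7);
}

/* Hypothetical harness: calls s128i `iterations` times and returns the
   elapsed CPU time in seconds. The running result is written to *sink so
   the loop has an observable effect. */
double time_s128i(uint64_t iterations, uint64_t *sink) {
  uint64_t acc = 0;
  clock_t start = clock();
  for (uint64_t i = 0; i < iterations; ++i)
    acc += s128i(i, acc);
  clock_t end = clock();
  *sink = acc;
  return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_s128i(100000000, &sink) from main once for a build that emits shld and once for a build that does not should give numbers comparable in spirit to the ones quoted above, though absolute times will depend on the machine and compiler flags.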