[LLVMdev] thinking about timing-test-driven scheduler

orthochronous orthochronous at gmail.com
Fri Jun 11 13:49:03 PDT 2010


On Fri, Jun 11, 2010 at 3:17 PM,  <Kalle.Raiskila at nokia.com> wrote:
> On Wed, 2010-06-09 at 17:30 +0200, orthochronous wrote:

>> I've been thinking about how to implement a framework for attempting
>> instruction scheduling of small blocks of code by using (GA/simulated
>> annealing/etc) controlled timing-test-evaluations of various
>> orderings.
>
> This sounds interesting.
>
>> (I'm particularly interested small-ish numerical inner loop
>> code in low-power CPUs like Atom and various ARMs where there CPU
>> doesn't have the ability to "massage" the compiler's scheduling.)
>
> Have you tried to do some benchmarking here? E.g. high-end x86 vs. Atom
> processors? Does the current scheduler do a bad job here?

Some caveats: I was looking at LLVM because it's jit capability ought
to make it useful for doing GA based stuff rather than the intrinsic
quality of the LLVM scheduler. I'm more of an image
processing/scientific computing guy rather than a compiler guy, but
I'd expect that for common unpredictable branchy code (especially on
OOO cpus) the difference between a good scheduling and a close to
optimal schedule is probably in the noise. For the kind of non-branchy
numeric code I'd expect that IF THE STOCHASTIC SEARCH CAN BE MADE TO
ACTUALLY WORK you might be able to get 5-10 percent consistently.

I don't have access to a high-end x86 at the moment; here's some
benchmark results on an Atom N270 that I don't completely understand,
and may well be wrong. I need to dig in further and see precisely
what's happening. The program evaluates both

uint64_t
intrinsicsFn(__m128i* __restrict in,__m128i* __restrict out){
    uint64_t startTSC,endTSC;
    int i=0;
    rdtscll(startTSC);
    do{
        __m128i xmm0=in[0/16];
        __m128i xmm1=in[16/16];
        __m128i xmm2=in[32/16];
        __m128i xmm3=in[48/16];
        xmm0=_mm_add_epi16(in[64/16],xmm0);
        xmm1=_mm_add_epi16(in[80/16],xmm1);
        xmm2=_mm_add_epi16(in[96/16],xmm2);
        xmm3=_mm_add_epi16(in[112/16],xmm3);
        xmm0=_mm_add_epi16(xmm1,xmm0);
        xmm2=_mm_add_epi16(xmm3,xmm2);
        xmm0=_mm_add_epi16(xmm2,xmm0);
        out[0/16]=xmm0;
        in+=8;out+=1;
    }while(++i<LIM);
    rdtscll(endTSC);
    return endTSC-startTSC;
}

and the 9072 different valid reorderings of the hopefully equivalent
inline assembly style function

uint64_t f9072(__m128i * __restrict in,__m128i * __restrict out){
uint64_t startTSC,endTSC;
int i=0;
rdtscll(startTSC);
do{
__asm__ __volatile__(
"\tmovdqa 0(%1),%%xmm0\n"
"\tmovdqa 16(%1),%%xmm1\n"
"\tmovdqa 32(%1),%%xmm2\n"
"\tmovdqa 48(%1),%%xmm3\n"
"\tpaddw 64(%1),%%xmm0\n"
"\tpaddw 80(%1),%%xmm1\n"
"\tpaddw 96(%1),%%xmm2\n"
"\tpaddw 112(%1),%%xmm3\n"
"\tpaddw %%xmm1,%%xmm0\n"
"\tpaddw %%xmm3,%%xmm2\n"
"\tpaddw %%xmm2,%%xmm0\n"
"\tmovdqa %%xmm0,(%0)\n"
:"=r" (out):"r" (in):"xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7");
in+=8;out+=1;
}while(++i<LIM);
rdtscll(endTSC);
return endTSC-startTSC;
}

Note that my inline asm isn't that good, so the array ptr increments
are forced to be at end for the asm versions, which is probably not
good for speed.

I'd like to try a bigger inner loop but the number of combinations
obviously grows too rapidly to try this way rather than via a
stochastic sampling approach. Running those through g++-4.4 (which
DOESN'T have an atom pipeline model) and clang++ trunk gives the
attached graph (y is proportional to runtime whilst x is scheduling
instance, ordered by increasing y value).

The minimum g++ inline assembly: 73200
The time for g++ -O3 optimisation of intrinsics fn: 93712
The minimum clang++ inline assembly: 79275
The time for clang++ -O3 optimisation of intrinsics fn: 72581

Speculations: the clang intrinsics is finding some optimisation that
it is being blocked on the asm version, so it's not just the SSE
scheduling that's being measured. g++ is doing systematically some
optimisation on the asm versions that clang isn't, hence the rough
"curve offset". For some reason g++ is picking a noticeably worse
ordering for the intrinsics; I need to find an ubuntu package for
g++-4.5 and see if the atom pipeline model improves things.

(I can make the hacked together benchmark code available by email if
anyone wants it.)

This probably raises as many questions as it answers... but it
suggests the idea may be worthwhile for really intense inner loop
code.

Regards,
David Steven Tweed
-------------- next part --------------
A non-text attachment was scrubbed...
Name: timeForSchedules.ps
Type: application/postscript
Size: 255628 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20100611/aa23896d/attachment.ps>


More information about the llvm-dev mailing list