[LLVMdev] LLVM x86 Code Generator discards Instruction-level Parallelism

Evan Cheng evan.cheng at apple.com
Thu Nov 4 22:51:16 PDT 2010


The default x86 scheduler schedules to reduce register pressure since that's the primary concern for most code. list-ilp scheduler is designed to address the kind of issue you described but it's not quite done yet. Please file a bug and attach your microbenchmark.

Evan

On Nov 3, 2010, at 11:34 AM, Andrew R Kerr wrote:

> Dear LLVMdev,
> 
> I've noticed an unusual behavior of the LLVM x86 code generator (with default options)
> that results in nearly a 4x slow-down in floating-point throughput for my microbenchmark.
> 
> I've written a compute-intensive microbenchmark to approach theoretical peak
> throughput of the target processor by issuing a large number of independent
> floating-point multiplies. The distance between dependent instructions is at
> least four instructions to match the latency of the floating-point functional
> unit in my Intel Core2 Quad (Q9550 at 2.83 GHz).
> 
> The microbenchmark itself replicates the following block 512 times:
> 
> .
> .
> {
>  p1 = p1 * a;
>  p2 = p2 * b;
>  p3 = p3 * c;
>  p4 = p4 * d;
> }
> .
> .
> 
> Compiling with NVCC, Ocelot, and LLVM, I can confirm the interleaved instruction
> schedule with a four-instruction reuse distance. An excerpt follows:
> 
>  .
>  .
>  %r1500 = fmul float %r1496, %r24  ; compute %1500
>  %r1501 = fmul float %r1497, %r23
>  %r1502 = fmul float %r1498, %r22
>  %r1503 = fmul float %r1499, %r21
> 
>  %r1504 = fmul float %r1500, %r24  ; first use of %1500
>  %r1505 = fmul float %r1501, %r23
>  %r1506 = fmul float %r1502, %r22
>  %r1507 = fmul float %r1503, %r21
> 
>  %r1508 = fmul float %r1504, %r24  ; first use of %1504
>  .
>  .
> 
> The JIT compiler, however, seems to break the interleaving of independent instructions
> and issues a long sequence of instructions with back-to-back dependencies. It is as if
> all p1 = .. expressions are collected at once followed by all p2 = .. expressions and so
> forth.
> 
> p1 = p1 * a
> p1 = p1 * a
> .
> .
> 
> p2 = p2 * b
> p2 = p2 * b
> .
> .
> 
> p3 = p3 * c
> p3 = p3 * c
> .
> .
> 
> An actual excerpt of the generated x86 assembly follows:
> 
>  mulss        %xmm8, %xmm10
>  mulss        %xmm8, %xmm10
>  .
>  . repeated 512 times
>  .
> 
>  mulss        %xmm7, %xmm9
>  mulss        %xmm7, %xmm9
>  .
>  . repeated 512 times
>  .
> 
>  mulss        %xmm6, %xmm3
>  mulss        %xmm6, %xmm3
>  .
>  . repeated 512 times
>  .
> 
> Since p1, p2, p3, and p4 are all independent, this reordering is correct. This would have
> the possible advantage of reducing live ranges of values. However, in this microbenchmark,
> the number of live values is eight single-precision floats - well within SSE's sixteen
> architectural registers.
> 
> Moreover, the pipeline latency is four cycles meaning back-to-back instructions with true
> dependencies cannot issue immediately. The benchmark was intentionally written to avoid this
> hazard but LLVM's code generator seems to ignore that when it schedules instructions.
> 
> When I run this benchmark on my 2.83 GHz CPU, I observe the following performance results:
> 
>  1 threads       0.648891 GFLOP/s
>  2 threads       1.489049 GFLOP/s
>  3 threads       2.209838 GFLOP/s
>  4 threads       2.940443 GFLOP/s
> 
> When I rewrite the generated assembly by hand to exhibit the same interleaving as in the
> LLVM IR form
> 
>  .
>  .
>  mulss        %xmm8, %xmm10
>  mulss        %xmm7, %xmm9
>  mulss        %xmm6, %xmm3
>  mulss        %xmm5, %xmm11
> 
>  mulss        %xmm8, %xmm10
>  mulss        %xmm7, %xmm9
>  mulss        %xmm6, %xmm3
>  mulss        %xmm5, %xmm11
> 
>  mulss        %xmm8, %xmm10
>  mulss        %xmm7, %xmm9
>  .
>  .
> 
> observed performance increases by nearly a factor of four:
> 
>  1 threads       2.067118 GFLOP/s
>  2 threads       5.569419 GFLOP/s
>  3 threads       8.285519 GFLOP/s
>  4 threads       10.81742 GFLOP/s
> 
> I show similar results with packed single-precision floating point SIMD instructions. The
> LLVM-generated machine code is nearly 4x slower than the interleaved schedule 'suggested' by
> the LLVM IR input.
> 
>  Vectorized - No instruction interleaving - back-to-back dependencies
>    (issued by LLVM code generator)
> 
>  1 threads        1.540621 GFLOP/s
>  2 threads        5.900833 GFLOP/s
>  3 threads        8.755953 GFLOP/s
>  4 threads        11.257122 GFLOP/s
> 
>  Vectorized - Interleaved - stride-4 reuse distance
>    (hand-modified to interleave instructions)
> 
>  1 threads       3.157255 GFLOP/s
>  2 threads       22.104369 GFLOP/s
>  3 threads       32.300111 GFLOP/s
>  4 threads       39.112162 GFLOP/s
> 
> It is worth noting that 39.1 GFLOP/s is approaching the theoretical limits of the processor
> (stated to be 45.25 GFLOP/s single-precision).
> 
> I've used the llc utility to test various pre-register-allocation instruction schedulers
> with the following results:
> 
>  -pre-RA-sched
>    =default         - discards ILP
>    =list-burr       - discards ILP
>    =list-tdrr       - crashes during code generation
> 
>    =source          - preserves interleaved instruction ordering and ILP
> 
>    =list-hybrid     - discards ILP
>    =list-ilp        - discards ILP
>    =list-td         - crashes during code generation
>    =fast            - discards ILP
> 
> Can you comment?
> 
> -- 
> Andrew Kerr
> arkerr at gatech.edu
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev





More information about the llvm-dev mailing list