[AArch64] Enable partial unrolling on cortex-a57 and 2 related improvement

Tue Mar 3 19:29:03 PST 2015

----- Original Message -----
> From: "Kevin Qin" <kevinqindev at gmail.com>
> To: "llvm-commits" <llvm-commits at cs.uiuc.edu>
> Sent: Friday, February 27, 2015 11:55:05 PM
> Subject: [AArch64] Enable partial unrolling on cortex-a57 and 2 related improvement
> 
> 
> 
> Hi,
> 
> 
> Previously, I made commit r219401, which tried to enable partial &
> runtime unrolling on cortex-a57, but I forgot to call the base TTI
> implementation in the target-specific hook, so those unrolling methods
> were never actually enabled.
> 
> 
> Here is the patch to get them enabled, plus 2 related patches to
> improve it.
> 
> 
> 0001 - Run the LICM pass after the loop unrolling pass. Runtime unrolling
> will introduce a runtime check in the loop prologue (you can treat it as
> a loop preheader). If the unrolled loop is an inner loop, then the
> prologue will be inside the outer loop. The LICM pass can help to
> hoist the runtime check out of the outer loop if the checked value is
> loop invariant.

I think this makes sense, at least for LICM, and it is consistent with what James observed from the early run of the unroller. Please add a comment explaining why those passes are there. This file does not have many 'rationale' comments, and this is not a good thing. Why are you adding CVP? Can you please add some test cases? (We normally don't add tests that run the full pipeline, but for testing the pipeline itself, it is a good idea.)
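
To make the LICM rationale concrete, here is a minimal hand-written model of what runtime unrolling by 4 does to an inner loop. It is only a sketch: the function name `sumRows` and the unroll factor are assumptions for illustration, not output from the patch. The leftover count `rem` depends only on `cols`, which is invariant in the outer loop, so it can be computed once outside of it; that is the kind of hoisting LICM is expected to do for the runtime check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-written model of a runtime-unrolled inner loop (unroll factor 4).
// sumRows is a hypothetical name used only for this illustration.
long sumRows(const std::vector<std::vector<int>> &m, std::size_t cols) {
  long total = 0;
  // The leftover count depends only on cols, which never changes inside the
  // outer loop, so it is computed once here instead of once per row.
  const std::size_t rem = cols % 4;
  for (const auto &row : m) {          // outer loop
    assert(row.size() >= cols);        // precondition of this sketch
    std::size_t j = 0;
    for (; j < rem; ++j)               // prologue: leftover iterations
      total += row[j];
    for (; j < cols; j += 4)           // unrolled body: 4 elements per trip
      total += row[j] + row[j + 1] + row[j + 2] + row[j + 3];
  }
  return total;
}
```

Without the hoisting, the computation of `rem` would sit inside the outer loop and be repeated for every row, which is the overhead patch 0001 is trying to avoid.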

> 
> 
> 0002 - Introduce runtime unrolling disable metadata and use it to
> mark the scalar loop from vectorization. Runtime unrolling is an
> expensive optimization which is beneficial only if the loop is
> hot and its iteration count is relatively large. For some loops,
> we know they are not worth runtime unrolling. The scalar loop
> from vectorization is one of those cases.

I think this is a good idea. However, I think we might be overlooking something. Normally, the purpose of the scalar loop is only to handle the 'left over' part of the iteration space that is not divisible by the vector length. However, if there are runtime safety checks, and those checks generally fail, then the scalar loop runs the entire iteration space and could be hot. Can we exclude the case where we've emitted safety checks?
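
The concern can be sketched with a hand-written model of a loop vectorized by 4 behind a runtime overlap check. Everything here is an assumption for illustration (the names `Stats` and `addOne`, the vector width, and the form of the overlap check); it is not what the vectorizer emits. It only shows that the same scalar loop serves both as the remainder loop and as the fallback when the check fails.

```cpp
#include <cstddef>
#include <cstdint>

// Iteration counters, present only to make the two paths observable.
struct Stats {
  std::size_t vectorIters = 0;
  std::size_t scalarIters = 0;
};

// Hypothetical model of "dst[i] = src[i] + 1" vectorized by 4.
Stats addOne(int *dst, const int *src, std::size_t n) {
  Stats s;
  std::size_t i = 0;
  // Runtime safety check: the two n-element ranges must not overlap.
  const auto d = reinterpret_cast<std::uintptr_t>(dst);
  const auto a = reinterpret_cast<std::uintptr_t>(src);
  const std::uintptr_t bytes = n * sizeof(int);
  const bool noOverlap = d + bytes <= a || a + bytes <= d;
  if (noOverlap) {
    for (; i + 4 <= n; i += 4) {       // stands in for the vector body
      for (std::size_t k = 0; k < 4; ++k)
        dst[i + k] = src[i + k] + 1;
      s.vectorIters += 4;
    }
  }
  // Scalar loop: at most 3 leftover iterations when the check passes,
  // but all n iterations when the check fails.
  for (; i < n; ++i) {
    dst[i] = src[i] + 1;
    ++s.scalarIters;
  }
  return s;
}
```

With disjoint arrays of 10 elements, the scalar loop runs 2 iterations; with overlapping ranges it runs all 10, which is the case where disabling runtime unrolling on it could hurt.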

> 
> 
> 0003 - Enable partial & runtime unrolling on cortex-a57, and double
> the unrolling threshold if the loop depth > 1. The inner loop of a
> loop nest is more likely to be hot, and the runtime check can be
> hoisted out thanks to patch 0001, so the overhead is lower and
> we can try a larger threshold to unroll more loops.
> 

+  if (L->getLoopDepth() > 1)
+    UP.PartialThreshold *= 2;

Please add a comment here.
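
Something along these lines would do. This is a self-contained sketch, not the actual hook: `UnrollPrefs` and `adjustPartialThreshold` are hypothetical stand-ins, and only `PartialThreshold` and the depth test mirror the quoted diff. The comment wording is taken from the rationale given for patch 0003.

```cpp
// Hypothetical stand-in for the unrolling preferences; only the field
// PartialThreshold mirrors a name from the quoted diff.
struct UnrollPrefs {
  unsigned PartialThreshold;
};

// loopDepth models the value L->getLoopDepth() would return
// (1 for an outermost loop, larger for nested loops).
void adjustPartialThreshold(unsigned loopDepth, UnrollPrefs &UP) {
  // The inner loops of a nest are more likely to be hot, and the runtime
  // check introduced by runtime unrolling can be hoisted out of the
  // enclosing loop by LICM, so the overhead is lower. Allow twice the
  // partial-unrolling threshold for them.
  if (loopDepth > 1)
    UP.PartialThreshold *= 2;
}
```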

 -Hal

> 
> 
> 
> Combining the above changes, we get the performance and
> code size changes below.
> 
> 
> Benchmark                      Execution time   Code bloat
> 
> 
> spec.cpu2000.179_art                 -16.567%       8.805%
> spec.cpu2000.177_mesa                 -2.771%       1.912%
> spec.cpu2006.483_xalancbmk            -2.555%       0.076%
> spec.cpu2000.256_bzip2                -1.648%       2.414%
> spec.cpu2006.433_milc                 -1.228%       1.353%
> spec.cpu2006.456_hmmer                -1.079%       2.413%
> 
> spec.cpu2006.462_libquantum            2.492%       1.482%
> spec.cpu2000.253_perlbmk               1.563%       0.464%
> spec.cpu2006.450_soplex                1.379%       1.925%
> spec.cpu2000.186_crafty                1.242%       0.005%
> 
> spec.geomean                          -0.546%       0.952%
> 
> 
> Please review. Thanks.
> 
> 
> --
> 
> 
> Best Regards,
> 
> 
> Kevin Qin
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory