[PATCH] D41953: Unroll and Jam

Thu Jan 11 10:18:35 PST 2018

dmgreen created this revision.
Herald added subscribers: kosarev, eraman, javed.absar, mgorny, mehdi_amini.

This is an implementation of unroll and jam, which comes up as useful for our smaller embedded processors. It is getting to the point where, although more work is needed in places, I have questions about how best to do this.

The basic idea is that we take an outer loop of the form:

for i..
  ForeBlocks(i)
  for j..
    SubLoopBlocks(i, j)
  AftBlocks(i)

Instead of doing normal inner or outer unrolling, we unroll as follows:

for i... i+=2
  ForeBlocks(i)
  ForeBlocks(i+1)
  for j..
    SubLoopBlocks(i, j)
    SubLoopBlocks(i+1, j)
  AftBlocks(i)
  AftBlocks(i+1)
Remainder
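
As a concrete illustration of the shape above, here is a hand-written sketch at the source level. It is not taken from the patch: the row-sum kernel, the type alias RowMatrix, and the function names are made up for the example, and it assumes a rectangular matrix so that both copies of the inner loop share one trip count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example type: a rectangular matrix stored as rows.
using RowMatrix = std::vector<std::vector<int>>;

// Original nest: Fore, SubLoop and Aft run once per outer iteration.
std::vector<int> rowSumsOriginal(const RowMatrix &A) {
  const std::size_t N = A.size();
  const std::size_t M = N ? A[0].size() : 0;
  std::vector<int> Sum(N);
  for (std::size_t i = 0; i < N; ++i) {
    int Acc = 0;                        // ForeBlocks(i)
    for (std::size_t j = 0; j < M; ++j)
      Acc += A[i][j];                   // SubLoopBlocks(i, j)
    Sum[i] = Acc;                       // AftBlocks(i)
  }
  return Sum;
}

// Unrolled and jammed by 2: two outer iterations share one inner loop.
std::vector<int> rowSumsUnrollAndJam(const RowMatrix &A) {
  const std::size_t N = A.size();
  const std::size_t M = N ? A[0].size() : 0;
  std::vector<int> Sum(N);
  std::size_t i = 0;
  for (; i + 1 < N; i += 2) {
    int Acc0 = 0;                       // ForeBlocks(i)
    int Acc1 = 0;                       // ForeBlocks(i+1)
    for (std::size_t j = 0; j < M; ++j) {
      Acc0 += A[i][j];                  // SubLoopBlocks(i, j)
      Acc1 += A[i + 1][j];              // SubLoopBlocks(i+1, j)
    }
    Sum[i] = Acc0;                      // AftBlocks(i)
    Sum[i + 1] = Acc1;                  // AftBlocks(i+1)
  }
  for (; i < N; ++i) {                  // Remainder: leftover outer iteration
    int Acc = 0;
    for (std::size_t j = 0; j < M; ++j)
      Acc += A[i][j];
    Sum[i] = Acc;
  }
  return Sum;
}
```

With an odd number of rows, the last row is handled by the remainder loop, and both functions should return the same sums.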

To do this we need to ensure that the ForeBlocks(i+1) can be moved before the SubLoopBlocks(i) and AftBlocks(i), which means potentially moving the phi node operands from AftBlocks into Fore. There are also memory dependency checks and other safety checks that are needed. The transform is then a fairly simple job of using the excellent existing unroll code for cloning blocks and gluing them all back together correctly.
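
To show why the memory dependency checks matter, here is a small sketch of a loop nest where jamming by hand changes the result. It is an illustration only, not the check the patch performs: the kernel, the alias IntGrid, and the function names are hypothetical. Iteration (i+1, j) reads A[i][j+1], which the original order writes earlier but the jammed order writes later.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example type: an N x M grid of ints, zero-initialised.
using IntGrid = std::vector<std::vector<int>>;

// Original order: all of row i is written before row i+1 reads it.
IntGrid diagOriginal(std::size_t N, std::size_t M) {
  IntGrid A(N, std::vector<int>(M, 0));
  for (std::size_t i = 1; i < N; ++i)
    for (std::size_t j = 0; j + 1 < M; ++j)
      A[i][j] = A[i - 1][j + 1] + 1;
  return A;
}

// Naively unrolled and jammed by 2, with no dependence check.
IntGrid diagJammed(std::size_t N, std::size_t M) {
  IntGrid A(N, std::vector<int>(M, 0));
  std::size_t i = 1;
  for (; i + 1 < N; i += 2)
    for (std::size_t j = 0; j + 1 < M; ++j) {
      A[i][j] = A[i - 1][j + 1] + 1;
      // Reads A[i][j+1] before the first statement has written it
      // (that write only happens in the next j iteration).
      A[i + 1][j] = A[i][j + 1] + 1;
    }
  for (; i < N; ++i)                    // Remainder
    for (std::size_t j = 0; j + 1 < M; ++j)
      A[i][j] = A[i - 1][j + 1] + 1;
  return A;
}
```

For a 5x4 grid the two versions disagree, for example at A[2][0], so a dependence of this kind has to block the transform.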

OK, but there are some things that still need sorting out:

- The dependency analysis is built upon the DependenceAnalysis that loop interchange uses. This might have been a mistake. I had to make some changes to DA to ensure that the AA gave correct results (and to enable TBAA). I'm not 100% sure they are correct, but they will obviously be pulled out into separate patches. My understanding is that DA may still be broken in some way though. Is there a better way I should be going here? I have not managed to break this with csmith, but there may be something going on that wouldn't be tested with that.

- This is currently kind of bolted onto the side of the LoopUnroller. It should either be integrated a little better or become its own pass. I have no strong opinion either way.

- The performance heuristics might not be sorted correctly yet. Several parts might be over-conservative when disabling for safety. The remainder loop may not be the best, and I'm not sure how it will play with the vectorisers, etc. It is hopefully a good start.

- I have tested this from the C level (i.e. csmith) but not a lot from the IR level with some form of fuzzer.

- I have not done anything with pragmas yet.

https://reviews.llvm.org/D41953

Files:
  include/llvm/Analysis/TargetTransformInfo.h
  include/llvm/Transforms/Utils/UnrollLoop.h
  lib/Analysis/DependenceAnalysis.cpp
  lib/Target/ARM/ARMTargetTransformInfo.cpp
  lib/Transforms/Scalar/LICM.cpp
  lib/Transforms/Scalar/LoopUnrollPass.cpp
  lib/Transforms/Utils/CMakeLists.txt
  lib/Transforms/Utils/LoopUnroll.cpp
  lib/Transforms/Utils/LoopUnrollAndJam.cpp
  lib/Transforms/Utils/LoopUtils.cpp
  test/Other/new-pm-defaults.ll
  test/Other/new-pm-thinlto-defaults.ll
  test/Other/pass-pipelines.ll
  test/Transforms/LoopUnroll/unroll-and-jam-disabled.ll
  test/Transforms/LoopUnroll/unroll-and-jam-unprofitable.ll
  test/Transforms/LoopUnroll/unroll-and-jam.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D41953.129294.patch
Type: text/x-patch
Size: 122172 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20180111/8f86c8f1/attachment-0001.bin>