[LLVMdev] Loop Unroll

Sun Nov 30 21:32:06 PST 2014

On 11/29/2014 6:52 AM, rcieszew wrote:
> Hello,
> I would like to create a VHDL backend for LLVM and am now testing the
> loop unroll passes. I would like to unroll a loop into parallel form
> (each basic block of the unrolled loop has the same parent node). At
> present I can only unroll a loop into serial form (each basic block is
> the parent node of the next).
> Is it possible to unroll a loop into parallel form (each basic block of
> the unrolled loop has the same parent node in the CFG)?

Hello Radoslaw,

As far as I can make out, there is a mismatch between the VHDL-level
picture that you have in mind and the way a traditional CPU compiler
works. Here, "unroll" simply means "serialize": the copies of the loop
body are placed one after another. A basic block with multiple
successors in LLVM ends in a terminator (for example a conditional
branch) that transfers control to exactly one of those successors. What
you have in mind is a way to transfer control to all successors in
parallel. This cannot be represented in LLVM IR.
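
To make the serial shape concrete, here is a toy model of the CFG you
get from full unrolling. It is only a sketch and does not use LLVM's
API: Block, unrollSerial and predecessors are hypothetical names, and
it assumes a known trip count with no early exits. Each copy of the
body has the previous copy as its only predecessor, which is the chain
you are seeing.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy CFG node; hypothetical, not an LLVM class.
struct Block {
    std::string name;
    std::vector<std::size_t> succs;  // indices of successor blocks
};

// Build the CFG shape of a loop body fully unrolled `factor` times:
// entry -> body.0 -> body.1 -> ... -> body.(factor-1) -> exit
std::vector<Block> unrollSerial(std::size_t factor) {
    std::vector<Block> cfg;
    cfg.push_back({"entry", {}});
    for (std::size_t i = 0; i < factor; ++i)
        cfg.push_back({"body." + std::to_string(i), {}});
    cfg.push_back({"exit", {}});
    // Every block except the last falls through to the next one.
    for (std::size_t i = 0; i + 1 < cfg.size(); ++i)
        cfg[i].succs.push_back(i + 1);
    return cfg;
}

// Return the indices of all predecessors of block `target`.
std::vector<std::size_t> predecessors(const std::vector<Block>& cfg,
                                      std::size_t target) {
    std::vector<std::size_t> preds;
    for (std::size_t i = 0; i < cfg.size(); ++i)
        for (std::size_t s : cfg[i].succs)
            if (s == target) preds.push_back(i);
    return preds;
}
```

Calling unrollSerial(4) gives six blocks in this model, and
predecessors() reports a single predecessor for each body copy. The
parallel form you describe would need "entry" to pass control to all
four copies at once, which a branch cannot express.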

The implicit assumption in your question is that there are no
loop-carried dependencies in the loop of interest, so that all
iterations can be executed in parallel. If so, there are several ways
to handle this:

 1. Merge all the unrolled basic blocks into one block. Then maybe the
    instruction-level parallelism between them will automatically show
    up in your VHDL. This is the simplest way to do it.
 2. Vectorize the loop body in LLVM, then generate VHDL entities that
    can handle vector inputs. This will be limited by the size of
    vectors that the LLVM vectorizer can generate. Also, your memory
    subsystem will need to handle vector loads and stores.
 3. This last one is purely in the VHDL generator: Somehow mark loops
    that can be parallelized, and generate a custom VHDL entity that
    captures the loop body. Then instead of generating a loop control
    structure in your VHDL, generate a fork/join structure that
    transfers control to multiple instances of your entity, one for each
    iteration of the loop. This will be limited by the number of
    load/store requests that your memory subsystem can accept in parallel.
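
As a rough source-level picture of option 1, consider a loop with no
dependencies between iterations. The function names and the trip count
N = 4 below are made up for illustration. After full unrolling and
merging, the body is one straight-line block in which no statement
reads a value written by another, so a VHDL generator could, in
principle, emit the statements as concurrent assignments.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 4;  // assumed trip count, known at compile time

// Rolled form: one body block executed N times under loop control.
void scaleRolled(const std::array<int, N>& in, std::array<int, N>& out) {
    for (std::size_t i = 0; i < N; ++i)
        out[i] = in[i] * 2 + 1;
}

// Fully unrolled and merged: a single basic block with no branches.
// The four statements are independent of each other; only the source
// order makes them look sequential.
void scaleMerged(const std::array<int, N>& in, std::array<int, N>& out) {
    out[0] = in[0] * 2 + 1;
    out[1] = in[1] * 2 + 1;
    out[2] = in[2] * 2 + 1;
    out[3] = in[3] * 2 + 1;
}
```

Both functions compute the same result. The difference that matters for
your backend is that the second form has no control flow left to
serialize the iterations.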

Sameer.