<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On May 23, 2020, at 17:15, legend xx <<a href="mailto:legendaryxx7slh@gmail.com" class="">legendaryxx7slh@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="">This is my example (for.c):</div><div class=""><br class=""></div><div class="">#include <stdio.h><br class=""><br class="">int add(int a, int b) {<br class=""> return a + b;<br class="">}<br class=""><br class="">int main() {<br class=""> int a, b, c, d;<br class=""> a = 5;<br class=""> b = 15;<br class=""> c = add(a, b);<br class=""> d = 0;<br class=""> for(int i=0;i<16;i++)<br class=""> d = add(c, d);<br class="">}</div><div class=""><br class=""></div><div class="">I run:</div><div class="">$ clang -O0 -Xclang -disable-O0-optnone -emit-llvm for.c -S -o forO0.ll<br class="">$ opt -O0 -S --loop-unroll --unroll-count=4 -view-cfg forO0.ll -o for-opt00-unroll4.ll</div><div class=""><br class=""></div><div class="">And this is the LLVM IR code that I get: <br class=""></div><div class=""><br class=""></div><div class="">; ModuleID = 'forO0.ll'<br class="">source_filename = "for.c"<br class="">target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"<br class="">target triple = "x86_64-unknown-linux-gnu"<br class=""><br class="">; Function Attrs: noinline nounwind uwtable<br class="">define dso_local i32 @add(i32 %a, i32 %b) #0 {<br class="">entry:<br class=""> %a.addr = alloca i32, align 4<br class=""> %b.addr = alloca i32, align 4<br class=""> store i32 %a, i32* %a.addr, align 4<br class=""> store i32 %b, i32* %b.addr, align 4<br class=""> %0 = load i32, i32* %a.addr, align 4<br class=""> %1 = load i32, i32* %b.addr, align 4<br class=""> %add = add 
nsw i32 %0, %1<br class=""> ret i32 %add<br class="">}<br class=""><br class="">; Function Attrs: noinline nounwind uwtable<br class="">define dso_local i32 @main() #0 {<br class="">entry:<br class=""> %retval = alloca i32, align 4<br class=""> %a = alloca i32, align 4<br class=""> %b = alloca i32, align 4<br class=""> %c = alloca i32, align 4<br class=""> %d = alloca i32, align 4<br class=""> %i = alloca i32, align 4<br class=""> store i32 0, i32* %retval, align 4<br class=""> store i32 5, i32* %a, align 4<br class=""> store i32 15, i32* %b, align 4<br class=""> %0 = load i32, i32* %a, align 4<br class=""> %1 = load i32, i32* %b, align 4<br class=""> %call = call i32 @add(i32 %0, i32 %1)<br class=""> store i32 %call, i32* %c, align 4<br class=""> store i32 0, i32* %d, align 4<br class=""> store i32 0, i32* %i, align 4<br class=""> br label %for.cond<br class=""><br class="">for.cond: ; preds = %for.inc.3, %entry<br class=""> %2 = load i32, i32* %i, align 4<br class=""> %cmp = icmp slt i32 %2, 16<br class=""> br i1 %cmp, label %for.body, label %for.end<br class=""><br class="">for.body: ; preds = %for.cond<br class=""> %3 = load i32, i32* %c, align 4<br class=""> %4 = load i32, i32* %d, align 4<br class=""> %call1 = call i32 @add(i32 %3, i32 %4)<br class=""> store i32 %call1, i32* %d, align 4<br class=""> br label %for.inc<br class=""><br class="">for.inc: ; preds = %for.body<br class=""> %5 = load i32, i32* %i, align 4<br class=""> %inc = add nsw i32 %5, 1<br class=""> store i32 %inc, i32* %i, align 4<br class=""> %6 = load i32, i32* %i, align 4<br class=""> %cmp.1 = icmp slt i32 %6, 16<br class=""> br i1 %cmp.1, label %for.body.1, label %for.end<br class=""><br class="">for.end: ; preds = %for.inc.2, %for.inc.1, %for.inc, %for.cond<br class=""> %7 = load i32, i32* %d, align 4<br class=""> %call2 = call i32 (i8*, ...) 
@printf(i8* getelementptr inbounds ([20 x i8], [20 x i8]* @.str, i64 0, i64 0), i32 %7)<br class=""> %8 = load i32, i32* %retval, align 4<br class=""> ret i32 %8<br class=""><br class="">for.body.1: ; preds = %for.inc<br class=""> %9 = load i32, i32* %c, align 4<br class=""> %10 = load i32, i32* %d, align 4<br class=""> %call1.1 = call i32 @add(i32 %9, i32 %10)<br class=""> store i32 %call1.1, i32* %d, align 4<br class=""> br label %for.inc.1<br class=""><br class="">for.inc.1: ; preds = %for.body.1<br class=""> %11 = load i32, i32* %i, align 4<br class=""> %inc.1 = add nsw i32 %11, 1<br class=""> store i32 %inc.1, i32* %i, align 4<br class=""> %12 = load i32, i32* %i, align 4<br class=""> %cmp.2 = icmp slt i32 %12, 16<br class=""> br i1 %cmp.2, label %for.body.2, label %for.end<br class=""><br class="">for.body.2: ; preds = %for.inc.1<br class=""> %13 = load i32, i32* %c, align 4<br class=""> %14 = load i32, i32* %d, align 4<br class=""> %call1.2 = call i32 @add(i32 %13, i32 %14)<br class=""> store i32 %call1.2, i32* %d, align 4<br class=""> br label %for.inc.2<br class=""><br class="">for.inc.2: ; preds = %for.body.2<br class=""> %15 = load i32, i32* %i, align 4<br class=""> %inc.2 = add nsw i32 %15, 1<br class=""> store i32 %inc.2, i32* %i, align 4<br class=""> %16 = load i32, i32* %i, align 4<br class=""> %cmp.3 = icmp slt i32 %16, 16<br class=""> br i1 %cmp.3, label %for.body.3, label %for.end<br class=""><br class="">for.body.3: ; preds = %for.inc.2<br class=""> %17 = load i32, i32* %c, align 4<br class=""> %18 = load i32, i32* %d, align 4<br class=""> %call1.3 = call i32 @add(i32 %17, i32 %18)<br class=""> store i32 %call1.3, i32* %d, align 4<br class=""> br label %for.inc.3<br class=""><br class="">for.inc.3: ; preds = %for.body.3<br class=""> %19 = load i32, i32* %i, align 4<br class=""> %inc.3 = add nsw i32 %19, 1<br class=""> store i32 %inc.3, i32* %i, align 4<br class=""> br label %for.cond, !llvm.loop !2<br class="">}<br class=""><br class="">declare 
dso_local i32 @printf(i8*, ...) #1<br class=""><br class="">attributes #0 = { noinline nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }<br class="">attributes #1 = { "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }<br class=""><br class="">!llvm.module.flags = !{!0}<br class="">!llvm.ident = !{!1}<br class=""><br class="">!0 = !{i32 1, !"wchar_size", i32 4}<br class="">!1 = !{!"clang version 11.0.0 (<a href="https://github.com/llvm/llvm-project.git" class="">https://github.com/llvm/llvm-project.git</a> a3485301d4870f57590d7b69eed7959134a694ab)"}<br class="">!2 = distinct !{!2, !3}<br class="">!3 = !{!"llvm.loop.unroll.disable"}</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">So my problem is:</div><div class="">With unroll 4 on the loop with 16 bounds I should see one single
block that increments i by 4 (i=i+4), with 4 copies of each
original instruction, and the condition should check whether i<16. This is the code I would intuitively expect. However, the increment that I get is i=i+1 and there are only 4 blocks. <br class="">
</div><div class=""> </div><div class=""><br class=""></div><div class="">Do you know why this happen?</div></div></div></blockquote><br class=""></div><div>I think loop-unroll works as expected in your example, as you can see the copies of the unrolled loop blocks (for.body.X, for.inc.X). The reason this is not simplified to the single block you are expecting is the input for -loop-unroll: -loop-unroll gets the IR without any optimizations (-O0). </div><div><br class=""></div><div>For the expected result, you need to run a few additional passes before -loop-unroll to promote some of the loads/stores to registers and simplify the CFG of the input. Running `opt -mem2reg -simplifycfg -loop-unroll -unroll-count=4 forO0.ll -S` should give you something like</div><div><br class=""></div><div><div>define i32 @main() #0 {</div><div>entry:</div><div> %call = call i32 @add(i32 5, i32 15)</div><div> br label %for.cond</div><div><br class=""></div><div>for.cond: ; preds = %for.body.3, %entry</div><div> %d.0 = phi i32 [ 0, %entry ], [ %call1.3, %for.body.3 ]</div><div> %i.0 = phi i32 [ 0, %entry ], [ %inc.3, %for.body.3 ]</div><div> %cmp = icmp ult i32 %i.0, 16</div><div> br i1 %cmp, label %for.body, label %for.end</div><div><br class=""></div><div>for.body: ; preds = %for.cond</div><div> %call1 = call i32 @add(i32 %call, i32 %d.0)</div><div> %inc = add nuw nsw i32 %i.0, 1</div><div> br label %for.body.1</div><div><br class=""></div><div>for.end: ; preds = %for.cond</div><div> ret i32 0</div><div><br class=""></div><div>for.body.1: ; preds = %for.body</div><div> %call1.1 = call i32 @add(i32 %call, i32 %call1)</div><div> %inc.1 = add nuw nsw i32 %inc, 1</div><div> br label %for.body.2</div><div><br class=""></div><div>for.body.2: ; preds = %for.body.1</div><div> %call1.2 = call i32 @add(i32 %call, i32 %call1.1)</div><div> %inc.2 = add nuw nsw i32 %inc.1, 1</div><div> br label %for.body.3</div><div><br class=""></div><div>for.body.3: ; preds = %for.body.2</div><div> %call1.3 = 
call i32 @add(i32 %call, i32 %call1.2)</div><div> %inc.3 = add nuw nsw i32 %inc.2, 1</div><div> br label %for.cond, !llvm.loop !4</div><div>}</div><div><br class=""></div><div>Note that there are still 4 copies of the body instead of a single one. Like many passes in LLVM, the loop-unroll pass focuses on performing one transformation (duplicating the loop body a number of times) and relies on other passes to clean-up/simplify the result. To fold the 4 copies of the body into a single block, you need another round of CFG simplifications. Running `opt -mem2reg -simplifycfg -loop-unroll -unroll-count=4 -simplifycfg forO0.ll -S` produces the code below, which is what you are looking for IIUC.</div><div><br class=""></div><div><div>define i32 @main() #0 {</div><div>entry:</div><div> %call = call i32 @add(i32 5, i32 15)</div><div> br label %for.cond</div><div><br class=""></div><div>for.cond: ; preds = %for.body, %entry</div><div> %d.0 = phi i32 [ 0, %entry ], [ %call1.3, %for.body ]</div><div> %i.0 = phi i32 [ 0, %entry ], [ %inc.3, %for.body ]</div><div> %cmp = icmp ult i32 %i.0, 16</div><div> br i1 %cmp, label %for.body, label %for.end</div><div><br class=""></div><div>for.body: ; preds = %for.cond</div><div> %call1 = call i32 @add(i32 %call, i32 %d.0)</div><div> %inc = add nuw nsw i32 %i.0, 1</div><div> %call1.1 = call i32 @add(i32 %call, i32 %call1)</div><div> %inc.1 = add nuw nsw i32 %inc, 1</div><div> %call1.2 = call i32 @add(i32 %call, i32 %call1.1)</div><div> %inc.2 = add nuw nsw i32 %inc.1, 1</div><div> %call1.3 = call i32 @add(i32 %call, i32 %call1.2)</div><div> %inc.3 = add nuw nsw i32 %inc.2, 1</div><div> br label %for.cond, !llvm.loop !4</div><div><br class=""></div><div>for.end: ; preds = %for.cond</div><div> ret i32 0</div><div>}</div></div></div><br class=""></body></html>