[llvm-bugs] [Bug 49204] New: missed optimization actually slows code relative to the unoptimized state...

via llvm-bugs llvm-bugs at lists.llvm.org
Tue Feb 16 01:48:15 PST 2021


https://bugs.llvm.org/show_bug.cgi?id=49204

            Bug ID: 49204
            Summary: missed optimization actually slows code relative
                     to the unoptimized state...
           Product: tools
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: llc
          Assignee: unassignedbugs at nondot.org
          Reporter: gordonbgood at gmail.com
                CC: llvm-bugs at lists.llvm.org

Created attachment 24537
  --> https://bugs.llvm.org/attachment.cgi?id=24537&action=edit
The output of `opt` that `llc` fails to optimize correctly...

This is not a bug preventing use, but rather a missed optimization that increases
run time on an actual x86_64 CPU by roughly 33% to 50%.

Note: The following was found on a PC with the x86_64 architecture and is
OS-independent, but it could easily apply to any architecture on any machine
under any OS...

The use case is applying extreme loop unrolling to optimize an extremely tight
loop in an implementation of the Sieve of Eratosthenes.  Without getting into
too many irrelevant details, one case of the unrolled loop is the following
snippet; there are 31 other cases whose only differences are the initialization
of the pointer and integers and the order and sequence of the constants that
are or'ed across an array (the Sieve algorithm).  This code can run on a
modern x86-64 CPU in as little as an average of just over one CPU clock cycle
per composite-number cull (the `or` operation):

```
// pntr, inc, r1, r2, r3, r4, r5, r6, and r7 are initialized previously,
// where pntr is a pointer to byte and the rest are just integers...
while pntr < pntrlmt {
  or 1 with *pntr
  or 2 with *(pntr + r1)
  or 4 with *(pntr + r2)
  or 8 with *(pntr + r3)
  or 16 with *(pntr + r4)
  or 32 with *(pntr + r5)
  or 64 with *(pntr + r6)
  or 128 with *(pntr + r7)
  pntr += inc
}
```

The actual code is generated from GHC Haskell, using TemplateHaskell to
autogenerate all 32 code cases, across a variety of GHC versions and a variety
of LLVM versions (selected via the "-fllvm" GHC compiler flag, which uses the
LLVM back-end), but that would seem to be irrelevant to the problem.

The following is a snippet from the input file (as attached) to the `llc`
compiler, in readable LLVM form (an ".ll" file; call it "test.ll"), as
generated by the `opt` optimizer with the arguments given below.  It is as
expected (the optimization level used doesn't seem to matter much as long as
it is higher than zero):

```
opt -O2 -enable-tbaa -tbaa '-relocation-model=pic' -S TestProgramGen.ll -o Test.ll
```

```
cA2a:                                             ; preds = %czYi, %cA2a
  %lsyE8.052 = phi i64 [ %lnA69, %cA2a ], [ %lnA4y, %czYi ]
  %lnA4F = inttoptr i64 %lsyE8.052 to i8*
  %lnA4G = load i8, i8* %lnA4F, align 1, !tbaa !6
  %0 = or i8 %lnA4G, 1
  store i8 %0, i8* %lnA4F, align 1, !tbaa !6
  %lnA4N = add i64 %lsyE8.052, %lnA3t
  %lnA4R = inttoptr i64 %lnA4N to i8*
  %lnA4S = load i8, i8* %lnA4R, align 1, !tbaa !6
  %1 = or i8 %lnA4S, 2
  store i8 %1, i8* %lnA4R, align 1, !tbaa !6
  %lnA4Z = add i64 %lsyE8.052, %lnA3D
  %lnA53 = inttoptr i64 %lnA4Z to i8*
  %lnA54 = load i8, i8* %lnA53, align 1, !tbaa !6
  %2 = or i8 %lnA54, 4
  store i8 %2, i8* %lnA53, align 1, !tbaa !6
  %lnA5b = add i64 %lsyE8.052, %lnA3N
  %lnA5f = inttoptr i64 %lnA5b to i8*
  %lnA5g = load i8, i8* %lnA5f, align 1, !tbaa !6
  %3 = or i8 %lnA5g, 8
  store i8 %3, i8* %lnA5f, align 1, !tbaa !6
  %lnA5n = add i64 %lsyE8.052, %lnA3X
  %lnA5r = inttoptr i64 %lnA5n to i8*
  %lnA5s = load i8, i8* %lnA5r, align 1, !tbaa !6
  %4 = or i8 %lnA5s, 16
  store i8 %4, i8* %lnA5r, align 1, !tbaa !6
  %lnA5z = add i64 %lsyE8.052, %lnA47
  %lnA5D = inttoptr i64 %lnA5z to i8*
  %lnA5E = load i8, i8* %lnA5D, align 1, !tbaa !6
  %5 = or i8 %lnA5E, 32
  store i8 %5, i8* %lnA5D, align 1, !tbaa !6
  %lnA5L = add i64 %lsyE8.052, %lnA4h
  %lnA5P = inttoptr i64 %lnA5L to i8*
  %lnA5Q = load i8, i8* %lnA5P, align 1, !tbaa !6
  %6 = or i8 %lnA5Q, 64
  store i8 %6, i8* %lnA5P, align 1, !tbaa !6
  %lnA5X = add i64 %lsyE8.052, %lnA4r
  %lnA61 = inttoptr i64 %lnA5X to i8*
  %lnA62 = load i8, i8* %lnA61, align 1, !tbaa !6
  %7 = or i8 %lnA62, -128
  store i8 %7, i8* %lnA61, align 1, !tbaa !6
  %lnA69 = add i64 %lsyE8.052, %R2_Arg
  %lnA4B.not = icmp ult i64 %lnA69, %lnA4w
  br i1 %lnA4B.not, label %cA2a, label %cA2b

cA2b:                                             ; preds = %cA2a, %czYi
```
The compiler `llc` is then run with the following options; again, it doesn't
seem to matter what the optimization level is as long as it is greater than zero:

```
llc -O2 -enable-tbaa '-relocation-model=pic' '-mcpu=x86-64' '-mattr=+sse2,+sse' test.bc -o test.s
```

The following is the appropriate equivalent snippet from the output assembly
file in AT&T format (as per "test.s"):

```
.LBB10_2:                               # %czzW
                                        # =>This Inner Loop Header: Depth=1
        orb     $1, (%rdi,%rax)
        orb     $2, (%rdi,%rcx)
        orb     $4, (%rdi,%rbx)
        orb     $8, (%rdi,%r8)
        orb     $16, (%rdi,%r9)
        orb     $32, (%rdi,%r13)
        orb     $64, (%rdi,%r10)
        orb     $-128, (%rdi,%rsi)
        addq    %r14, %rax
        leaq    (%rdi,%rax), %r15
        addq    %r14, %rcx
        addq    %r14, %rbx
        addq    %r14, %r8
        addq    %r14, %r9
        addq    %r14, %r13
        addq    %r14, %r10
        addq    %r14, %rsi
        cmpq    %r12, %r15
        jb      .LBB10_2
.LBB10_4:                               # %czzX
```

The expected assembly file output is as follows:

```
.LBB10_2:                               # %czzW
                                        # =>This Inner Loop Header: Depth=1
        orb     $1, (%rdi,%rax)
        orb     $2, (%rdi,%rcx)
        orb     $4, (%rdi,%rbx)
        orb     $8, (%rdi,%r8)
        orb     $16, (%rdi,%r9)
        orb     $32, (%rdi,%r13)
        orb     $64, (%rdi,%r10)
        orb     $-128, (%rdi,%rsi)
        addq    %r14, %rdi
        cmpq    %r12, %rdi
        jb      .LBB10_2
.LBB10_4:                               # %czzX
```

The compiler `llc` correctly fused the read/modify/write operations into
single `orb` instructions, which it knew were available on the given
architecture, and that is good; however, instead of leaving the general
structure the same as in the source code, it treated the common pointer
register (%rdi in this case) as a constant and incremented each of the eight
offset registers instead, while also spending yet another register (%r15 in
this case) on a `leaq` just to track the loop progress.  Each iteration
therefore carries eight `addq`s plus a `leaq` of loop overhead where a single
`addq` would suffice.  Although this code is correct as to function, **the
"optimization" is clearly wrong and slower.**
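
To make the difference concrete, here is a C rendering of what the emitted
loop does (a sketch only; every name below is invented for illustration, not
taken from the generated code):

```
#include <stddef.h>
#include <stdint.h>

/* A sketch of what llc actually emitted, expressed back in C: the base
   pointer stays fixed and all eight offsets advance every iteration,
   with an extra temporary (the leaq into %r15) used only for the loop
   test. */
static void cull_as_emitted(uint8_t *base, const uint8_t *limit, size_t inc,
                            size_t i0, size_t i1, size_t i2, size_t i3,
                            size_t i4, size_t i5, size_t i6, size_t i7)
{
    const uint8_t *tmp = base + i0;
    while (tmp < limit) {
        base[i0] |= 1;   base[i1] |= 2;   base[i2] |= 4;   base[i3] |= 8;
        base[i4] |= 16;  base[i5] |= 32;  base[i6] |= 64;  base[i7] |= 128;
        i0 += inc;            /* eight adds per iteration instead of one... */
        i1 += inc; i2 += inc; i3 += inc; i4 += inc;
        i5 += inc; i6 += inc; i7 += inc;
        tmp = base + i0;      /* ...plus the extra leaq for the loop test */
    }
}
```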

When this same algorithm, written in C/C++ (or a C-like language that
generates C/C++), is passed to Clang (which doesn't appear to use `llc`) or to
`gcc`, the expected assembly code is generated and the loop runs about 33%
faster!  That is not an insignificant amount for the task at hand.
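
For reference, here is a minimal stand-alone C version of one unrolled case
(a sketch following the pseudocode above; the function name, signature, and
parameter names are invented) that should reproduce the expected
single-pointer-increment loop when compiled with `clang -O2` or `gcc -O2`:

```
#include <stddef.h>
#include <stdint.h>

/* One unrolled culling case, as in the pseudocode above; only the base
   pointer advances each iteration. */
void cull(uint8_t *pntr, const uint8_t *pntrlmt, size_t inc,
          size_t r1, size_t r2, size_t r3, size_t r4,
          size_t r5, size_t r6, size_t r7)
{
    while (pntr < pntrlmt) {
        *pntr        |= 1;
        *(pntr + r1) |= 2;
        *(pntr + r2) |= 4;
        *(pntr + r3) |= 8;
        *(pntr + r4) |= 16;
        *(pntr + r5) |= 32;
        *(pntr + r6) |= 64;
        *(pntr + r7) |= 128;
        pntr += inc;   /* the single addq in the expected assembly */
    }
}
```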

Tested with both LLVM version 9.0.1.5 and LLVM version 11.0.1, both on
Fedora 33.

-- 
You are receiving this mail because:
You are on the CC list for the bug.