[LLVMdev] Strange behaviour with x86-64 windows, bad call instruction address

Robert Haskett rhaskett at opentext.com
Tue Feb 21 11:12:01 PST 2012


Hi all, me again!

Well, after much hacking of code, thinking, and frustration, I finally figured out what I was doing wrong.  It turns out my initial attempts at using various gflags settings were causing VirtualAlloc to return GIANT addresses.  In particular, the Application Verifier flag ( -vrf ) seems to cause VirtualAlloc to do what looks like top-down allocations, and LLVM then happily starts using the returned addresses without checking whether the next function-stub address goes beyond the Windows 8-terabyte user-mode limit.  And why should it care, really?  So, the lesson here is: DON'T use the Microsoft Application Verifier flag with anything that uses LLVM 3.0, because if you are JIT'ing large amounts of IR, you'll end up with a bad address eventually; in my case, immediately.  Guh.
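In case it helps anyone else diagnose this, here's a rough sketch of the failure mode (the specific numbers are made up, just to show the shape of the problem): a top-down allocation hands back memory just under the 8 TB x64 user-mode limit, and a bump allocator carving out stub space from there walks right off the end.

```python
# Hypothetical sketch of the failure mode; the addresses here are made up.
USER_LIMIT = 1 << 43                     # ~8 TB user-mode address space on x64 Windows
base = USER_LIMIT - 4 * 1024 * 1024      # top-down VirtualAlloc: just under the limit
next_stub = base + 16 * 1024 * 1024      # JIT bumps forward to carve out more stubs
print(hex(next_stub))                    # 0x80000c00000
print(next_stub >= USER_LIMIT)           # True: the next stub lands past the limit
```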

.r.



Date: Tue, 14 Feb 2012 20:31:31 +0000
From: Robert Haskett <rhaskett at opentext.com>
Subject: [LLVMdev] Strange behaviour with x86-64 windows,	bad call
	instruction address
To: "llvmdev at cs.uiuc.edu" <llvmdev at cs.uiuc.edu>
Message-ID:
	<4895D06E53674F498EF4F6406C6436460A517E at otwlxg21.opentext.net>
Content-Type: text/plain; charset="us-ascii"

Hi all,

Some background: I'm working on a project to replace a custom VM with various components of LLVM.  We had everything running just peachy keen until recently, when one of our executables started crashing while attempting to run a JIT'd function.  We have LLVM building and running on 64-bit Windows and Linux, using Visual Studio 2008 on Windows and gcc on Linux.  The LLVM static libs are linked into one of our DLLs, which is in turn linked into several different EXEs; the DLL contains the code that compiles our proprietary language to LLVM IR, then JITs and runs it.  Each EXE calls into this DLL the same way.  The same chunk of IR, when JIT'd in three of the EXEs, runs perfectly, but in the last program it dies at a call instruction into an invalid memory location.  All compiler and linker options are the same for all four EXEs.  The one difference I've seen when debugging the assembly is that the three that work all have JIT function pointer addresses that fit in 32 bits, but the one that fails has a 64-bit address, as shown in the snippet below:

000007FFFFC511D7  pop         rbp
000007FFFFC511D8  ret
000007FFFFC511D9  sub         rsp,20h
000007FFFFC511DD  mov         rcx,qword ptr [rbp-70h]
000007FFFFC511E1  mov         edx,0FFFFFFFEh
000007FFFFC511E6  xor         r8d,r8d
000007FFFFC511E9  call        rsi
000007FFFFC511EB  add         rsp,20h
000007FFFFC511EF  test        al,1
000007FFFFC511F2  je          000007FFFFC511C3
000007FFFFC511F8  sub         rsp,20h
000007FFFFC511FC  mov         rax,7FFFFC30030h
000007FFFFC51206  mov         rcx,rdi
000007FFFFC51209  mov         edx,0FFFFFFFEh
000007FFFFC5120E  xor         r8d,r8d
000007FFFFC51211  call        rax
000007FFFFC51213  add         rsp,20h
000007FFFFC51217  test        al,1
000007FFFFC5121A  je          000007FFFFC511C3
000007FFFFC51220  mov         qword ptr [rbp-68h],rdi
000007FFFFC51224  mov         eax,10h
000007FFFFC51229  call        0000080077B3F1D0
000007FFFFC5122E  sub         rsp,rax
000007FFFFC51231  mov         rdx,rsp
000007FFFFC51234  mov         qword ptr [rbp-0F0h],rdx
000007FFFFC5123B  sub         rsp,20h

The call instruction at 000007FFFFC51229 is the one that jumps into invalid memory at 80077B3F1D0.  I'm not sure why this particular EXE causes LLVM to use such large address values, but it looks like there might be a 32-bit vs. 64-bit address calculation/offset problem when emitting the assembly.
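For what it's worth, the bad target is consistent with a 32-bit relative call whose displacement overflowed.  Taking the addresses from the snippets above, and assuming a 5-byte E8 rel32 call (that encoding is an assumption on my part), truncating the true displacement to 32 bits reproduces the bogus 80077B3F1D0 exactly:

```python
# Reconstruction of the suspected overflow, assuming a 5-byte E8 rel32 call.
call_site = 0x7FFFFC51229            # address of the failing call instruction
next_ip   = call_site + 5            # rel32 displacements are relative to the next ip
target    = 0x77B3F1D0               # intended callee (the stack-probe routine above)

disp = target - next_ip
print(-2**31 <= disp < 2**31)        # False: the true displacement needs > 32 bits

rel32 = disp & 0xFFFFFFFF            # what a 32-bit-only emitter would write anyway
if rel32 >= 2**31:
    rel32 -= 2**32                   # sign-extend, as the CPU does at run time
print(hex(next_ip + rel32))          # 0x80077b3f1d0, the observed bad address
```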

The code that works looks like this:

0000000002931211  call        rax
0000000002931213  add         rsp,20h
0000000002931217  test        al,1
000000000293121A  je          00000000029311C3
0000000002931220  mov         qword ptr [rbp-68h],rdi
0000000002931224  mov         eax,10h
0000000002931229  call        0000000077B3F1D0
000000000293122E  sub         rsp,rax
0000000002931231  mov         rdx,rsp
0000000002931234  mov         qword ptr [rbp-0F0h],rdx
000000000293123B  sub         rsp,20h
000000000293123F  mov         r12,180071AD0h
0000000002931249  mov         ecx,0FFFFFFEEh
000000000293124E  xor         r8d,r8d
0000000002931251  mov         r9,29C02EAh
000000000293125B  call        r12
000000000293125E  add         rsp,20h
0000000002931262  mov         eax,10h
0000000002931267  call        0000000077B3F1D0
000000000293126C  sub         rsp,rax
000000000293126F  mov         rdx,rsp
0000000002931272  mov         qword ptr [rbp-58h],rdx
0000000002931276  sub         rsp,20h
000000000293127A  mov         ecx,0FFFFFFEEh
000000000293127F  xor         r8d,r8d
0000000002931282  mov         r9,29C02EAh
000000000293128C  call        r12
000000000293128F  add         rsp,20h
0000000002931293  mov         eax,10h
0000000002931298  call        0000000077B3F1D0
000000000293129D  sub         rsp,rax
00000000029312A0  mov         rax,rsp

And the code at 77B3F1D0 is this:

0000000077B3F1BE  nop
0000000077B3F1BF  nop
0000000077B3F1C0  int         3
0000000077B3F1C1  int         3
0000000077B3F1C2  int         3
0000000077B3F1C3  int         3
0000000077B3F1C4  int         3
0000000077B3F1C5  int         3
0000000077B3F1C6  nop         word ptr [rax+rax]
0000000077B3F1D0  sub         rsp,10h
0000000077B3F1D4  mov         qword ptr [rsp],r10
0000000077B3F1D8  mov         qword ptr [rsp+8],r11
0000000077B3F1DD  xor         r11,r11
0000000077B3F1E0  lea         r10,[rsp+18h]
0000000077B3F1E5  sub         r10,rax
0000000077B3F1E8  cmovb       r10,r11
0000000077B3F1EC  mov         r11,qword ptr gs:[10h]
0000000077B3F1F5  cmp         r10,r11
0000000077B3F1F8  jae         0000000077B3F210
0000000077B3F1FA  and         r10w,0F000h
0000000077B3F200  lea         r11,[r11-1000h]
0000000077B3F207  mov         byte ptr [r11],0
0000000077B3F20B  cmp         r10,r11
0000000077B3F20E  jne         0000000077B3F200
0000000077B3F210  mov         r10,qword ptr [rsp]
0000000077B3F214  mov         r11,qword ptr [rsp+8]
0000000077B3F219  add         rsp,10h
0000000077B3F21D  ret
0000000077B3F21E  nop
0000000077B3F21F  nop
0000000077B3F220  int         3
0000000077B3F221  int         3

  I searched the bug database for various topics but didn't see anything specific, other than one mention in bug 5201 of 32-bit address truncation.  My dev system is a dual-core Xeon with 16 GB of RAM.  I'm no expert in how LLVM outputs the asm, but I'm not afraid to delve into it to see what's happening.  Has anyone else run into this?  Does anyone have a suggestion for where I might start debugging in the X86 emitter code?  I'm not even sure how to create a test case that forces a large starting address for the JIT.  Any help is greatly appreciated.

Thanks in advance,

.r.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/llvmdev/attachments/20120214/06679ec8/attachment-0001.html 

------------------------------

Message: 3
Date: Tue, 14 Feb 2012 23:51:57 +0100
From: Carl-Philip H?nsch <cphaensch at googlemail.com>
Subject: Re: [LLVMdev] Vectorization: Next Steps
To: Hal Finkel <hfinkel at anl.gov>
Cc: llvmdev at cs.uiuc.edu
Message-ID:
	<CAO_gjAVJBcN==XwfJBJ6UL+=pPxjRkPXHoLNuOBQDVQjUciD0A at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

That works. Thank you.
Will -vectorize become default later?

2012/2/14 Hal Finkel <hfinkel at anl.gov>

> If you run with -vectorize instead of -bb-vectorize it will schedule the
> cleanup passes for you.
>
> -Hal
>
> *Sent from my Verizon Wireless Droid*
>
>
> -----Original message-----
>
> *From: *"Carl-Philip Hänsch" <cphaensch at googlemail.com>*
> To: *Hal Finkel <hfinkel at anl.gov>*
> Cc: *llvmdev at cs.uiuc.edu*
> Sent: *Tue, Feb 14, 2012 16:10:28 GMT+00:00
> *
> Subject: *Re: [LLVMdev] Vectorization: Next Steps
>
> I tested the "restrict" keyword and it works well :)
>
> The generated code is a bunch of shufflevector instructions, but after a
> second -O3 pass, everything looks fine.
> This problem is described in my ML post "passes propose passes" and occurs
> here again. LLVM has so many great passes, but they cannot run again once
> the code has been simplified :(
> Maybe that's one more reason to tell the pass scheduler to redo some
> passes to find all optimizations. The code really simplifies to what I
> expected.
>
> 2012/2/13 Hal Finkel <hfinkel at anl.gov>
>
>> On Mon, 2012-02-13 at 11:11 +0100, Carl-Philip Hänsch wrote:
>> > I will test your suggestion, but I designed the test case to load the
>> > memory directly into <4 x float> registers. So there is absolutely no
>> > permutation and other swizzle or move operations. Maybe the heuristic
>> > should not only count the depth but also the surrounding load/store
>> > operations.
>>
>> I've attached two variants of your file, both which vectorize as you'd
>> expect. The core difference between these and your original file is that
>> I added the 'restrict' keyword so that the compiler can assume that the
>> arrays don't alias (or, in the first case, I made them globals). You
>> also probably need to specify some alignment information, otherwise the
>> memory operations will be scalarized in codegen.
>>
>>  -Hal
>>
>> >
>> > Are the load/store operations vectorized, too? (I designed the test
>> > case to completely fit the SSE registers)
>> >
>> > 2012/2/10 Hal Finkel <hfinkel at anl.gov>
>> >         Carl-Philip,
>> >
>> >         The reason that this does not vectorize is that it cannot
>> >         vectorize the
>> >         stores; this leaves only the mul-add chains (and some chains
>> >         with
>> >         loads), and they only have a depth of 2 (the threshold is 6).
>> >
>> >         If you give clang -mllvm -bb-vectorize-req-chain-depth=2 then
>> >         it will
>> >         vectorize. The reason the heuristic has such a large default
>> >         value is to
>> >         prevent cases where it costs more to permute all of the
>> >         necessary values
>> >         into and out of the vector registers than is saved by
>> >         vectorizing. Does
>> >         the code generated with -bb-vectorize-req-chain-depth=2 run
>> >         faster than
>> >         the unvectorized code?
>> >
>> >         The heuristic can certainly be improved, and these kinds of
>> >         test cases
>> >         are very important to that improvement process.
>> >
>> >          -Hal
>> >
>> >         On Thu, 2012-02-09 at 13:27 +0100, Carl-Philip Hänsch wrote:
>> >         > I have a super-simple test case 4x4 matrix * 4-vector which
>> >         gets
>> >         > correctly unrolled, but is not vectorized by -bb-vectorize.
>> >         (I used
>> >         > llvm 3.1svn)
>> >         > I attached the test case so you can see what is going wrong
>> >         there.
>> >         >
>> >         > 2012/2/3 Hal Finkel <hfinkel at anl.gov>
>> >         >         As some of you may know, I committed my basic-block
>> >         >         autovectorization
>> >         >         pass a few days ago. I encourage anyone interested
>> >         to try it
>> >         >         out (pass
>> >         >         -vectorize to opt or -mllvm -vectorize to clang) and
>> >         provide
>> >         >         feedback.
>> >         >         Especially in combination with
>> >         -unroll-allow-partial, I have
>> >         >         observed
>> >         >         some significant benchmark speedups, but, I have
>> >         also observed
>> >         >         some
>> >         >         significant slowdowns. I would like to share my
>> >         thoughts, and
>> >         >         hopefully
>> >         >         get feedback, on next steps.
>> >         >
>> >         >         1. "Target Data" for vectorization - I think that in
>> >         order to
>> >         >         improve
>> >         >         the vectorization quality, the vectorizer will need
>> >         more
>> >         >         information
>> >         >         about the target. This information could be provided
>> >         in the
>> >         >         form of a
>> >         >         kind of extended target data. This extended target
>> >         data might
>> >         >         contain:
>> >         >          - What basic types can be vectorized, and how many
>> >         of them
>> >         >         will fit
>> >         >         into (the largest) vector registers
>> >         >          - What classes of operations can be vectorized
>> >         (division,
>> >         >         conversions /
>> >         >         sign extension, etc. are not always supported)
>> >         >          - What alignment is necessary for loads and stores
>> >         >          - Is scalar-to-vector free?
>> >         >
>> >         >         2. Feedback between passes - We may want to implement a
>> >         closer
>> >         >         coupling
>> >         >         between optimization passes than currently exists.
>> >         >         Specifically, I have
>> >         >         in mind two things:
>> >         >          - The vectorizer should communicate more closely
>> >         with the
>> >         >         loop
>> >         >         unroller. First, the loop unroller should try to
>> >         unroll to
>> >         >         preserve
>> >         >         maximal load/store alignments. Second, I think it
>> >         would make a
>> >         >         lot of
>> >         >         sense to be able to unroll and, only if this helps
>> >         >         vectorization should
>> >         >         the unrolled version be kept in preference to the
>> >         original.
>> >         >         With basic
>> >         >         block vectorization, it is often necessary to
>> >         (partially)
>> >         >         unroll in
>> >         >         order to vectorize. Even when we also have real loop
>> >         >         vectorization,
>> >         >         however, I still think that it will be important for
>> >         the loop
>> >         >         unroller
>> >         >         to communicate with the vectorizer.
>> >         >          - After vectorization, it would make sense for the
>> >         >         vectorization pass
>> >         >         to request further simplification, but only on those
>> >         parts of
>> >         >         the code
>> >         >         that it modified.
>> >         >
>> >         >         3. Loop vectorization - It would be nice to have, in
>> >         addition
>> >         >         to
>> >         >         basic-block vectorization, a more-traditional loop
>> >         >         vectorization pass. I
>> >         >         think that we'll need a better loop analysis pass in
>> >         order for
>> >         >         this to
>> >         >         happen. Some of this was started in
>> >         LoopDependenceAnalysis,
>> >         >         but that
>> >         >         pass is not yet finished. We'll need something like
>> >         this to
>> >         >         recognize
>> >         >         affine memory references, etc.
>> >         >
>> >         >         I look forward to hearing everyone's thoughts.
>> >         >
>> >         >          -Hal
>> >         >
>> >         >         --
>> >         >         Hal Finkel
>> >         >         Postdoctoral Appointee
>> >         >         Leadership Computing Facility
>> >         >         Argonne National Laboratory
>> >         >
>> >         >         _______________________________________________
>> >         >         LLVM Developers mailing list
>> >         >         LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> >         >         http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> >         >
>> >
>> >         --
>> >         Hal Finkel
>> >         Postdoctoral Appointee
>> >         Leadership Computing Facility
>> >         Argonne National Laboratory
>> >
>> >
>> >
>>
>> --
>> Hal Finkel
>> Postdoctoral Appointee
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/llvmdev/attachments/20120214/29e445a0/attachment-0001.html 

------------------------------

Message: 4
Date: Tue, 14 Feb 2012 17:12:54 -0800
From: Welson Sun <welson.sun at gmail.com>
Subject: [LLVMdev] Wrong AliasAnalysis::getModRefInfo result
To: llvmdev at cs.uiuc.edu
Message-ID:
	<CAD3rk=0yOD693BPTGVGwTuWaSa8ckuZ=dg88RmQ3FFp_vw_m3A at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

I just want to test out LLVM's AliasAnalysis::getModRefInfo API. The
input C code is very simple:

void foo(int *a, int *b)
{
  for(int i=0; i<10; i++)
    b[i] = a[i]*a[i];
}

int main()
{
  int a[10];
  int b[10];

  for(int i=0; i<10; i++)
    a[i] = i;

  foo(a,b);

  return 0;
}

Obviously, for "foo", it only reads from array "a" and only writes to array
"b".

The LLVM pass:
    virtual bool runOnFunction(Function &F) {
      ++HelloCounter;
      errs() << "Hello: ";
      errs().write_escaped(F.getName()) << '\n';

      AliasAnalysis &AA = getAnalysis<AliasAnalysis>();
      for (inst_iterator I = inst_begin(F), E = inst_end(F); I != E; ++I) {
        Instruction *Inst = &*I;
        if ( CallInst *ci = dyn_cast<CallInst>(Inst) ){
          ci->dump();
          for(int i = 0; i < ci->getNumArgOperands(); i++){
            Value *v = ci->getArgOperand(i);
            if (GetElementPtrInst *vi = dyn_cast<GetElementPtrInst>(v)){
              Value *vPtr = vi->getPointerOperand();
              vPtr->dump();
              if ( AllocaInst *allo = dyn_cast<AllocaInst>(vPtr) ) {
                const Type *t = allo->getAllocatedType();
                if ( const ArrayType *at = dyn_cast<ArrayType>(t) ) {
                  int64_t size = at->getNumElements() *
at->getElementType()->getPrimitiveSizeInBits() / 8;
                  ImmutableCallSite cs(ci);
                  AliasAnalysis::Location loc(v, size);
                  errs() << AA.getModRefInfo(ci, loc) << "\n";
                }
              }
            }
          }
        }
      }

      return false;
    }


However, the result is "3" for both a and b, which means both read and write.
What's the problem? I am not quite sure I'm constructing the
AliasAnalysis::Location correctly: what exactly are "address-units" for the size
of the location? And did I get the starting address of the Location right?
I tried v, vi, and vPtr; same result.


Any insight helps,
Welson
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/llvmdev/attachments/20120214/fab3ba30/attachment-0001.html 

------------------------------

Message: 5
Date: Tue, 14 Feb 2012 18:33:34 -0800
From: Lang Hames <lhames at gmail.com>
Subject: Re: [LLVMdev] [llvm-commits] [PATCH] MachineRegisterInfo:
	Don't emit the same livein copy more than once
To: Tom Stellard <thomas.stellard at amd.com>
Cc: "Stellard, Thomas" <Tom.Stellard at amd.com>,	"llvmdev at cs.uiuc.edu"
	<llvmdev at cs.uiuc.edu>
Message-ID:
	<CALLttgr8Bxj_Ttbs=ez_qRCuS_xNwWjcrsu4SEbcsT4OyQf7nQ at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi Tom,

As far as I can tell EmitLiveInCopies is just there to handle physreg
arguments and return values. Is there any reason for these to change late
in your backend?

- Lang.


On Tue, Feb 14, 2012 at 7:22 AM, Tom Stellard <thomas.stellard at amd.com>wrote:

> On Mon, Feb 13, 2012 at 10:17:11PM -0800, Lang Hames wrote:
> > Hi Tom,
> >
> > I'm pretty sure this function should only ever be called once, by
> > SelectionDAG. Do you know where the second call is coming from in your
> code?
> >
> > Cheers,
> > Lang.
>
> Hi Lang,
>
> I was calling EmitLiveInCopies() from one of my backend-specific passes.
> If the function can only be called once, then I'll just try to merge
> that pass into the SelectionDAG.
>
> Thanks,
> Tom
>
> >
> > On Mon, Feb 13, 2012 at 7:03 PM, Stellard, Thomas <Tom.Stellard at amd.com
> >wrote:
> >
> > > This patch seems to have been lost on the llvm-commits mailing list.
> > >  Would someone be able to review it?
> > >
> > > Thanks,
> > > Tom
> > > ________________________________________
> > > From: llvm-commits-bounces at cs.uiuc.edu [
> llvm-commits-bounces at cs.uiuc.edu]
> > > on behalf of Tom Stellard [thomas.stellard at amd.com]
> > > Sent: Friday, February 03, 2012 1:55 PM
> > > To: llvm-commits at cs.uiuc.edu
> > > Subject: Re: [llvm-commits] [PATCH] MachineRegisterInfo: Don't emit the
> > > same livein copy more than once
> > >
> > > On Fri, Jan 27, 2012 at 02:56:03PM -0500, Tom Stellard wrote:
> > > > ---
> > > >
> > > > Is MachineRegisterInfo::EmitLiveInCopies() only meant to be called
> once
> > > > per compile?  If I call it more than once, it emits duplicate copies
> > > > which causes the live interval analysis to fail.
> > > >
> > > >  lib/CodeGen/MachineRegisterInfo.cpp |    4 +++-
> > > >  1 files changed, 3 insertions(+), 1 deletions(-)
> > > >
> > > > diff --git a/lib/CodeGen/MachineRegisterInfo.cpp
> > > b/lib/CodeGen/MachineRegisterInfo.cpp
> > > > index 266ebf6..fc787f2 100644
> > > > --- a/lib/CodeGen/MachineRegisterInfo.cpp
> > > > +++ b/lib/CodeGen/MachineRegisterInfo.cpp
> > > > @@ -227,7 +227,9 @@
> > > MachineRegisterInfo::EmitLiveInCopies(MachineBasicBlock *EntryMBB,
> > > >          // complicated by the debug info code for arguments.
> > > >          LiveIns.erase(LiveIns.begin() + i);
> > > >          --i; --e;
> > > > -      } else {
> > > > +        //Make sure we don't emit the same livein copies twice, in
> case
> > > this
> > > > +        //function is called more than once.
> > > > +      } else if (def_empty(LiveIns[i].second)) {
> > > >          // Emit a copy.
> > > >          BuildMI(*EntryMBB, EntryMBB->begin(), DebugLoc(),
> > > >                  TII.get(TargetOpcode::COPY), LiveIns[i].second)
> > > > --
> > > > 1.7.6.4
> > > >
> > > >
> > >
> > > Reposting this as a diff that can be applied via patch -P0 for SVN
> > > users.
> > >
> > > -Tom
> > >
> > >
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.cs.uiuc.edu/pipermail/llvmdev/attachments/20120214/6eba3eed/attachment-0001.html 

------------------------------

Message: 6
Date: Tue, 14 Feb 2012 22:23:11 -0600
From: Hal Finkel <hfinkel at anl.gov>
Subject: Re: [LLVMdev] Vectorization: Next Steps
To: Carl-Philip Hänsch <cphaensch at googlemail.com>
Cc: llvmdev at cs.uiuc.edu
Message-ID: <1329279791.2835.4.camel at sapling2>
Content-Type: text/plain; charset="UTF-8"

On Tue, 2012-02-14 at 23:51 +0100, Carl-Philip Hänsch wrote:
> That works. Thank you.
> Will -vectorize become default later?

I don't know, but I think there is a lot of improvement to be made first.

 -Hal

> 
> 2012/2/14 Hal Finkel <hfinkel at anl.gov>
>         If you run with -vectorize instead of -bb-vectorize it will
>         schedule the cleanup passes for you.
>         
>         -Hal
>         
>         Sent from my Verizon Wireless Droid
>         
>         
>         -----Original message-----
>                 From: "Carl-Philip Hänsch" <cphaensch at googlemail.com>
>                 To: Hal Finkel <hfinkel at anl.gov>
>                 Cc: llvmdev at cs.uiuc.edu
>                 Sent: Tue, Feb 14, 2012 16:10:28 GMT+00:00
>                 
>                 Subject: Re: [LLVMdev] Vectorization: Next Steps
>                 
>                 
>                 I tested the "restrict" keyword and it works well :)
>                 
>                 The generated code is a bunch of shufflevector
>                 instructions, but after a second -O3 pass, everything
>                 looks fine.
>                 This problem is described in my ML post "passes
>                 propose passes" and occurs here again. LLVM has so
>                 many great passes, but they cannot run again once
>                 the code has been simplified :(
>                 Maybe that's one more reason to tell the pass
>                 scheduler to redo some passes to find all
>                 optimizations. The code really simplifies to what I
>                 expected.
>                 
>                 2012/2/13 Hal Finkel <hfinkel at anl.gov>
>                         On Mon, 2012-02-13 at 11:11 +0100, Carl-Philip
>                         Hänsch wrote:
>                         > I will test your suggestion, but I designed
>                         the test case to load the
>                         > memory directly into <4 x float> registers.
>                         So there is absolutely no
>                         > permutation and other swizzle or move
>                         operations. Maybe the heuristic
>                         > should not only count the depth but also the
>                         surrounding load/store
>                         > operations.
>                         
>                         
>                         I've attached two variants of your file, both
>                         which vectorize as you'd
>                         expect. The core difference between these and
>                         your original file is that
>                         I added the 'restrict' keyword so that the
>                         compiler can assume that the
>                         arrays don't alias (or, in the first case, I
>                         made them globals). You
>                         also probably need to specify some alignment
>                         information, otherwise the
>                         memory operations will be scalarized in
>                         codegen.
>                         
>                          -Hal
>                         
>                         >
>                         > Are the load/store operations vectorized,
>                         too? (I designed the test
>                         > case to completely fit the SSE registers)
>                         >
>                         > 2012/2/10 Hal Finkel <hfinkel at anl.gov>
>                         >         Carl-Philip,
>                         >
>                         >         The reason that this does not
>                         vectorize is that it cannot
>                         >         vectorize the
>                         >         stores; this leaves only the mul-add
>                         chains (and some chains
>                         >         with
>                         >         loads), and they only have a depth
>                         of 2 (the threshold is 6).
>                         >
>                         >         If you give clang -mllvm
>                         -bb-vectorize-req-chain-depth=2 then
>                         >         it will
>                         >         vectorize. The reason the heuristic
>                         has such a large default
>                         >         value is to
>                         >         prevent cases where it costs more to
>                         permute all of the
>                         >         necessary values
>                         >         into and out of the vector registers
>                         than is saved by
>                         >         vectorizing. Does
>                         >         the code generated with
>                         -bb-vectorize-req-chain-depth=2 run
>                         >         faster than
>                         >         the unvectorized code?
>                         >
>                         >         The heuristic can certainly be
>                         improved, and these kinds of
>                         >         test cases
>                         >         are very important to that
>                         improvement process.
>                         >
>                         >          -Hal
>                         >
>                         >         On Thu, 2012-02-09 at 13:27 +0100,
>                         Carl-Philip Hänsch wrote:
>                         >         > I have a super-simple test case
>                         4x4 matrix * 4-vector which
>                         >         gets
>                         >         > correctly unrolled, but is not
>                         vectorized by -bb-vectorize.
>                         >         (I used
>                         >         > llvm 3.1svn)
>                         >         > I attached the test case so you
>                         can see what is going wrong
>                         >         there.
>                         >         >
>                         >         > 2012/2/3 Hal Finkel
>                         <hfinkel at anl.gov>
>                         >         >         As some of you may know, I
>                         committed my basic-block
>                         >         >         autovectorization
>                         >         >         pass a few days ago. I
>                         encourage anyone interested
>                         >         to try it
>                         >         >         out (pass
>                         >         >         -vectorize to opt or
>                         -mllvm -vectorize to clang) and
>                         >         provide
>                         >         >         feedback.
>                         >         >         Especially in combination
>                         with
>                         >         -unroll-allow-partial, I have
>                         >         >         observed
>                         >         >         some significant benchmark
>                         speedups, but, I have
>                         >         also observed
>                         >         >         some
>                         >         >         significant slowdowns. I
>                         would like to share my
>                         >         thoughts, and
>                         >         >         hopefully
>                         >         >         get feedback, on next
>                         steps.

1. "Target Data" for vectorization - I think that in order to improve
the vectorization quality, the vectorizer will need more information
about the target. This information could be provided in the form of a
kind of extended target data. This extended target data might contain:
 - What basic types can be vectorized, and how many of them will fit
   into (the largest) vector registers
 - What classes of operations can be vectorized (division,
   conversions / sign extension, etc. are not always supported)
 - What alignment is necessary for loads and stores
 - Is scalar-to-vector free?

2. Feedback between passes - We may want to implement a closer coupling
between optimization passes than currently exists. Specifically, I have
in mind two things:
 - The vectorizer should communicate more closely with the loop
   unroller. First, the loop unroller should try to unroll so as to
   preserve maximal load/store alignments. Second, I think it would
   make a lot of sense to be able to unroll and, only if this helps
   vectorization, keep the unrolled version in preference to the
   original. With basic-block vectorization, it is often necessary to
   (partially) unroll in order to vectorize. Even when we also have
   real loop vectorization, however, I still think that it will be
   important for the loop unroller to communicate with the vectorizer.
 - After vectorization, it would make sense for the vectorization pass
   to request further simplification, but only on those parts of the
   code that it modified.

3. Loop vectorization - It would be nice to have, in addition to
basic-block vectorization, a more traditional loop vectorization pass.
I think that we'll need a better loop analysis pass in order for this
to happen. Some of this was started in LoopDependenceAnalysis, but that
pass is not yet finished. We'll need something like this to recognize
affine memory references, etc.

I look forward to hearing everyone's thoughts.

 -Hal

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
1-630-252-0023
hfinkel at anl.gov



------------------------------



End of LLVMdev Digest, Vol 92, Issue 30
***************************************