[LLVMdev] LLVMdev Digest, Vol 120, Issue 71

david moloney david.moloney at movidius.com
Thu Jun 26 11:23:51 PDT 2014


Thank you very much, Roger.

Regards,

-David


-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
Behalf Of llvmdev-request at cs.uiuc.edu
Sent: 26 June 2014 18:43
To: llvmdev at cs.uiuc.edu
Subject: LLVMdev Digest, Vol 120, Issue 71

Send LLVMdev mailing list submissions to
	llvmdev at cs.uiuc.edu

To subscribe or unsubscribe via the World Wide Web, visit
	http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
or, via email, send a message with subject or body 'help' to
	llvmdev-request at cs.uiuc.edu

You can reach the person managing the list at
	llvmdev-owner at cs.uiuc.edu

When replying, please edit your Subject line so it is more specific than
"Re: Contents of LLVMdev digest..."


Today's Topics:

   1. Re: Phabricator and private reviews (John Criswell)
   2. Re: Python version requirement for LLVM (Eli Bendersky)
   3. Re: eraseFromParent and stack dump (John Criswell)
   4. Re: eraseFromParent and stack dump (Will Dietz)
   5. -gcolumn-info and PR 14106 (Diego Novillo)
   6. Re: Contributing the Apple ARM64 compiler backend (Sanjay Patel)
   7. Re: Contributing the Apple ARM64 compiler backend (James Molloy)
   8. Re: Contributing the Apple ARM64 compiler backend (Sanjay Patel)


----------------------------------------------------------------------

Message: 1
Date: Thu, 26 Jun 2014 10:09:47 -0500
From: John Criswell <criswell at illinois.edu>
To: Manuel Klimek <klimek at google.com>
Cc: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
Subject: Re: [LLVMdev] Phabricator and private reviews
Message-ID: <53AC37BB.70102 at illinois.edu>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

On 6/26/14, 4:40 AM, Manuel Klimek wrote:
> On Thu, Jun 26, 2014 at 12:30 AM, John Criswell <criswell at illinois.edu 
> <mailto:criswell at illinois.edu>> wrote:
>
>     On 6/25/14, 5:15 PM, Vadim Chugunov wrote:
>>     In a recent review via Phabricator, I was receiving bounce
>>     notifications for mail being sent to llvm-commits because of "Too
>>     many recipients to the message", even though I am a subscriber. 
>>     I wonder how common that is.
>
>     Someone else emailed about that to me earlier today.
>
>     The current limit is set at 10 for llvm-commits.  It sounds like
>     that is too low.
>
>
> Wait, is that set on llvm-commits, or is this related to phab?

To clarify, this is a Mailman setting for the llvm-commits list.  As far as
I know, it was the default setting, which I've just increased to 20.

Regards,

John Criswell

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.cs.uiuc.edu/pipermail/llvmdev/attachments/20140626/f324bab2/attachment-0001.html>

------------------------------

Message: 2
Date: Thu, 26 Jun 2014 08:22:42 -0700
From: Eli Bendersky <eliben at google.com>
To: Gregory Szorc <gregory.szorc at gmail.com>
Cc: "llvmdev at cs.uiuc.edu Mailing List" <llvmdev at cs.uiuc.edu>
Subject: Re: [LLVMdev] Python version requirement for LLVM
Message-ID:
	<CACLQwhHGtjMRpa=ft_CNFKu-qep81giL=_X1xJ3psg5Hy8B+BA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

On Wed, Jun 25, 2014 at 5:52 PM, Gregory Szorc <gregory.szorc at gmail.com>
wrote:

> As much as I like killing support for Python 2.6 and below, RHEL is 
> usually the blocker. They still have 2.4 under support. Only the RHEL 
> that was released a few weeks ago finally has 2.7.
>

Given the amount of complexity required to build LLVM & Clang (having the
right compiler & libstdc++ installed), compared to the 3 minutes it
typically takes to install any Python version on any Linux box, these
limitations always strike me as silly. But I gave up on this fight some time
ago.

Eli



>
> On Jun 25, 2014, at 17:11, Alexander Kornienko <alexfh at google.com> wrote:
>
> http://llvm.org/docs/GettingStarted.html currently mentions Python 2.5 
> as a minimum required version. I'd like to use argparse 
> <https://docs.python.org/dev/library/argparse.html> in a script and be 
> able to test this script. This requires Python 2.7. This version has 
> been around since 2010, and afaiu, is available on all modern 
> platforms. Is there any reason not to change the minimum required version
> of Python to 2.7?
>
> --
> Regards,
> Alexander Kornienko
>
> _______________________________________________
>
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.cs.uiuc.edu/pipermail/llvmdev/attachments/20140626/131caf66/attachment-0001.html>

------------------------------

Message: 3
Date: Thu, 26 Jun 2014 10:29:39 -0500
From: John Criswell <criswell at illinois.edu>
To: Vasileios Koutsoumpos <bill_koutsoumpos at hotmail.com>, LLVM Dev
	<llvmdev at cs.uiuc.edu>
Subject: Re: [LLVMdev] eraseFromParent and stack dump
Message-ID: <53AC3C63.6030500 at illinois.edu>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed

On 6/26/14, 4:38 AM, Vasileios Koutsoumpos wrote:
> Hello,
>
> I am creating a new instruction and I want to replace the use of a 
> specified instruction.
> This is the code I have written
>
> Instruction *new_instr = BinaryOperator::Create(Instruction::Sub, op1, 
> op2, "");

As a matter of style, I'd use the version of BinaryOperator::Create() that
takes an Instruction * as an insert point and use it to insert the new
instruction *before* the old instruction.  It shouldn't matter whether it
goes before or after, and it avoids having to do a separate operation to
insert the BinaryOperator into the basic block.

> b->getInstList().insertAfter(old_instr, new_instr);  //b is the
> BasicBlock
> old_instr->replaceAllUsesWith(new_instr);
> old_instr->eraseFromParent();
>
> When I print the basic block, I see that my instruction was inserted 
> and replaces the old instruction, however I get a stack dump, when I 
> run the instrumented IR file.
>
> 0  opt             0x00000000014680df 
> llvm::sys::PrintStackTrace(_IO_FILE*) + 38
> 1  opt             0x000000000146835c
> 2  opt             0x0000000001467dd8
> 3  libpthread.so.0 0x00007f8934eecbb0
> 4  llfi-passes.so  0x00007f89340a39b2 llvm::Value::getValueID() const
> + 12
> 5  llfi-passes.so  0x00007f89340a3aa4 llvm::Instruction::getOpcode() 
> const + 24
> 6  llfi-passes.so  0x00007f89340a331c 
> llfi::FaultInjectionPass::locate_instruction(llvm::Module&,
> llvm::Function*, llvm::LoopInfo&, int) + 282
> 7  llfi-passes.so  0x00007f89340a2e50 
> llfi::FaultInjectionPass::finalize(llvm::Module&, int) + 506
> 8  llfi-passes.so  0x00007f89340a3509
> llfi::FaultInjectionPass::runOnModule(llvm::Module&) + 113
> 9  opt             0x0000000001362a1e 
> llvm::MPPassManager::runOnModule(llvm::Module&) + 502
> 10 opt             0x0000000001362fee 
> llvm::PassManagerImpl::run(llvm::Module&) + 244
> 11 opt             0x00000000013631f9 
> llvm::PassManager::run(llvm::Module&) + 39
> 12 opt             0x00000000008719e8 main + 5698
> 13 libc.so.6       0x00007f8934318de5 __libc_start_main + 245
> 14 opt             0x0000000000863779
> Stack dump:
>
> I replace the uses of the old instruction, before deleting it. I am 
> not sure why this is happening.
> any suggestions?
I'm assuming that FaultInjectionPass is your code, correct?

If so, if you look at your stack dump, you'll see that it crashes calling
Instruction::getOpcode() in the method locate_instruction() in your pass's
finalize() method.

I recommend finding the line in locate_instruction that is causing the
problem.  If it's one of the 4 lines you list at the top of this email,
check that the old instruction pointer is non-NULL and points to an actual
instruction.

You may also want to check and see if the old instruction is a phi-node.
That might cause issues.

Finally, if you're building a FaultInjector pass, we built a very simple one
for SAFECode that tries to create memory safety errors. You may or may not
find it helpful for what you're doing.

Regards,

John Criswell

>
> Vasileios
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev



------------------------------

Message: 4
Date: Thu, 26 Jun 2014 10:54:26 -0500
From: Will Dietz <willdtz at gmail.com>
To: John Criswell <criswell at illinois.edu>
Cc: LLVM Dev <llvmdev at cs.uiuc.edu>
Subject: Re: [LLVMdev] eraseFromParent and stack dump
Message-ID:
	<CAKGWAO8v2prgZDQm6Z=1vN6TXW3w3EBuJCgTbCO5c7puguL0WA at mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

Indeed, it'd be helpful to know the contents of the function
locate_instruction().

Also, a quick suggestion: ensure the FaultInjectionPass doesn't have any
data structures that are invalidated by your IR modifications; in particular,
look for collections of llvm::Instruction* (or llvm::Value*) that might be
used in later code to reference instructions you've erased.  These data
structures are commonly used for worklists and for mapping from LLVM
constructs to analysis information.

Otherwise a bit more detail on your code and workflow (you say you get a
crash when you run the IR--but the stack dump is from an LLVM pass--is this
part of some interpreter?) would likely help us sort out your issue :).

Hope this helps,

~Will

On Thu, Jun 26, 2014 at 10:29 AM, John Criswell <criswell at illinois.edu>
wrote:
> On 6/26/14, 4:38 AM, Vasileios Koutsoumpos wrote:
>>
>> Hello,
>>
>> I am creating a new instruction and I want to replace the use of a 
>> specified instruction.
>> This is the code I have written
>>
>> Instruction *new_instr = BinaryOperator::Create(Instruction::Sub, 
>> op1, op2, "");
>
>
> As a matter of style, I'd use the version of BinaryOperator::Create() 
> that takes an Instruction * as an insert point and use it to insert 
> the new instruction *before* the old instruction.  It shouldn't matter 
> whether it goes before or after, and it avoids having to do a separate 
> operation to insert the BinaryOperator into the basic block.
>
>
>> b->getInstList().insertAfter(old_instr, new_instr);  //b is the
>> BasicBlock
>> old_instr->replaceAllUsesWith(new_instr);
>> old_instr->eraseFromParent();
>>
>> When I print the basic block, I see that my instruction was inserted  
>> and replaces the old instruction, however I get a stack dump, when I 
>> run the instrumented IR file.
>>
>> 0  opt             0x00000000014680df
>> llvm::sys::PrintStackTrace(_IO_FILE*) + 38
>> 1  opt             0x000000000146835c
>> 2  opt             0x0000000001467dd8
>> 3  libpthread.so.0 0x00007f8934eecbb0
>> 4  llfi-passes.so  0x00007f89340a39b2 llvm::Value::getValueID() const 
>> + 12
>> 5  llfi-passes.so  0x00007f89340a3aa4 llvm::Instruction::getOpcode() 
>> const
>> + 24
>> 6  llfi-passes.so  0x00007f89340a331c 
>> llfi::FaultInjectionPass::locate_instruction(llvm::Module&, 
>> llvm::Function*, llvm::LoopInfo&, int) + 282
>> 7  llfi-passes.so  0x00007f89340a2e50 
>> llfi::FaultInjectionPass::finalize(llvm::Module&, int) + 506
>> 8  llfi-passes.so  0x00007f89340a3509
>> llfi::FaultInjectionPass::runOnModule(llvm::Module&) + 113
>> 9  opt             0x0000000001362a1e
>> llvm::MPPassManager::runOnModule(llvm::Module&) + 502
>> 10 opt             0x0000000001362fee
>> llvm::PassManagerImpl::run(llvm::Module&) + 244
>> 11 opt             0x00000000013631f9
>> llvm::PassManager::run(llvm::Module&) + 39
>> 12 opt             0x00000000008719e8 main + 5698
>> 13 libc.so.6       0x00007f8934318de5 __libc_start_main + 245
>> 14 opt             0x0000000000863779
>> Stack dump:
>>
>> I replace the uses of the old instruction, before deleting it. I am 
>> not sure why this is happening.
>> any suggestions?
>
> I'm assuming that FaultInjectionPass is your code, correct?
>
> If so, if you look at your stack dump, you'll see that it crashes 
> calling
> Instruction::getOpcode() in the method locate_instruction() in your
> pass's finalize() method.
>
> I recommend finding the line in locate_instruction that is causing the 
> problem.  If it's one of the 4 lines you list at the top of this 
> email, check that the old instruction pointer is non-NULL and points 
> to an actual instruction.
>
> You may also want to check and see if the old instruction is a phi-node.
> That might cause issues.
>
> Finally, if you're building a FaultInjector pass, we built a very 
> simple one for SAFECode that tries to create memory safety errors. You 
> may or may not find it helpful for what you're doing.
>
> Regards,
>
> John Criswell
>
>
>>
>> Vasileios
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev


------------------------------

Message: 5
Date: Thu, 26 Jun 2014 12:46:12 -0400
From: Diego Novillo <dnovillo at google.com>
To: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>,	Eric
	Christopher <echristo at gmail.com>,	David Blaikie
<dblaikie at gmail.com>,
	octoploid at yandex.com,	craig.topper at gmail.com, Chris Lattner
	<clattner at apple.com>
Subject: [LLVMdev] -gcolumn-info and PR 14106
Message-ID:
	<CAD_=9DSNqoaZw=4p2rR7-3iXASfEHxtxsdsnTJz=Ke7spkJzpg at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

For -Rpass, and other related uses, I am looking at enabling column info by
default. David pointed me at PR 14106, which seems to be the original
motivation for introducing -gcolumn-info. However, I am finding no
differences when using it on this test.  I've tried building with/without
-gcolumn-info and found almost no difference in compile time (+0.4%):

$ /usr/bin/time clang -w -fno-builtin -O2 -g -gcolumn-info test-tgmath2.i
474.38user 2.10system 7:58.00elapsed 99%CPU

$ /usr/bin/time clang -w -fno-builtin -O2 -g test-tgmath2.i
472.63user 2.02system 7:56.11elapsed 99%CPU

I'm running clang from trunk @211693.

The total size of the debug sections (according to readelf) is:

- with -g -gcolumn-info: 836,177 bytes
- with -g: 826,552 bytes

That's a growth of about 1% in debug info size.

These numbers are in line with a comparative build I did of our internal
codebase. The build included a massive number of C and C++ files. For C
files, total file size grows by 1% on average. For C++ files the average
growth is around 0.2%. Build times are unchanged as well.

Does anyone remember any other edge case I may want to try? It seems to me
that these differences are not really worth the effort of having a flag
controlling column information.


Thanks.  Diego.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.cs.uiuc.edu/pipermail/llvmdev/attachments/20140626/72251654/attachment-0001.html>

------------------------------

Message: 6
Date: Thu, 26 Jun 2014 11:10:38 -0600
From: Sanjay Patel <spatel at rotateright.com>
To: Manjunath DN <manjunath.dn at gmail.com>
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
Message-ID:
	<CA+wODitBvip=26U0psOze-FDFw+7FPvtSWqrWgBSbNq1vHA=FQ at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

>> We've also seen similar instances where multiple registers are used to
>> compute very similar addresses (such as x+0 and x+4!) and this increases
>> register pressure.

I don't have an ARM enabled build of the tools to test with, but I suspect
what I'm seeing here:
http://llvm.org/bugs/show_bug.cgi?id=20134

...would also be bad on AArch64.

On Wed, Jun 25, 2014 at 8:58 PM, Manjunath DN <manjunath.dn at gmail.com>
wrote:

> Hi James,
> Thanks for your reply and hints on what can be done for the AArch64
> backend optimization for LLVM. We have a SPEC license and v8 hardware, so
> I will start looking into it. Warm regards, Manjunath
>
>
>
> On Wed, Jun 25, 2014 at 8:42 PM, James Molloy <james.molloy at arm.com>
> wrote:
>
>> Hi Manjunath,
>>
>> At the time of writing that status we had only done our initial analysis.
>> This was done without real hardware and attempted to identify poor code
>> sequences but we were unable to quantify how much effect this would
>> actually
>> have.
>>
>> Since then we've done more analysis using Cortex-A57 and Cortex-A53 on an
>> internal development platform.
>>
>> For SPEC, we are between 10% and 0% behind GCC on 9 benchmarks, and 25%
>> ahead on one benchmark. Most benchmarks are less than 5% behind GCC.
>>
>> Because of the licencing of SPEC, I have to be quite restricted in what I
>> say and I can't give any numbers - sorry about that.
>>
>> We are focussing on Cortex-A57, and the things we've identified so far
>> are:
>>   * The CSEL instruction behaves worse than the equivalent branch
>> structure
>> in at least one benchmark. In an out of order core, select-like
>> instructions
>> are going to be slower than their branched equivalent if the branch is
>> predictable due to CSEL having two dependencies.
>>
>>   * Redundant calculations inside if conditions. We've seen:
>>     1. "if (a[x].b < c[y].d || a[x].e > c[y].f)" - the calculations of
>> a[x]
>> and c[y] are repeated, when they are common. We've also seen similar
>> instances where multiple registers are used to compute very similar
>> addresses (such as x+0 and x+4!) and this increases register pressure.
>>     2. "if (a < 0 && b == c || a > 0 && b == d)" - the first comparison of
>> 'a' against zero is done twice, when the flag results of the first
>> comparison could be used for the second comparison.
>>
>>   * For a loop such as "for (i = 0; i < n; ++i)
>> {do_something_with(&x[i]);}", GCC is using &x[i] as the loop induction
>> variable where LLVM uses i and performs the calculation &x[i] on every
>> iteration. This only creates one more add instruction but the loop we see
>> it
>> in only has 5 or so instructions.
>>
>>   * The inline heuristics are way behind GCC's. If we crank the inline
>> threshold up to 1000, we can remove a 6.5% performance regression from one
>> benchmark entirely.
>>
>>   * We're generating (due to SLP vectorizer and a DAG combine) loads into
>> Q
>> registers when merging consecutive loads. This is bad, because there are
>> no
>> callee-saved Q registers! So if the live range crosses a function call, it
>> will have to be immediately spilled again.  This can be easily fixed by
>> using load-pair instructions instead. I have a patch to fix this.
>>
>> The list above is non-exhaustive and only contains things that we think
>> may
>> affect multiple benchmarks or real-world code.
>>
>> I've also noticed:
>>   * Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0,
>> [..]" pairs, which is less than ideal on A53. If we switched to emitting
>> "LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30% better inline
>> memcpy performance on A53. A57 seems to deal well with the LDR q sequence.
>>
>> I'm sorry I'm unable to provide code samples for most of the issues found
>> so
>> far - this is an artefact of them having come from SPEC. Trivial examples
>> do
>> not always show the same behaviour, and as we're still investigating we
>> haven't yet been able to reduce most of these to an anonymisable testcase.
>>
>> Hope this helps, but doubt it does,
>>
>> James
>>
>> > -----Original Message-----
>> > From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
>> > Behalf Of Manjunath N
>> > Sent: 24 June 2014 10:45
>> > To: llvmdev at cs.uiuc.edu
>> > Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
>> >
>> >
>> >
>> > Eric Christopher <echristo <at> gmail.com> writes:
>> >
>> > >
>> > > > The big pain issues I see merging from ARM64 to AArch64 are:
>> > > > 1.      Apple have created a fairly complete scheduling model already
>> > > > for ARM64, and we'd have to merge the partial? model in AArch64 and
>> > > > theirs. We risk regressing performance on Apple's targets here, and
>> > > > we can't determine ourselves whether we have or not. This is not ideal.
>> > > > 2.      Porting over the DAG-to-DAG optimizations and any other
>> > > > optimizations that rely on the tablegen layout will be very tricky.
>> > > > 3.      The conditional compare pass is fairly comprehensive - we'd
>> > > > have to port that over or rewrite it and that would be a lot of work.
>> > > > 4.      A very quick analysis last night indicated that ARM64 has
>> > > > implemented just under half of the optimizations we discovered
>> > > > opportunities for in SPEC and EEMBC. That's a fairly comprehensive
>> > > > number of optimizations, and they won't all be easy to port.
>> > Eric,
>> > You mention that there are quite a few optimization opportunities in SPEC
>> > 2000/EEMBC.
>> > I am looking to optimize the AArch64 backend. Could you please let me know
>> > the big optimizations possible?
>> >
>> >
>> >
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>>
>>
>>
>>
>
>
> --
> =========================================
> warm regards,
> Manjunath DN
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>


-- 
Sanjay Patel
RotateRight, LLC
http://www.rotateright.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.cs.uiuc.edu/pipermail/llvmdev/attachments/20140626/481a77b2/attachment-0001.html>

------------------------------

Message: 7
Date: Thu, 26 Jun 2014 18:23:09 +0100
From: "James Molloy" <james.molloy at arm.com>
To: "'Sanjay Patel'" <spatel at rotateright.com>,	"Manjunath DN"
	<manjunath.dn at gmail.com>
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
Message-ID: <000501cf9163$4d5acd60$e8106820$@arm.com>
Content-Type: text/plain; charset="utf-8"

Hi Sanjay,

 

The behaviour I'm talking about I've actually pinned down to CodeGenPrepare
not working too well with ISAs that don't have a good scaled load. I have a
patch to fix it that is going through performance testing now.

 

Your testcase seems specific to x86; for AArch64 we get the rather spiffy:

 

_Z3fooPii:                              // @_Z3fooPii

// BB#0:                                // %entry

                add        w8, w1, #1              // =1

                add        w9, w1, #2              // =2

                ldr           w8, [x0, w8, sxtw #2]

                ldr           w9, [x0, w9, sxtw #2]

                add        w8, w9, w8

                str           w8, [x0, w1, sxtw #2]

                ret

 

The sext can be matched as part of the addressing mode for AArch64; maybe
it's something in CodeGenPrepare for x86 going awry?

 

Cheers,

 

James

 

From: Sanjay Patel [mailto:spatel at rotateright.com] 
Sent: 26 June 2014 18:11
To: Manjunath DN
Cc: James Molloy; llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend

 

>> We've also seen similar instances where multiple registers are used to
compute very similar
>> addresses (such as x+0 and x+4!) and this increases register pressure.

I don't have an ARM enabled build of the tools to test with, but I suspect
what I'm seeing here:
http://llvm.org/bugs/show_bug.cgi?id=20134

 

...would also be bad on AArch64.

 

On Wed, Jun 25, 2014 at 8:58 PM, Manjunath DN <manjunath.dn at gmail.com>
wrote:

Hi James,

Thanks for your reply and hints on what can be done for the AArch64 backend
optimization for LLVM.

We have a SPEC license and v8 hardware, so I will start looking into it.

warm regards

Manjunath

 

 

On Wed, Jun 25, 2014 at 8:42 PM, James Molloy <james.molloy at arm.com> wrote:

Hi Manjunath,

At the time of writing that status we had only done our initial analysis.
This was done without real hardware and attempted to identify poor code
sequences but we were unable to quantify how much effect this would actually
have.

Since then we've done more analysis using Cortex-A57 and Cortex-A53 on an
internal development platform.

For SPEC, we are between 10% and 0% behind GCC on 9 benchmarks, and 25%
ahead on one benchmark. Most benchmarks are less than 5% behind GCC.

Because of the licencing of SPEC, I have to be quite restricted in what I
say and I can't give any numbers - sorry about that.

We are focussing on Cortex-A57, and the things we've identified so far are:
  * The CSEL instruction behaves worse than the equivalent branch structure
in at least one benchmark. In an out of order core, select-like instructions
are going to be slower than their branched equivalent if the branch is
predictable due to CSEL having two dependencies.

  * Redundant calculations inside if conditions. We've seen:
    1. "if (a[x].b < c[y].d || a[x].e > c[y].f)" - the calculations of a[x]
and c[y] are repeated, when they are common. We've also seen similar
instances where multiple registers are used to compute very similar
addresses (such as x+0 and x+4!) and this increases register pressure.
    2. "if (a < 0 && b == c || a > 0 && b == d)" - the first comparison of
'a' against zero is done twice, when the flag results of the first
comparison could be used for the second comparison.

  * For a loop such as "for (i = 0; i < n; ++i)
{do_something_with(&x[i]);}", GCC is using &x[i] as the loop induction
variable where LLVM uses i and performs the calculation &x[i] on every
iteration. This only creates one more add instruction but the loop we see it
in only has 5 or so instructions.

  * The inline heuristics are way behind GCC's. If we crank the inline
threshold up to 1000, we can remove a 6.5% performance regression from one
benchmark entirely.

  * We're generating (due to SLP vectorizer and a DAG combine) loads into Q
registers when merging consecutive loads. This is bad, because there are no
callee-saved Q registers! So if the live range crosses a function call, it
will have to be immediately spilled again.  This can be easily fixed by
using load-pair instructions instead. I have a patch to fix this.

The list above is non-exhaustive and only contains things that we think may
affect multiple benchmarks or real-world code.

I've also noticed:
  * Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0,
[..]" pairs, which is less than ideal on A53. If we switched to emitting
"LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30% better inline
memcpy performance on A53. A57 seems to deal well with the LDR q sequence.

I'm sorry I'm unable to provide code samples for most of the issues found so
far - this is an artefact of them having come from SPEC. Trivial examples do
not always show the same behaviour, and as we're still investigating we
haven't yet been able to reduce most of these to an anonymisable testcase.

Hope this helps, but doubt it does,

James

> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
> Behalf Of Manjunath N
> Sent: 24 June 2014 10:45
> To: llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
>
>
>
> Eric Christopher <echristo <at> gmail.com> writes:
>
> >
> > > The big pain issues I see merging from ARM64 to AArch64 are:
> > > 1.      Apple have created a fairly complete scheduling model already
> for
> > > ARM64, and we'd have to merge the partial? model in AArch64 and
> theirs.
> We
> > > risk regressing performance on Apple's targets here, and we can't
> determine
> > > ourselves whether we have or not. This is not ideal.
> > > 2.      Porting over the DAG-to-DAG optimizations and any other
> > > optimizations that rely on the tablegen layout will be very tricky.
> > > 3.      The conditional compare pass is fairly comprehensive - we'd
> have
> to
> > > port that over or rewrite it and that would be a lot of work.
> > > 4.      A very quick analysis last night indicated that ARM64 has
> > > implemented just under half of the optimizations we discovered
> opportunities
> > > for in SPEC and EEMBC. That's a fairly comprehensive number of
> > > optimizations, and they won't all be easy to port.
> Eric,
> You mention that there are quite a few optimization opportunities in SPEC
> 2000/EEMBC.
> I am looking to optimize the AArch64 backend. Could you please let me know
> the big optimizations possible?
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev









-- 

=========================================
warm regards,
Manjunath DN


_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev




-- 
Sanjay Patel
RotateRight, LLC
http://www.rotateright.com 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.cs.uiuc.edu/pipermail/llvmdev/attachments/20140626/18633304/attachment-0001.html>

------------------------------

Message: 8
Date: Thu, 26 Jun 2014 11:42:53 -0600
From: Sanjay Patel <spatel at rotateright.com>
To: James Molloy <james.molloy at arm.com>
Cc: Manjunath DN <manjunath.dn at gmail.com>, llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
Message-ID:
	<CA+wODitbpgLHsOwbfk_VzGMKaA+iyFU4=oN2o1u6HSP+QLaerg at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Cool HW trick. :)
Are those 'sxtw' ops free?

I have to look at the HW manuals again, but I don't think x86-64 has that
capability.


On Thu, Jun 26, 2014 at 11:23 AM, James Molloy <james.molloy at arm.com> wrote:

> Hi Sanjay,
>
>
>
> The behaviour I'm talking about I've actually pinned down to
> CodeGenPrepare not working too well with ISAs that don't have a good
> scaled load. I have a patch to fix it that is going through performance
> testing now.
>
>
>
> Your testcase seems specific to x86; for AArch64 we get the rather spiffy:
>
>
>
> _Z3fooPii:                              // @_Z3fooPii
>
> // BB#0:                                // %entry
>
>                 add        w8, w1, #1              // =1
>
>                 add        w9, w1, #2              // =2
>
>                 ldr           w8, [x0, w8, sxtw #2]
>
>                 ldr           w9, [x0, w9, sxtw #2]
>
>                 add        w8, w9, w8
>
>                 str           w8, [x0, w1, sxtw #2]
>
>                 ret
>
>
>
> The sext can be matched as part of the addressing mode for AArch64; maybe
> it's something in CodeGenPrepare for x86 going awry?
>
>
>
> Cheers,
>
>
>
> James
>
>
>
> *From:* Sanjay Patel [mailto:spatel at rotateright.com]
> *Sent:* 26 June 2014 18:11
> *To:* Manjunath DN
> *Cc:* James Molloy; llvmdev at cs.uiuc.edu
>
> *Subject:* Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
>
>
>
> >> We've also seen similar instances where multiple registers are used to
> compute very similar
> >> addresses (such as x+0 and x+4!) and this increases register pressure.
>
> I don't have an ARM enabled build of the tools to test with, but I suspect
> what I'm seeing here:
> http://llvm.org/bugs/show_bug.cgi?id=20134
>
>
>
> ...would also be bad on AArch64.
>
>
>
> On Wed, Jun 25, 2014 at 8:58 PM, Manjunath DN <manjunath.dn at gmail.com>
> wrote:
>
> HI James,
>
> Thanks for your reply and hints on what can be done for the AArch64
> backend optimization for LLVM.
>
> We have SPEC license and v8 hardware. So I will start looking into it
>
> warm regards
>
> Manjunath
>
>
>
>
>
> On Wed, Jun 25, 2014 at 8:42 PM, James Molloy <james.molloy at arm.com>
> wrote:
>
> Hi Manjunath,
>
> At the time of writing that status we had only done our initial analysis.
> This was done without real hardware and attempted to identify poor code
> sequences, but we were unable to quantify how much effect this would
> actually have.
>
> Since then we've done more analysis using Cortex-A57 and Cortex-A53 on an
> internal development platform.
>
> For SPEC, we are between 0% and 10% behind GCC on 9 benchmarks, and 25%
> ahead on one benchmark. Most benchmarks are less than 5% behind GCC.
>
> Because of the licensing of SPEC, I have to be quite restricted in what I
> say and I can't give any numbers - sorry about that.
>
> We are focussing on Cortex-A57, and the things we've identified so far are:
>
>   * The CSEL instruction behaves worse than the equivalent branch structure
> in at least one benchmark. In an out-of-order core, select-like instructions
> are going to be slower than their branched equivalent when the branch is
> predictable, because CSEL carries dependencies on both of its inputs.
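[A minimal sketch of the select-like source pattern being described - illustrative, not taken from the benchmark. At -O2 an AArch64 compiler will typically emit a CSEL for the ternary form; the branched spelling can be faster when the branch predicts well, because only the taken operand sits on the critical path:]

```c
#include <assert.h>

/* Ternary form: typically lowered to CMP + CSEL, which must wait for
 * the flags AND both b and c before the result is available. */
int pick_select(int a, int b, int c) {
    return (a < 0) ? b : c;
}

/* Branched form: with a well-predicted branch, only the chosen operand
 * is on the dependency chain, so this can beat the CSEL version. */
int pick_branch(int a, int b, int c) {
    if (a < 0)
        return b;
    return c;
}
```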
>
>   * Redundant calculations inside if conditions. We've seen:
>     1. "if (a[x].b < c[y].d || a[x].e > c[y].f)" - the calculations of a[x]
> and c[y] are repeated, when they are common. We've also seen similar
> instances where multiple registers are used to compute very similar
> addresses (such as x+0 and x+4!) and this increases register pressure.
>     2. "if (a < 0 && b == c || a > 0 && b == d)" - the first comparison of
> 'a' against zero is done twice, when the flag results of the first
> comparison could be used for the second comparison.
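[Pattern 1 above can be sketched as follows - struct and field names are illustrative, not from SPEC. Hoisting the common element addresses into locals makes the sharing explicit and needs one register per base address instead of recomputing:]

```c
#include <assert.h>

struct S { int b, e; };
struct T { int d, f; };

/* As written in the source: a[x] and c[y] are each computed twice,
 * once per operand of the ||. */
int cond_naive(struct S *a, struct T *c, int x, int y) {
    return a[x].b < c[y].d || a[x].e > c[y].f;
}

/* With the common address computations hoisted: one address per
 * element, reused by both comparisons. */
int cond_hoisted(struct S *a, struct T *c, int x, int y) {
    const struct S *ax = &a[x];
    const struct T *cy = &c[y];
    return ax->b < cy->d || ax->e > cy->f;
}
```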
>
>   * For a loop such as "for (i = 0; i < n; ++i)
> {do_something_with(&x[i]);}", GCC is using &x[i] as the loop induction
> variable, where LLVM uses i and recomputes &x[i] on every iteration. This
> only creates one more add instruction, but the loop we see it in only has
> 5 or so instructions.
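[The two induction-variable shapes look like this in source form - a sketch, with a trivial stand-in for do_something_with:]

```c
#include <assert.h>

static void bump(int *p) { *p += 1; }   /* stand-in for do_something_with */

/* Index-based induction (the LLVM shape described above): the address
 * &x[i] is recomputed from x and i on every iteration. */
void walk_indexed(int *x, int n) {
    for (int i = 0; i < n; ++i)
        bump(&x[i]);
}

/* Pointer-based induction (the GCC shape): the address itself is the
 * induction variable, so no per-iteration address computation is needed. */
void walk_pointer(int *x, int n) {
    for (int *p = x, *end = x + n; p != end; ++p)
        bump(p);
}
```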
>
>   * The inline heuristics are way behind GCC's. If we crank the inline
> threshold up to 1000, we can remove a 6.5% performance regression from one
> benchmark entirely.
>
>   * We're generating (due to the SLP vectorizer and a DAG combine) loads into
> Q registers when merging consecutive loads. This is bad, because there are no
> callee-saved Q registers! So if the live range crosses a function call, it
> will have to be immediately spilled again. This can be easily fixed by
> using load-pair instructions instead. I have a patch to fix this.
>
> The list above is non-exhaustive and only contains things that we think may
> affect multiple benchmarks or real-world code.
>
> I've also noticed:
>   * Our inline memcpy expansion pass is emitting "LDR q0, [..]; STR q0,
> [..]" pairs, which is less than ideal on A53. If we switched to emitting
> "LDP x0, x1, [..]; STP x0, x1, [..]", we'd get around 30% better inline
> memcpy performance on A53. A57 seems to deal well with the LDR q sequence.
>
> I'm sorry I'm unable to provide code samples for most of the issues found so
> far - this is an artefact of them having come from SPEC. Trivial examples do
> not always show the same behaviour, and as we're still investigating we
> haven't yet been able to reduce most of these to an anonymisable testcase.
>
> Hope this helps, but doubt it does,
>
> James
>
> > -----Original Message-----
> > From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
> On
> > Behalf Of Manjunath N
> > Sent: 24 June 2014 10:45
> > To: llvmdev at cs.uiuc.edu
> > Subject: Re: [LLVMdev] Contributing the Apple ARM64 compiler backend
> >
> >
> >
> > Eric Christopher <echristo <at> gmail.com> writes:
> >
> > >
> > > > The big pain issues I see merging from ARM64 to AArch64 are:
> > > > 1. Apple have created a fairly complete scheduling model already for
> > > > ARM64, and we'd have to merge the partial model in AArch64 with
> > > > theirs. We risk regressing performance on Apple's targets here, and
> > > > we can't determine ourselves whether we have or not. This is not
> > > > ideal.
> > > > 2. Porting over the DAG-to-DAG optimizations and any other
> > > > optimizations that rely on the tablegen layout will be very tricky.
> > > > 3. The conditional compare pass is fairly comprehensive - we'd have
> > > > to port that over or rewrite it, and that would be a lot of work.
> > > > 4. A very quick analysis last night indicated that ARM64 has
> > > > implemented just under half of the optimizations we discovered
> > > > opportunities for in SPEC and EEMBC. That's a fairly comprehensive
> > > > number of optimizations, and they won't all be easy to port.
> > Eric,
> > You mention that there are quite a few optimization opportunities in SPEC
> > 2000/EEMBC.
> > I am looking to optimize the AArch64 backend. Could you please let me know
> > the biggest optimizations possible?
> >
> >
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
>
>
> --
>
> =========================================
> warm regards,
> Manjunath DN
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>
>
> --
> Sanjay Patel
> RotateRight, LLC
> http://www.rotateright.com
>



-- 
Sanjay Patel
RotateRight, LLC
http://www.rotateright.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.cs.uiuc.edu/pipermail/llvmdev/attachments/20140626/0a37126b/attachment.html>

------------------------------

_______________________________________________
LLVMdev mailing list
LLVMdev at cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev


End of LLVMdev Digest, Vol 120, Issue 71
****************************************



