submit [PATCH] SimplifyCFG for code review

Ye, Mei Mei.Ye at amd.com
Fri Jun 21 10:09:01 PDT 2013


In terms of performance, the same optimization can have different impacts on different micro-architecture implementations.  I have often observed this on Intel and AMD x86 architectures.  Different machine configurations (CPU frequency, memory bandwidth, etc.) also play a big role.  Optimizations that bring big gains on one generation of a micro-architecture can disappear in the next generation.

So the question is:  if proof is required, is there a standard for which architectures and which machine configurations the measurements should be taken on?

-Mei


From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Friday, June 21, 2013 12:41 AM
To: Ye, Mei
Cc: Evan Cheng; llvm-commits at cs.uiuc.edu
Subject: Re: submit [PATCH] SimplifyCFG for code review

On 21 June 2013 06:38, Ye, Mei <Mei.Ye at amd.com> wrote:
This transformation reduces branches.  It can benefit architectures with deep pipelines, less powerful branch predictors, or GPUs that suffer from branch divergence.  I have a GPU benchmark that shows roughly a 1.5x performance improvement.
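
To illustrate what I mean (this is only a generic sketch of branch flattening, not the exact code in the patch), a data-dependent branch, which causes divergence on a GPU, can often be folded into straight-line code with a select:

  // Branchy form: on a GPU, lanes of a warp can disagree on the
  // condition, so both sides end up executed serially (divergence).
  int branchy(int x, int a, int b) {
    if (x > 0)
      return a * 2;
    return b + 1;
  }

  // Flattened form: the same computation with the branch folded into a
  // select; every lane executes the same straight-line code.
  int flattened(int x, int a, int b) {
    int t = a * 2;
    int f = b + 1;
    return x > 0 ? t : f;   // typically lowers to a select/cmov, no branch
  }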

That is a big improvement on a specific benchmark on a very narrow category of targets. I share Evan's concerns, as quite often what's good for GPUs is not good for CPUs, and vice versa.


On the other hand, there are probably very few optimizations that benefit all architectures, and it is also unrealistic to require performance measurements on all architectures to justify every optimization.  What I am seeing is that compiler vendors tend to push as much code as possible into their own target-specific space, often at the expense of code quality, which minimizes code sharing and increases compilation time.

Indeed, and it's the maintainers' job to make sure such changes land properly. As it stands, I think this one could bring more harm than good, and you haven't provided much information to suggest otherwise.


There is definitely a need to enable target-specific tuning in machine-independent optimizations.  Is there a guideline on a good approach for doing this?  I have seen some cases that rely on threshold tuning, which can be non-deterministic and therefore unreliable.

Normally what people do is add it as a pass, add a flag that is disabled by default, and use it on their specific problems. If more and more people find it useful, front-ends can turn it on by default for some targets, or when special configurations are detected (e.g. when NEON/SSE is present and the pipeline is arranged in a particular way).
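
As a rough sketch (the option name here is made up purely for illustration), that usually amounts to a cl::opt guarding the new code path:

  #include "llvm/Support/CommandLine.h"

  using namespace llvm;

  // Hypothetical option name; off by default so existing users see no change.
  static cl::opt<bool> EnableBranchFlatten(
      "simplifycfg-branch-flatten", cl::Hidden, cl::init(false),
      cl::desc("Enable the branch-flattening transformation in SimplifyCFG"));

  // ...and inside the pass, the transformation only fires when requested:
  //   if (!EnableBranchFlatten)
  //     return false;

Interested targets or front-ends can then flip the default once there is enough evidence that it helps broadly.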

Whatever you add by default has to be proven beneficial on *most* configurations of *most* targets, not on a single GPU implementation. When that happens, you normally see an improvement of 1% or less, not 50%, but what matters more is that there are no performance regressions. If there are, it can't be enabled by default.

cheers,
--renato