[PATCH] Allow BB duplication threshold to be adjusted through JumpThreading's ctor

Owen Anderson resistor at mac.com
Tue Sep 30 01:01:55 PDT 2014


On Sep 30, 2014, at 12:51 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
>> From: "Michael Liao" <michael.liao at intel.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: reviews+D5444+public+de6f72cb2e4729d3 at reviews.llvm.org, spatel at rotateright.com, llvm-commits at cs.uiuc.edu,
>> nrotem at apple.com
>> Sent: Monday, September 29, 2014 11:10:38 PM
>> Subject: Re: [PATCH] Allow BB duplication threshold to be adjusted through JumpThreading's ctor
>> 
>> 
>> 
>> On Mon, 29 Sep, 2014 at 5:10 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>> ----- Original Message -----
>>>> From: "Michael Liao" <michael.liao at intel.com>
>>>> To: "michael liao" <michael.liao at intel.com>, nrotem at apple.com,
>>>> hfinkel at anl.gov
>>>> Cc: spatel at rotateright.com, llvm-commits at cs.uiuc.edu
>>>> Sent: Monday, September 29, 2014 6:34:36 PM
>>>> Subject: Re: [PATCH] Allow BB duplication threshold to be adjusted
>>>> through JumpThreading's ctor
>>>> 
>>>> Hi Hal
>>>> 
>>>> Yeah, "noduplicate" could prevent duplicating barrier calls, but
>>>> this patch wants to address a potential issue on processors with
>>>> divergent control flow, commonly found in GPUs, e.g. AMD/NVIDIA
>>>> ones. The scenario is that, if a BB is duplicated to exploit more
>>>> jump threading, targets with divergent CF may execute more
>>>> instructions if the condition is a divergent one.
>>>> 
>>>> As for updating that threshold from TTI, yeah, if we are
>>>> interested in that case, I could prepare another patch considering
>>>> both TTI and a user-specified threshold.
>>> 
>>> I suppose that I don't understand what you mean by "if we are
>>> interested." Generally speaking, ctor parameters are useful only for
>>> clients who are not using the standard optimization pipeline, and we'd
>>> like the standard optimization pipeline to generally work well for a
>>> wide range of targets. Thus, a TTI interface is preferred.
>> 
>> OK, I will add another patch with TTI support.
>> 
>>> 
>>> 
>>> From a cost modeling perspective, how can you tell whether the
>>> instruction duplication will be worthwhile? Can this be something
>>> like 2*(instruction costs) <= (branch cost)?
>> 
>> To be honest, I have no concrete answer, as the instruction cost may
>> change significantly after merging two BBs, which is not fully
>> considered in the current cost model; e.g., inst-fold may kick in
>> after duplicating that BB and fold all of its instructions. Probably
>> a better place to address that is a similar pass in the backend with
>> a detailed target model. So far, this patch only allows coarse
>> control of that threshold.
> 
> The problem of estimating what costs will be after instruction folding is faced by many mid-level passes, and while a machine-instruction-level pass could do a better job of cost modeling, such passes often run too late to enable other optimizations, to interact with inlining, and so on.
> 
> That having been said, currently getJumpThreadDuplicationCost does not use any of the current TTI-based cost-modeling infrastructure (it pre-dates TTI), and I agree that it will provide a poor estimate of the ultimate cost because it has no understanding of what the target will be able to fold. I suspect it would be better to make the function work more like CodeMetrics::analyzeBasicBlock, so that the target can inform the estimation of the cost of each instruction (even the base TTI implementation has some intelligence that can be applied). I suspect that providing getJumpThreadDuplicationCost with an actual target-informed method for estimating costs will ultimately yield better results for everyone.
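
For illustration only (the helper name here is made up, and it leans on a per-user TTI query along the lines of what CodeMetrics uses, rather than the existing getJumpThreadDuplicationCost logic), such a TTI-informed estimate might look roughly like:

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/IntrinsicInst.h"

  using namespace llvm;

  // Sketch only: estimate the cost of duplicating BB by asking TTI about
  // each instruction, in the spirit of CodeMetrics::analyzeBasicBlock.
  static unsigned estimateDuplicationCost(const BasicBlock *BB,
                                          const TargetTransformInfo &TTI) {
    unsigned Cost = 0;
    for (const Instruction &I : *BB) {
      // Debug intrinsics are free; don't count them against duplication.
      if (isa<DbgInfoIntrinsic>(&I))
        continue;
      // Let the target weigh in; even the base implementation knows that
      // many casts, GEPs, etc. are effectively free.
      Cost += TTI.getUserCost(&I);
    }
    return Cost;
  }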

If you want to go the cost model route, you’re going to want to introduce some concept of duplication cost for an instruction.  This cost would be marginal on most CPUs (the only cost is code size), but significant on GPUs (or CPUs programmed in an SPMD fashion), where duplicated instructions have a real cost, even if unexecuted, because they reduce the utilization of the machine’s vector width.
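
As a sketch of that idea only (this hook, its name, and the constants are hypothetical; no such interface exists in TTI, it is purely an illustration), a per-instruction duplication-cost query with a near-zero CPU default and a GPU-style override could look like:

  #include "llvm/IR/Instruction.h"

  using namespace llvm;

  // Hypothetical hook: the cost a target assigns to carrying a duplicated
  // copy of an instruction even when that copy is not executed on a given
  // lane (e.g. under divergent control flow on a GPU).
  struct DuplicationCostModel {
    virtual ~DuplicationCostModel() {}
    // CPU-like default: duplication mostly costs code size, so treat it
    // as essentially free when deciding whether to thread.
    virtual unsigned getDuplicationCost(const Instruction &) const {
      return 0;
    }
  };

  // A GPU-like (SPMD) target charges for every duplicated instruction,
  // since divergent execution keeps both copies resident and reduces the
  // utilization of the machine's vector width.
  struct GPUDuplicationCostModel : DuplicationCostModel {
    unsigned getDuplicationCost(const Instruction &) const override {
      return 1;
    }
  };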

—Owen
