[LLVMdev] Dev Meeting BOF: Performance Tracking

James Molloy james.molloy at arm.com
Tue Aug 5 07:45:53 PDT 2014


Hi Chad,

> I recall Daniel and I discussing this issue.  IIRC, we considered an eager
> approach where the current build would rerun the benchmark to verify the
> spikes.  However, I like the lazy detection approach you're suggesting.
> This avoids long running builds when there are real regressions.

I think the real issue behind this one is that it would change LNT from
being a passive system to an active system. Currently the LNT tests can be
run in any way one wishes, so long as a report is produced. Similarly, we
can add other benchmarks to the report, which we currently do internally to
avoid putting things like EEMBC into LNT's build system.

With an "eager" approach as you mention, LNT would have to know how to ssh
onto certain boxen, run the command and get the result back. Which would be
a ton of work to do well!
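
To make the distinction concrete, here is a rough Python sketch of the
"passive" flow I mean: the producer runs things however it likes and the
server only ever sees a finished report.  The field names, machine/test
names and the endpoint URL below are all made up for illustration, not
LNT's real schema.

import json
import subprocess
import time
import urllib.request

def time_benchmark(cmd):
    """Run one benchmark command (e.g. on a remote board via ssh) and
    time it from the outside."""
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

report = {
    "machine": "internal-a15-board",                 # hypothetical machine
    "run_order": "r214800",                          # hypothetical revision
    "tests": [
        {"name": "eembc.some_kernel.exec_time",      # hypothetical test
         "value": time_benchmark(["ssh", "board", "./some_kernel"])},
    ],
}

# The server never drives the run; it just receives whatever we produced.
req = urllib.request.Request(
    "http://lnt.example.org/submitRun",              # hypothetical endpoint
    data=json.dumps(report).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)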

Cheers,

James

-----Original Message-----
From: Chad Rosier [mailto:mcrosier at codeaurora.org] 
Sent: 05 August 2014 15:42
To: Renato Golin
Cc: Kristof Beyls; mcrosier at codeaurora.org; James Molloy; Yi Kong;
llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Dev Meeting BOF: Performance Tracking

Kristof,
Unfortunately, our merge process is less than ideal.  It has vastly improved
over the past few months (years, I hear), but we still have times where we
bring in days' or weeks' worth of commits en masse.  To that end, I've set up
a nightly performance run against the community branch, but it's still an
overwhelming amount of work to track/report/bisect regressions.  As you
guessed, this is what motivated my initial email.

> On 5 August 2014 10:30, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
>> The biggest problem that we were trying to solve this year was to 
>> produce data without too much noise. I think with Renato hopefully 
>> setting up a chromebook (Cortex-A15) soon there will finally be an 
>> ARM architecture board producing useful data and pushing it into the
>> central database.
>
> I haven't got around to finishing that work (at least not reporting to
> Perf anyway) because of the instability issues.
>
> I think getting Perf stable is priority 0 right now in the LLVM 
> benchmarking field.

I agree 110%; we don't want the bots crying wolf.  Otherwise, real issues
will fall on deaf ears.

>> I think this should be the main topic of the BoF this year: now that 
>> we can produce useful data; what do we do with the data to actually 
>> improve LLVM?
>
> With the benchmark LNT reporting meaningful results and warning users 
> of spikes, I think we have at least the base covered.

I haven't used LNT in well over a year, but I recall Daniel Dunbar and I
having many discussions on how LNT could be improved.  (Forgive me if any of
my suggestions have already been addressed; I'm playing catch-up at the
moment.)

> Further improvements I can think of would be to:
>
> * Allow Perf/LNT to fix a set of "golden standards" based on past 
> releases
> * Mark the levels of those standards on every graph as coloured 
> horizontal lines
> * Add warning systems when the current values deviate from any past 
> golden standard

I agree.  IIRC, there's functionality to set a baseline run to compare
against.  Unfortunately, I think that's too coarse.  It would be great if
the golden standard could be set on a per-benchmark basis, so that
upward-trending benchmarks can have their standard updated while other
benchmarks remain static.
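
Roughly what I have in mind, as a Python-ish sketch (not LNT code; the
benchmark names, the example times and the 3% tolerance are all invented):

GOLDEN = {
    # benchmark name -> best accepted execution time, in seconds
    "SingleSource/Benchmarks/Shootout/nestedloop": 1.92,
    "MultiSource/Benchmarks/SciMark2-C/scimark2": 41.30,
}

TOLERANCE = 0.03    # slack to absorb ordinary run-to-run noise

def find_regressions(results):
    """results: dict of benchmark name -> measured execution time."""
    regressions = []
    for name, measured in results.items():
        golden = GOLDEN.get(name)
        if golden is not None and measured > golden * (1 + TOLERANCE):
            regressions.append((name, golden, measured))
    return regressions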

> * Allow Perf/LNT to report on differences between two distinct bots
> * Create GCC buildbots with the same configurations/architectures and 
> compare them to LLVM's
> * Mark golden standards for GCC releases, too, as a visual aid (no
> warnings)
>
> * Implement trend detection (gradual decrease of performance) and 
> historical comparisons (against older releases)
> * Implement warning systems to the admin (not users) for such trends

Would it be useful to detect upward trends as well?  Per my comment above,
it would be great to update the golden standard so we're always moving in
the right direction.
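
Along the same lines, a small sketch of ratcheting a standard upward only
after an improvement has held for a few consecutive runs, so a single lucky
run can't tighten the bar (again invented, including the streak length and
margin):

from collections import defaultdict

golden = {}                        # benchmark -> best accepted time (seconds)
improve_streak = defaultdict(int)  # consecutive runs beating the standard

def observe(name, measured, required_streak=3, margin=0.02):
    best = golden.get(name)
    if best is None:
        golden[name] = measured          # first observation seeds the standard
        return
    if measured < best * (1 - margin):   # clearly faster than the standard
        improve_streak[name] += 1
        if improve_streak[name] >= required_streak:
            golden[name] = measured      # adopt the tighter standard
            improve_streak[name] = 0
    else:
        improve_streak[name] = 0         # the improvement did not hold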

> * Improve spike detection to wait one or two more builds to make sure 
> the spike was an actual regression, but then email the original blame 
> list, not the current builds' one.

I recall Daniel and I discussing this issue.  IIRC, we considered an eager
approach where the current build would rerun the benchmark to verify the
spikes.  However, I like the lazy detection approach you're suggesting.
This avoids long running builds when there are real regressions.
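
To spell out how I read the lazy approach (a sketch only, not LNT's actual
implementation; the confirmation count and the notify helper are invented):
remember the suspected spike together with the blame list of the run that
introduced it, and only send mail once a couple of later runs agree.

CONFIRMATIONS_NEEDED = 2    # extra builds that must reproduce the spike

pending = {}                # benchmark -> {"seen": int, "blame": [revisions]}

def notify(benchmark, blame):
    # Stand-in for whatever actually sends the regression mail.
    print("regression in %s, notifying %s" % (benchmark, ", ".join(blame)))

def on_result(benchmark, regressed, current_blame_list):
    if regressed:
        entry = pending.setdefault(
            benchmark, {"seen": 0, "blame": list(current_blame_list)})
        entry["seen"] += 1
        if entry["seen"] >= 1 + CONFIRMATIONS_NEEDED:
            # Confirmed: mail the ORIGINAL blame list, not whoever happened
            # to land in the confirming builds.
            notify(benchmark, entry["blame"])
            del pending[benchmark]
    else:
        # The spike did not reproduce; treat it as noise and forget it.
        pending.pop(benchmark, None)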

> * Implement this feature on all warnings (previous runs, golden 
> standards, GCC comparisons)
>
> * Renovate the list of tests and benchmarks, extending their run times 
> dynamically instead of running them multiple times, getting the times 
> for the core functionality instead of whole-program timing, etc.

Could we create a minimal test-suite that includes only benchmarks that are
known to have little variance and run times greater than some agreed-upon
threshold?  With that in place we could begin the performance tracking (and
hopefully adoption) sooner.
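
A sketch of how such a minimal suite could be picked from historical data
(the thresholds and the shape of the history data are invented):

import statistics

MIN_RUNTIME = 5.0        # seconds; very short tests are dominated by noise
MAX_REL_STDDEV = 0.01    # keep benchmarks with stddev under 1% of the mean

def select_stable(history):
    """history: dict of benchmark name -> list of past execution times."""
    keep = []
    for name, samples in history.items():
        if len(samples) < 5:
            continue         # too little data to judge the variance
        mean = statistics.mean(samples)
        if mean >= MIN_RUNTIME and statistics.stdev(samples) / mean <= MAX_REL_STDDEV:
            keep.append(name)
    return sorted(keep)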

> I agree with Kristof that, with the world of benchmarks being what it 
> is, focusing on test-suite buildbots will probably give the best 
> return on investment for the community.
>
> cheers,
> --renato

Kristof/All,
I would be more than happy to contribute to this BOF in any way I can.

 Chad


--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by
The Linux Foundation
