[LLVMdev] Dev Meeting BOF: Performance Tracking

Chad Rosier mcrosier at codeaurora.org
Tue Aug 5 08:05:09 PDT 2014


> Hi Chad,
>
>> I recall Daniel and I discussing this issue.  IIRC, we considered an eager
>> approach where the current build would rerun the benchmark to verify the
>> spikes.  However, I like the lazy detection approach you're suggesting.
>> This avoids long running builds when there are real regressions.
>
> I think the real issue behind this one is that it would change LNT from
> being a passive system to an active system. Currently the LNT tests can be
> run in any way one wishes, so long as a report is produced. Similarly, we
> can add other benchmarks to the report, which we currently do internally
> to avoid putting things like EEMBC into LNT's build system.
>
> With an "eager" approach as you mention, LNT would have to know how to ssh
> onto certain boxen, run the command and get the result back, which would
> be a ton of work to do well!!

Ah, yes.  That makes a great deal of sense.  Thanks, James.

>
> Cheers,
>
> James
>
> -----Original Message-----
> From: Chad Rosier [mailto:mcrosier at codeaurora.org]
> Sent: 05 August 2014 15:42
> To: Renato Golin
> Cc: Kristof Beyls; mcrosier at codeaurora.org; James Molloy; Yi Kong;
> llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Dev Meeting BOF: Performance Tracking
>
> Kristof,
> Unfortunately, our merge process is less than ideal.  It has vastly improved
> over the past few months (years, I hear), but we still have times where we
> bring in days'/weeks' worth of commits en masse.  To that end, I've set up a
> nightly performance run against the community branch, but it's still an
> overwhelming amount of work to track/report/bisect regressions.  As you
> guessed, this is what motivated my initial email.
>
>> On 5 August 2014 10:30, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
>>> The biggest problem that we were trying to solve this year was to
>>> produce data without too much noise. I think with Renato hopefully
>>> setting up a chromebook (Cortex-A15) soon there will finally be an
>>> ARM architecture board producing useful data and pushing it into the
>>> central database.
>>
>> I haven't got around to finishing that work (at least not reporting to
>> Perf anyway) because of the instability issues.
>>
>> I think getting Perf stable is priority 0 right now in the LLVM
>> benchmarking field.
>
> I agree 110%; we don't want the bots crying wolf.  Otherwise, real issues
> will fall on deaf ears.
>
>>> I think this should be the main topic of the BoF this year: now that
>> we can produce useful data, what do we do with the data to actually
>>> improve LLVM?
>>
>> With the LNT benchmarks reporting meaningful results and warning users
>> of spikes, I think we have at least the bases covered.
>
> I haven't used LNT in well over a year, but I recall Daniel Dunbar and I
> having many discussions on how LNT could be improved.  (Forgive me if any of
> my suggestions have already been addressed; I'm playing catch-up at the
> moment.)
>
>> Further improvements I can think of would be to:
>>
>> * Allow Perf/LNT to fix a set of "golden standards" based on past
>> releases
>> * Mark the levels of those standards on every graph as coloured
>> horizontal lines
>> * Add warning systems when the current values deviate from any past
>> golden standard
>
> I agree.  IIRC, there's functionality to set a baseline run to compare
> against.  Unfortunately, I think this is too coarse.  It would be great if
> the golden standard could be set on a per-benchmark basis.  Thus, upward-
> trending benchmarks can have their standard updated while other benchmarks
> remain static.
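>
> To make that concrete, here's the sort of thing I'm imagining, as a rough
> Python sketch rather than LNT's actual API (the class, fields and the 5%
> thresholds are just placeholders):
>
>     from dataclasses import dataclass
>
>     @dataclass
>     class GoldenStandard:
>         benchmark: str
>         exec_time: float  # reference execution time in seconds; lower is better
>
>     def regresses(result, standard, tolerance=0.05):
>         # Warn only if the new result is more than `tolerance` worse than
>         # this benchmark's own golden standard.
>         return result > standard.exec_time * (1.0 + tolerance)
>
>     def maybe_ratchet(standard, recent, window=5, improvement=0.05):
>         # Update the standard only for an upward-trending benchmark: the
>         # last `window` runs must all beat it by at least `improvement`.
>         tail = recent[-window:]
>         if len(tail) == window and all(
>                 t < standard.exec_time * (1.0 - improvement) for t in tail):
>             standard.exec_time = max(tail)
>         return standard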
>
>> * Allow Perf/LNT to report on differences between two distinct bots
>> * Create GCC buildbots with the same configurations/architectures and
>> compare them to LLVM's
>> * Mark golden standards for GCC releases, too, as a visual aid (no
>> warnings)
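>
> For the bot-to-bot and GCC comparisons, even something as simple as the
> following would be a start; the dictionaries here are hypothetical,
> distilled from whatever the two bots actually report:
>
>     def compare_bots(results_a, results_b):
>         # results_*: dict mapping benchmark name -> exec time in seconds.
>         # Returns the relative delta per benchmark; > 0 means bot A is slower.
>         deltas = {}
>         for name in sorted(results_a.keys() & results_b.keys()):
>             base = results_b[name]
>             if base > 0:
>                 deltas[name] = (results_a[name] - base) / base
>         return deltas
>
>     # e.g. compare_bots(clang_a15_results, gcc_a15_results) to see where
>     # we trail GCC on the same board and configuration.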
>>
>> * Implement trend detection (gradual decrease of performance) and
>> historical comparisons (against older releases)
>> * Implement warning systems to the admin (not users) for such trends
>
> Would it be useful to detect upward trends as well?  Per my comment above,
> it would be great to update the golden standard so we're always moving in
> the right direction.
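>
> For gradual trends, a least-squares slope over the last N runs is probably
> enough to flag both directions; roughly (the 1% threshold is invented):
>
>     def trend_slope(samples):
>         # Ordinary least-squares slope of exec time vs. run index;
>         # positive means the benchmark is getting slower over time.
>         n = len(samples)
>         xs = range(n)
>         mx = sum(xs) / n
>         my = sum(samples) / n
>         den = sum((x - mx) ** 2 for x in xs)
>         if den == 0:
>             return 0.0
>         return sum((x - mx) * (y - my) for x, y in zip(xs, samples)) / den
>
>     def classify_trend(samples, threshold=0.01):
>         if not samples:
>             return "stable"
>         mean = sum(samples) / len(samples)
>         rel = trend_slope(samples) / mean if mean else 0.0
>         if rel > threshold:
>             return "regressing"   # warn the admin
>         if rel < -threshold:
>             return "improving"    # candidate for bumping the golden standard
>         return "stable"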
>
>> * Improve spike detection to wait one or two more builds to make sure
>> the spike was an actual regression, but then email the original blame
>> list, not the current build's.
>
> I recall Daniel and I discussing this issue.  IIRC, we considered an eager
> approach where the current build would rerun the benchmark to verify the
> spikes.  However, I like the lazy detection approach you're suggesting.
> This avoids long running builds when there are real regressions.
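>
> Something along these lines is what I have in mind for the lazy approach,
> sketched in Python (none of this is LNT code; `blame_list` stands for
> whatever commit range LNT already attributes to a build):
>
>     pending = {}  # benchmark -> (original_blame_list, builds_seen)
>
>     def on_result(benchmark, value, baseline, blame_list,
>                   spike_factor=1.10, confirm_builds=2):
>         # Returns the blame list to email once a spike has persisted for
>         # `confirm_builds` consecutive builds; otherwise returns None.
>         spiked = value > baseline * spike_factor
>         if benchmark in pending:
>             original_blame, seen = pending[benchmark]
>             if not spiked:
>                 del pending[benchmark]        # noise: the spike went away
>                 return None
>             seen += 1
>             if seen >= confirm_builds:
>                 del pending[benchmark]
>                 return original_blame         # email the *original* blame list
>             pending[benchmark] = (original_blame, seen)
>             return None
>         if spiked:
>             pending[benchmark] = (blame_list, 1)  # remember who to blame later
>         return None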
>
>> * Implement this feature on all warnings (previous runs, golden
>> standards, GCC comparisons)
>>
>> * Renovate the list of tests and benchmarks, extending their run times
>> dynamically instead of running them multiple times, getting the times
>> for the core functionality instead of whole-program timing, etc.
>
> Could we create a minimal test-suite that includes only benchmarks that are
> known to have little variance and run times greater than some decided-upon
> threshold?  With that in place we could begin the performance tracking (and
> hopefully adoption) sooner.
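>
> For selecting that subset, something like this would do, assuming we keep a
> history of per-benchmark execution times (the 2% coefficient-of-variation
> and 5-second cutoffs are guesses to be tuned):
>
>     import statistics
>
>     def select_stable_benchmarks(history, max_cv=0.02, min_runtime=5.0):
>         # history: dict mapping benchmark name -> list of past exec times (s).
>         # Keep benchmarks whose run-to-run variance is small and whose mean
>         # run time is long enough to measure reliably.
>         selected = []
>         for name, times in history.items():
>             if len(times) < 2:
>                 continue
>             mean = statistics.mean(times)
>             if mean <= 0:
>                 continue
>             cv = statistics.stdev(times) / mean
>             if cv <= max_cv and mean >= min_runtime:
>                 selected.append(name)
>         return sorted(selected)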
>
>> I agree with Kristof that, with the world of benchmarks being what it
>> is, focusing on test-suite buildbots will probably give the best
>> return on investment for the community.
>>
>> cheers,
>> --renato
>
> Kristof/All,
> I would be more than happy to contribute to this BOF in any way I can.
>
>  Chad
>
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted
> by The Linux Foundation

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation



