[PATCH] Initial instrumentation based PGO implementation

Bob Wilson bob.wilson at apple.com
Fri Dec 6 15:01:21 PST 2013


On Dec 1, 2013, at 8:15 PM, Justin Bogner <mail at justinbogner.com> wrote:

> Hey list,
> 
> Attached are the initial patches for instrumentation based PGO. The
> first two are for clang, and the third for compiler-rt. 
> 
> This is an implementation of Bob's proposal from back in September. Most
> of the design can be read about here:
> 
>    http://lists.cs.uiuc.edu/pipermail/cfe-dev/2013-September/031825.html
> 
> I changed the flags to -fprofile-instr-use and -fprofile-instr-generate
> to match the style of Diego's sampling based PGO flag. I'm open to other
> suggestions, but do note that using -fprofile-generate introduces a bit
> of an awkward conflict with the GCC-like GCDA profiling that is also
> present in LLVM, since that's the flag GCC uses.
> 
> There are a few things that obviously still need work, but I think this
> is a pretty reasonable incremental step. Notably:
> 
> - The data format is inefficent plain text that isn't even particularly
>  easily understood by a human.
> - The compiler-rt support is fairly limited, and there's no way to
>  specify or change the profile data output file name.
> - Objective C is basically completely ignored.
> - The counters aren't scaled, so if they're greater than 32 bits they'll
>  just be truncated.
> - The code coverage aspect isn't included in this patch, but the
>  counters are set up so that the data should be sufficient other than
>  that there's no real way to convert back to source locations yet.
> 
> Let me know what you think. Thanks.

I’ve got a few minor comments below.  The only serious one is that I don’t think your handling of switch case-ranges and defaults is correct.  Otherwise, it looks good.  Since this is a pretty big change, I’d like John to give his approval before you commit, since he is the code owner for this part of clang.

Once the dust settles on this, we’re going to need to add some good documentation.  It really isn’t obvious how this all fits together.  I like the way you’ve revised the loop instrumentation to move some of the counters to less frequently executed code.  That will help make the instrumented code run faster.  I think we should investigate taking it one step further, but that can wait for a follow-up patch.  Specifically, I think we can avoid adding extra counters for breaks and continues in loops, since we should be able to compute the counts for those without extra instrumentation overhead.  We can discuss that later.

> diff --git a/include/clang/Driver/Options.td b/include/clang/Driver/Options.td
> index 9e7dc78..02d1087 100644
> --- a/include/clang/Driver/Options.td
> +++ b/include/clang/Driver/Options.td
> @@ -369,6 +369,13 @@ def fno_autolink : Flag <["-"], "fno-autolink">, Group<f_Group>,
>  def fprofile_sample_use_EQ : Joined<["-"], "fprofile-sample-use=">,
>      Group<f_Group>, Flags<[DriverOption, CC1Option]>,
>      HelpText<"Enable sample-based profile guided optimizations">;
> +def fprofile_instr_generate : Flag<["-"], "fprofile-instr-generate">,
> +    Group<f_Group>, Flags<[CC1Option]>,
> +    HelpText<"Instrument regions in each function to collect execution counts”>;

I know this comes from my initial patch, but I would prefer to drop the description of this as “instrumenting regions”.  That’s more of an internal implementation detail. How about just: “Generate instrumented code to collect execution counts”?

> +def fprofile_instr_use : Flag<["-"], "fprofile-instr-use">, Group<f_Group>;
> +def fprofile_instr_use_EQ : Joined<["-"], "fprofile-instr-use=">,
> +    Group<f_Group>, Flags<[CC1Option]>,
> +    HelpText<"Use region instrumentation data for profile-guided optimization”>;

Likewise.  How about just dropping the word “region” from the description?

> 
>  def fblocks : Flag<["-"], "fblocks">, Group<f_Group>, Flags<[CC1Option]>,
>    HelpText<"Enable the 'blocks' language feature">;
> diff --git a/include/clang/Frontend/CodeGenOptions.def b/include/clang/Frontend/CodeGenOptions.def
> index 78b825d..154bf44 100644
> --- a/include/clang/Frontend/CodeGenOptions.def
> +++ b/include/clang/Frontend/CodeGenOptions.def
> @@ -87,6 +87,9 @@ CODEGENOPT(OmitLeafFramePointer , 1, 0) ///< Set when -momit-leaf-frame-pointer
>  VALUE_CODEGENOPT(OptimizationLevel, 3, 0) ///< The -O[0-4] option specified.
>  VALUE_CODEGENOPT(OptimizeSize, 2, 0) ///< If -Os (==1) or -Oz (==2) is specified.
> 
> +CODEGENOPT(ProfileInstrRegions , 1, 0) ///< Instrument regions with execution
> +                                       ///< counts to use with PGO.
> +

How about renaming to “ProfileInstrGenerate” to match the option?

>    /// If -fpcc-struct-return or -freg-struct-return is specified.
>  ENUM_CODEGENOPT(StructReturnConvention, StructReturnConventionKind, 2, SRCK_Default)
...
> diff --git a/lib/CodeGen/CGStmt.cpp b/lib/CodeGen/CGStmt.cpp
> index 8325ddd..16d10a6 100644
> --- a/lib/CodeGen/CGStmt.cpp
> +++ b/lib/CodeGen/CGStmt.cpp
...
> @@ -573,16 +601,20 @@ void CodeGenFunction::EmitDoStmt(const DoStmt &S) {
>    JumpDest LoopExit = getJumpDestInCurrentScope("do.end");
>    JumpDest LoopCond = getJumpDestInCurrentScope("do.cond");
> 
> +  unsigned Counter = PGO.getRegionCounter(&S);
> +  RegionCounter Cnt(PGO, Counter);
> +
>    // Store the blocks to use for break and continue.
> -  BreakContinueStack.push_back(BreakContinue(LoopExit, LoopCond));
> +  BreakContinueStack.push_back(BreakContinue(LoopExit, LoopCond, Counter));
> 
> -  // Emit the body of the loop.

Why delete the comment?

>    llvm::BasicBlock *LoopBody = createBasicBlock("do.body");
>    EmitBlock(LoopBody);
> +  Cnt.beginRegion(Builder);
>    {
>      RunCleanupsScope BodyScope(*this);
>      EmitStmt(S.getBody());
>    }
> +  Cnt.adjustFallThroughCount();
> 
>    BreakContinueStack.pop_back();
> 
...
> @@ -923,7 +1004,10 @@ void CodeGenFunction::EmitCaseStmtRange(const CaseStmt &S) {
>    // FIXME: parameters such as this should not be hardcoded.
>    if (Range.ult(llvm::APInt(Range.getBitWidth(), 64))) {
>      // Range is small enough to add multiple switch instruction cases.
> +    uint64_t Total = CaseCnt.getCount() - CaseCnt.getParentCount();
>      for (unsigned i = 0, e = Range.getZExtValue() + 1; i != e; ++i) {
> +      if (PGO.haveRegionCounts())
> +        SwitchWeights->push_back(Total / e);
>        SwitchInsn->addCase(Builder.getInt(LHS), CaseDest);
>        LHS++;
>      }

You dropped the code that I had here to make sure that the sum of the weights adds up to the original count.  I’m not absolutely certain that is necessary, but it seems like you could get some bad results otherwise due to rounding errors.  Are you sure it doesn’t matter?

> @@ -948,7 +1032,10 @@ void CodeGenFunction::EmitCaseStmtRange(const CaseStmt &S) {
>      Builder.CreateSub(SwitchInsn->getCondition(), Builder.getInt(LHS));
>    llvm::Value *Cond =
>      Builder.CreateICmpULE(Diff, Builder.getInt(Range), "inbounds");
> -  Builder.CreateCondBr(Cond, CaseDest, FalseDest);
> +  uint64_t ThisCount = CaseCnt.getCount() - CaseCnt.getParentCount();
> +  uint64_t ElseCount = SwitchEntryCount - ThisCount;

Maybe I’m misunderstanding, but this looks wrong to me.  This branch is on the chain of range checks that ends in the default.  You can’t find the “else” count that way.

> +  Builder.CreateCondBr(Cond, CaseDest, FalseDest,
> +                       PGO.createBranchWeights(ThisCount, ElseCount));
> 
>    // Restore the appropriate insertion point.
>    if (RestoreBB)
...
> @@ -1031,6 +1131,13 @@ void CodeGenFunction::EmitDefaultStmt(const DefaultStmt &S) {
>    assert(DefaultBlock->empty() &&
>           "EmitDefaultStmt: Default block already defined?");
>    EmitBlock(DefaultBlock);
> +
> +  RegionCounter Cnt = getPGORegionCounter(&S);
> +  // We already put a dummy weight in for the default case when we created the
> +  // switch, so we just need to fill it in.
> +  if (PGO.haveRegionCounts())
> +    (*SwitchWeights)[0] = Cnt.getCount() - Cnt.getParentCount();
> +  Cnt.beginRegion(Builder);
>    EmitStmt(S.getSubStmt());
>  }
> 
Same issue as above: the switch weight for the default needs to include the counts for the case-ranges that were chained onto the default.  I think my patch had this right, but you lost that part.

...
> diff --git a/lib/CodeGen/CodeGenPGO.h b/lib/CodeGen/CodeGenPGO.h
> new file mode 100644
> index 0000000..35cdfe0
> --- /dev/null
> +++ b/lib/CodeGen/CodeGenPGO.h
> @@ -0,0 +1,202 @@
...
> +  /// Assign counters to regions and configure them for PGO of a given
> +  /// function. Does nothing if instrumentation is not enabled and either
> +  /// generates global variables or associates PGO data with each of the
> +  /// counters depending on whether we are generating or using instrumentation.
> +  void assignRegionCounters(GlobalDecl &GD);
> +  /// Emit code to write counts for a given function to disk, if necessary.
> +  void emitWriteoutFunction(GlobalDecl &GD);
> +  /// Clean up region counter state. Must be called if assignRegionCounters is
> +  /// used.
> +  void clearRegionCounters();

When I see the word “clear”, I tend to think of resetting the counts to zero, but this function is for freeing the storage. I would prefer a name that reflected that, perhaps “free”, “delete” or “deallocate”.

...
> +  /// Emit an increment to the counter. One should normally prefer beginRegion
> +  /// to emitIncrement, except for in cases where the counter's region is
> +  /// elided.
> +  void emitIncrement(CGBuilderTy &Builder) {
> +    PGO->emitCounterIncrement(Builder, Counter);
> +  }

Justin and I discussed this in person. I think there is no need for this function anymore.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20131206/d517ddf5/attachment.html>


More information about the cfe-commits mailing list