[PATCH] LLVM changes for indirect call target profiling support

Justin Bogner mail at justinbogner.com
Mon Jun 15 21:46:40 PDT 2015


"Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>> "Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>>> Hi Justin,
>>>>
>>>> Since the value profiling changes are very large in size and
>>>> complexity and required multiple design iterations (i.e. IC-profiling
>>>> -> value profiling -> raw file size reduction), I did not reply to
>>>> much of the minor commentary, since the design changes usually
>>>> resulted in those fixes being changed again anyway. So I saved
>>>> addressing the typo, naming, wording, doc-extending etc. comments for
>>>> merge time. Today I've posted the first of my merge-intended CLs for
>>>> review, where all of your concerns are addressed.
>>>
>>> I'm not sure why we need yet another thread for that. In any case, two
>>> things:
>>
>> I thought splitting the LLVM change into smaller CL's was what you
>> proposed when you suggested (quoting from your email):
>>
>> "Finally, the LLVM patch is quite large. We can probably make it a bit
>> easier to stage and review by splitting out a couple of things and doing
>> them first:
>>
>> 1. Changing the __llvm_profile_data variables so they're no longer
>>    const.
>>
>> 2. Adding the InstrProfRecord type and updating the unittests. Also,
>>    this type should just go in the InstrProf.h header, no need for its
>>    own.
>>
>> 3. Arguably, we could update the raw format and implement the
>>    RawInstrProfReader changes (but ignoring the new data) first. If you
>>    think this is worthwhile go ahead, but if it'll lead to too much
>>    extra busywork it probably isn't worth it.
>> "
>>
>> So I came up with another plan which I posted later in the same thread. Do
>> you have any objections to the newer plan?

I approve of splitting up the patches - it just seems like there are ten
different email threads for the 3 or 4 patches we've already been
discussing ;) Sorry if I sound like I'm contradicting myself.

> The list below is what I had proposed for the merge. I'm eager to iron out
> all the problems and merge these changes soon. I think merging the
> intrinsics alone is not a problem, because by the third patch the lowering
> and the tests will all come together. The goal is to share manageable-size
> CLs for final code review.
>
> 1. Intrinsic instruction for value profiling.
> 2. Changes to the data structures used by the readers/writers, and
> updates to the readers to preserve the current formats.
> 3. Indexed reader/writer changes for version 3, i.e. the one that's going to
> be used by value profiling as well.
> 4. Lowering of the profiling intrinsic and generation of Data with value
> profiling entries. Raw reader updates for reading version 2 data.

There isn't much value in doing (1) on its own, so I'd probably do it
with (4) if we're doing things in this order. See below though.

> 5. compiler-rt changes for value profiling.
>
> The (4 & 5) changes should go in together so as not to break the builds.
>
> 6. Clang changes for generating instrumentation and attaching of new profile
> metadata.
>
> 7. Clang flag controls for value profiling types (enabling/disabling
> instrumentation/pgo use).

Note that (5) is hard to test without the changes in (7), since the tests
in compiler-rt today are generally integration-style "run the whole
compiler" tests. As such, maybe an order like this works
(I'll use letters so I can refer to your numbering from above):

A) Do (2) and change the reader writer data structures. This is
   basically NFC refactoring, so it's relatively independent.

B) Implement the indexed changes in (3). These should be independently
   testable, probably with unit tests (see the sketch after this list).

C) Do (1) in LLVM and (6) in clang. The intrinsic won't do anything yet,
   but we can write clang tests that check that it shows up in the right
   places, so it isn't useless.

D) Finish off with (7), (4), and (5). You need the changes in (7) so you
   can add tests in (5), and the changes in (4) so that (7) actually
   does anything.

Seven patches, four steps.
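
For (B), I'm picturing round-trip unit tests in the style of the existing
InstrProfTest. Here is a rough sketch of the shape I mean - assuming the
current InstrProfWriter/IndexedInstrProfReader API (addFunctionCounts,
writeBuffer, create, getFunctionCounts) keeps its current signatures, with
the value profile assertions layered on once the new fields exist:

  // Sketch only: write one function record, read it back through the
  // indexed reader, and check that the counters survive the round trip.
  #include "llvm/ProfileData/InstrProfReader.h"
  #include "llvm/ProfileData/InstrProfWriter.h"
  #include "gtest/gtest.h"
  using namespace llvm;

  TEST(IndexedProfileTest, CounterRoundTrip) {
    InstrProfWriter Writer;
    Writer.addFunctionCounts("foo", 0x1234, {1, 2, 3});
    std::unique_ptr<MemoryBuffer> Profile = Writer.writeBuffer();

    auto ReaderOrErr = IndexedInstrProfReader::create(std::move(Profile));
    ASSERT_FALSE(ReaderOrErr.getError());
    std::unique_ptr<IndexedInstrProfReader> Reader =
        std::move(ReaderOrErr.get());

    std::vector<uint64_t> Counts;
    ASSERT_FALSE(Reader->getFunctionCounts("foo", 0x1234, Counts));
    ASSERT_EQ(3U, Counts.size());
    ASSERT_EQ(1U, Counts[0]);
    ASSERT_EQ(2U, Counts[1]);
    ASSERT_EQ(3U, Counts[2]);
  }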

There might be one problem with this - there isn't currently a way to
turn off the instrprof lowering pass, so the tests in (C) will break
when we do (D). We should fix this so that clang can just test that it
emits the intrinsics correctly by disabling the lowering.
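
To make (C) concrete: once the clang side of (6) is in, and given some way
to skip the lowering (I'll spell it -disable-instrprof-lowering below, but
that flag is purely hypothetical - it doesn't exist yet, which is exactly
the problem), a frontend test could be as small as:

  // Hypothetical clang test for (C): check that an indirect call site gets
  // the new intrinsic, without depending on the lowering having run.
  // The -disable-instrprof-lowering flag is a placeholder, not a real flag.
  // RUN: %clang_cc1 -fprofile-instr-generate -disable-instrprof-lowering \
  // RUN:   -emit-llvm -o - %s | FileCheck %s
  void (*fp)(void);
  void foo(void) {
    // CHECK-LABEL: define void @foo(
    // CHECK: call void @llvm.instrprof_instrument_target(
    fp();
  }

That way the clang tests only check what clang emits, and the lowering
itself is covered by the LLVM and compiler-rt tests in (D).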

What do you think?

>
> Thanks,
> -Betul
>
>> -Betul
>>
>>>
>>> 1. There isn't much point in checking in the addition of the intrinsic
>>>    separate from the lowering of the intrinsic, since the first patch
>>>    doesn't actually do anything in that case.
>>> 2. My comment about precalculating the indices below seems to still need
>>>    discussion.
>>>
>>>> Thanks,
>>>> -Betul
>>>>
>>>>> "Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>>>>>>> +'``llvm.instrprof_instrument_target``' Intrinsic
>>>>>>>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>>>> +
>>>>>>>> +Syntax:
>>>>>>>> +"""""""
>>>>>>>> +
>>>>>>>> +::
>>>>>>>> +
>>>>>>>> +      declare void @llvm.instrprof_instrument_target(i8* <name>, i64 <hash>,
>>>>>>>> +                                                     i32 <value_kind>,
>>>>>>>> +                                                     i64 <value>, i32 <index>)
>>>>>>>> +Overview:
>>>>>>>> +"""""""""
>>>>>>>> +
>>>>>>>> +The '``llvm.instrprof_instrument_target``' intrinsic can be emitted by a
>>>>>>>> +frontend for use with instrumentation based profiling. This will be
>>>>>>>> +lowered by the ``-instrprof`` pass to find out the target values that
>>>>>>>> +instrumented expressions take in a program at runtime.
>>>>>>>> +
>>>>>>>> +Arguments:
>>>>>>>> +""""""""""
>>>>>>>> +
>>>>>>>> +The first argument is a pointer to a global variable containing the
>>>>>>>> +name of the entity being instrumented. ``name`` should generally be the
>>>>>>>> +(mangled) function name for a set of counters.
>>>>>>>> +
>>>>>>>> +The second argument is a hash value that can be used by the consumer
>>>>>>>> +of the profile data to detect changes to the instrumented source. It
>>>>>>>> +is an error if ``hash`` differs between two instances of
>>>>>>>> +``instrprof_instrument_target`` that refer to the same name.
>>>>>>>> +
>>>>>>>> +The third argument represents the kind of the value profiling that is done.
>>>>>>>> +The fourth argument is the value of the expression being profiled. The last
>>>>>>>> +argument is the index of the instrumented expression within ``name``.
>>>>>>>> +It should be >= 0.
>>>>>>>
>>>>>>> These arguments need further explanation.
>>>>>>
>>>>>> Not addressed in the latest patch.
>>>>>
>>>>> Please do explain these.
>>>>>
>>>>>>>> +  data_type ReadDataV3(StringRef K, const unsigned char *D, offset_type N) {
>>>>>>>>      using namespace support;
>>>>>>>> -    // We just treat the data as opaque here. It's simpler to handle in
>>>>>>>> -    // IndexedInstrProfReader.
>>>>>>>> -    unsigned NumEntries = N / sizeof(uint64_t);
>>>>>>>> -    DataBuffer.reserve(NumEntries);
>>>>>>>> -    for (unsigned I = 0; I < NumEntries; ++I)
>>>>>>>> -      DataBuffer.push_back(endian::readNext<uint64_t, little, unaligned>(D));
>>>>>>>> -    return data_type(K, DataBuffer);
>>>>>>>> +    const unsigned char *Start = D;
>>>>>>>> +    if (D - Start + sizeof (uint64_t) > N)
>>>>>>>> +      return data_type();
>>>>>>>> +    uint64_t DataBufferSize = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>>> +    DataBuffer.clear();
>>>>>>>> +    DataBuffer.reserve(DataBufferSize);
>>>>>>>> +    std::vector<uint64_t> CounterBuffer;
>>>>>>>> +    std::vector<InstrProfValueRecord> ValueBuffer;
>>>>>>>> +    for (unsigned I = 0; I < DataBufferSize; ++I) {
>>>>>>>> +      if (D - Start + sizeof (uint64_t) + sizeof(uint32_t) >= N)
>>>>>>>> +        return data_type();
>>>>>>>> +      uint64_t Hash = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>>> +      uint32_t CountsSize = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>>> +      if (D - Start + CountsSize * sizeof(uint64_t) >= N)
>>>>>>>> +        return data_type();
>>>>>>>> +      CounterBuffer.clear();
>>>>>>>> +      for (unsigned I = 0; I < CountsSize; ++I)
>>>>>>>> +        CounterBuffer.push_back(endian::readNext<uint64_t, little, unaligned>(D));
>>>>>>>> +
>>>>>>>> +      DataBuffer.push_back(InstrProfRecord(K, Hash, CounterBuffer));
>>>>>>>> +
>>>>>>>> +      for (uint32_t Kind = instr_value_prof_kind::first;
>>>>>>>> +           Kind < instr_value_prof_kind::last; ++Kind) {
>>>>>>>> +        if (D - Start + 2 * sizeof(uint32_t) > N)
>>>>>>>> +          return data_type();
>>>>>>>> +        uint32_t ValueKind = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>>> +        if (ValueKind != Kind)
>>>>>>>> +          return data_type();
>>>>>>>> +        uint32_t ValuesSize = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>>> +        ValueBuffer.clear();
>>>>>>>> +        for (unsigned I = 0; I < ValuesSize; ++I) {
>>>>>>>> +          if (D - Start + sizeof(uint32_t) >= N) return data_type();
>>>>>>>> +          uint32_t NameSize = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>>> +          if (D - Start + NameSize * sizeof(char) >= N)
>>>>>>>> +            return data_type();
>>>>>>>> +
>>>>>>>> +          std::string Name((const char *)D, NameSize);
>>>>>>>> +          D += NameSize;
>>>>>>>> +
>>>>>>>> +          if (D - Start + 2 * sizeof(uint64_t) >= N)
>>>>>>>> +            return data_type();
>>>>>>>> +          uint64_t Value = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>>> +          uint64_t NumTaken = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>>> +
>>>>>>>> +          if (D - Start + sizeof(uint32_t) > N)
>>>>>>>> +            return data_type();
>>>>>>>> +          uint32_t CounterIndex = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>>> +          ValueBuffer.push_back(
>>>>>>>> +            InstrProfValueRecord(Name, Value, NumTaken, CounterIndex));
>>>>>>>> +        }
>>>>>>>> +        DataBuffer[DataBuffer.size()-1].Values[Kind] = ValueBuffer;
>>>>>>>> +      }
>>>>>>>
>>>>>>> I guess this'll be simpler without having to deal with the kinds, but
>>>>>>> there's also a lot of "D - Start + ..." math in here that I think would be
>>>>>>> easier to understand with a few more named variables.
>>>>>>
>>>>>> Not addressed in the latest patch.
>>>>>
>>>>> I guess this is waiting until we figure out how we're going forward with
>>>>> the issue about the kinds, but I think you can make the math clearer in
>>>>> the meantime regardless.
>>>>>
>>>>>>>> +    }
>>>>>>>> +    return DataBuffer;
>>>>>>>>    }
>>>>>>>>  };
>>>>>>>> +
>>>>>>>>  typedef OnDiskIterableChainedHashTable<InstrProfLookupTrait>
>>>>>>>>      InstrProfReaderIndex;
>>>>>>>> @@ -267,8 +409,6 @@
>>>>>>>>    std::unique_ptr<InstrProfReaderIndex> Index;
>>>>>>>>    /// Iterator over the profile data.
>>>>>>>>    InstrProfReaderIndex::data_iterator RecordIterator;
>>>>>>>
>>>>>>> This name is a bit confusing now - it iterates over arrays of records,
>>>>>>> not individual ones.
>>>>>>
>>>>>> Not addressed in the latest patch.
>>>>>
>>>>> Why not?
>>>>>
>>>>>>>> -  /// Offset into our current data set.
>>>>>>>> -  size_t CurrentOffset;
>>>>>>>>    /// The file format version of the profile data.
>>>>>>>>    uint64_t FormatVersion;
>>>>>>>>    /// The maximal execution count among all functions.
>>>>>>>> @@ -278,7 +418,7 @@
>>>>>>>>    IndexedInstrProfReader &operator=(const IndexedInstrProfReader &) = delete;
>>>>>>>>  public:
>>>>>>>>    IndexedInstrProfReader(std::unique_ptr<MemoryBuffer> DataBuffer)
>>>>>>>> -      : DataBuffer(std::move(DataBuffer)), Index(nullptr), CurrentOffset(0) {}
>>>>>>>> +      : DataBuffer(std::move(DataBuffer)), Index(nullptr) {}
>>>>>>>>    /// Return true if the given buffer is in an indexed instrprof format.
>>>>>>>>    static bool hasFormat(const MemoryBuffer &DataBuffer);
>>>>>>>> @@ -288,9 +428,16 @@
>>>>>>>>    /// Read a single record.
>>>>>>>>    std::error_code readNextRecord(InstrProfRecord &Record) override;
>>>>>>>> -  /// Fill Counts with the profile data for the given function name.
>>>>>>>> +  /// Return profile data for the given function name and hash
>>>>>>>>    std::error_code getFunctionCounts(StringRef FuncName, uint64_t FuncHash,
>>>>>>>> -                                    std::vector<uint64_t> &Counts);
>>>>>>>> +    std::vector<uint64_t> &Counts);
>>>>>>>
>>>>>>> It really does fill "Counts". It doesn't return anything, so the old doc
>>>>>>> comment was better.
>>>>>>
>>>>>> Done.
>>>>>>
>>>>>>> Also the formatting is funny here and for the function below -
>>>>>>> clang-format will fix that for you.
>>>>>>
>>>>>> Not addressed in the latest patch.
>>>>>
>>>>> Why not?
>>>>>
>>>>>>>> +  /// Count the number of instrumented value sites for the function.
>>>>>>>> +  void computeNumValueSiteCounts(InstrProfInstrumentTargetInst *Ins);
>>>>>>>> +
>>>>>>>> +  /// Replace instrprof_instrument_target with a call to runtime library.
>>>>>>>> +  void lowerInstrumentTargetInst(InstrProfInstrumentTargetInst *Ins);
>>>>>>>> +
>>>>>>>>    /// Replace instrprof_increment with an increment of the appropriate value.
>>>>>>>>    void lowerIncrement(InstrProfIncrementInst *Inc);
>>>>>>>> @@ -117,20 +132,39 @@
>>>>>>>>    bool MadeChange = false;
>>>>>>>>    this->M = &M;
>>>>>>>> -  RegionCounters.clear();
>>>>>>>> +  ProfileDataMap.clear();
>>>>>>>>    UsedVars.clear();
>>>>>>>> +  // We did not know how many value sites there would be inside
>>>>>>>> +  // the instrumented function. This is counting the number of instrumented
>>>>>>>> +  // target value sites to enter it as field in the profile data variable.
>>>>>>>> +  for (Function &F : M)
>>>>>>>> +    for (BasicBlock &BB : F)
>>>>>>>> +      for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>>> +        if (auto *Ind = dyn_cast<InstrProfInstrumentTargetInst>(I++))
>>>>>>>> +          computeNumValueSiteCounts(Ind);
>>>>>>>
>>>>>>> The frontend should pass the number of value sites in the function in
>>>>>>> the arguments to the intrinsic, like we do for the number of counters in
>>>>>>> the increment intrinsic. In addition to being simpler, it's more robust
>>>>>>> against things like dead code elimination which could remove the call to
>>>>>>> an intrinsic completely.
>>>>>>
>>>>>> Unfortunately, this is not straightforwardly doable. We do not know the
>>>>>> number of target expressions we'd like to profile prior to inserting the
>>>>>> intrinsics. This is due to the fact that AST level expressions are not
>>>>>> one to one mappable to the generated LLVM IR level expressions. In the
>>>>>> case of indirect calls, for instance, the number of calls generated per a
>>>>>> single CXXDestructor could be more than one.
>>>>>>
>>>>>> Instead, the current implementation assigns an index by incrementing the
>>>>>> NumValueSites counter at each point where a target expression is detected.
>>>>>> The index value is passed along to InstrProfiling.cpp as an argument
>>>>>> of the intrinsic. Then NumValueSites is determined by taking the max of
>>>>>> all indices recorded for a function + 1.
>>>>>
>>>>> Getting the indices by incrementing NumValueSites as you go leads to a
>>>>> few problems - this is actually how I initially implemented the counter
>>>>> indices, but changed it after finding that it doesn't actually work.
>>>>>
>>>>> The problems occur if a call site is eliminated before we get to the
>>>>> lowering, which can happen today from certain types of dead code
>>>>> elimination and may become more common as we make improvements to how
>>>>> lowering and optimizations interact. Numbering in the backend means that
>>>>> the counter indices and even the number of counters may change depending
>>>>> on optimization and essentially disassociates the results from the PGO
>>>>> hash and the frontend locations. This loses any advantages of doing
>>>>> instrumentation in the frontend, and can easily lead to unintentional
>>>>> pessimization when we try to use the data since it's no longer mapped
>>>>> correctly.
>>>>>
>>>>>>>> +
>>>>>>>>    for (Function &F : M)
>>>>>>>>      for (BasicBlock &BB : F)
>>>>>>>>        for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>>>          if (auto *Inc = dyn_cast<InstrProfIncrementInst>(I++)) {
>>>>>>>>            lowerIncrement(Inc);
>>>>>>>>            MadeChange = true;
>>>>>>>>          }
>>>>>>>> +
>>>>>>>> +  for (Function &F : M)
>>>>>>>> +    for (BasicBlock &BB : F)
>>>>>>>> +      for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>>> +        if (auto *Ind = dyn_cast<InstrProfInstrumentTargetInst>(I++)) {
>>>>>>>> +          lowerInstrumentTargetInst(Ind);
>>>>>>>> +          MadeChange = true;
>>>>>>>> +        }
>>>>>>>
>>>>>>> It seems like we could do both of these in one loop, no?
>>>>>>
>>>>>> Not addressed in this patch.
>>>>>
>>>>> Why not?
>>>>>
>>>
>>
>>
>>


