[PATCH] LLVM changes for indirect call target profiling support

Betul Buyukkurt betulb at codeaurora.org
Mon Jun 15 21:24:30 PDT 2015


>> "Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>> Hi Justin,
>>>
>>> Since the value profiling changes are very large and complex, and
>>> required multiple design iterations (i.e. IC-profiling -> value
>>> profiling -> raw file size reduction), I did not reply to much of the
>>> minor commentary: the design changes usually resulted in those fixes
>>> being changed again anyway. So I saved the typo, naming, wording and
>>> doc-extension comments to address at merge time. Today I posted the
>>> first of my merge-intended CLs for review, where all of your concerns
>>> are addressed.
>>
>> I'm not sure why we need yet another thread for that. In any case, two
>> things:
>
> I thought splitting the LLVM change into smaller CL's was what you
> proposed when you suggested (quoting from your email):
>
> "Finally, the LLVM patch is quite large. We can probably make it a bit
> easier to stage and review by splitting out a couple of things and doing
> them first:
>
> 1. Changing the __llvm_profile_data variables so they're no longer
>    const.
>
> 2. Adding the InstrProfRecord type and updating the unittests. Also,
>    this type should just go in the InstrProf.h header, no need for its
>    own.
>
> 3. Arguably, we could update the raw format and implement the
>    RawInstrProfReader changes (but ignoring the new data) first. If you
>    think this is worthwhile go ahead, but if it'll lead to too much
>    extra busywork it probably isn't worth it.
> "
>
> So I came up with another plan which I posted later in the same thread. Do
> you have any objections to the newer plan?

The list below is what I had proposed as the merge plan. I'm eager to iron
out all the problems and merge these changes soon. I think merging the
intrinsics alone is not a problem, because by the third patch the lowering
and the tests will all come together. The goal is to share manageably sized
CLs for final code review.

1. Intrinsic instruction for value profiling (a rough emission sketch
follows this list).
2. Changes to the data structures used by the readers/writers, and updates
to the readers to preserve the current formats.
3. Indexed reader/writer changes for version 3, i.e. the version that value
profiling will also use.
4. Lowering of the profiling intrinsic and generation of the Data variables
with value profiling entries. Raw reader updates for reading version 2 data.
5. compiler-rt changes for value profiling.

Changes 4 and 5 should go in together so as not to break the builds.

6. Clang changes for generating instrumentation and attaching the new
profile metadata.

7. Clang flag controls for the value profiling types (enabling/disabling
instrumentation/PGO use).
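
For reference, here is a rough sketch of what emitting the new intrinsic for
a single value site could look like (item 1 provides the intrinsic, item 6
would emit it). The operands follow the LangRef patch; the helper, its
parameters and the intrinsic enum name are illustrative only, not the actual
clang change:

  // Sketch only: emit one llvm.instrprof_instrument_target call for a value
  // site, per the proposed signature
  //   (i8* <name>, i64 <hash>, i32 <value_kind>, i64 <value>, i32 <index>).
  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Intrinsics.h"
  #include "llvm/IR/Module.h"
  #include <cstdint>

  static void emitValueSiteInstrumentation(
      llvm::IRBuilder<> &Builder, llvm::Module &M,
      llvm::Value *NamePtr,      // i8* into the function's name data
      uint64_t FuncHash,         // same hash the counter intrinsics use
      uint32_t ValueKind,        // e.g. the indirect-call-target kind
      llvm::Value *TargetValue,  // profiled value, already widened to i64
      uint32_t SiteIndex) {      // index of this value site within <name>
    llvm::Function *InstrumentTarget = llvm::Intrinsic::getDeclaration(
        &M, llvm::Intrinsic::instrprof_instrument_target);
    Builder.CreateCall(InstrumentTarget,
                       {NamePtr, Builder.getInt64(FuncHash),
                        Builder.getInt32(ValueKind), TargetValue,
                        Builder.getInt32(SiteIndex)});
  }

For an indirect call site, TargetValue would be the callee address converted
to i64 (e.g. via ptrtoint), and SiteIndex identifies which value site within
the function is being profiled.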

Thanks,
-Betul

> -Betul
>
>>
>> 1. There isn't much point in checking in the addition of the intrinsic
>>    separate from the lowering of the intrinsic, since the first patch
>>    doesn't actually do anything in that case.
>> 2. My comment about precalculating the indices below seems to still need
>>    discussion.
>>
>>> Thanks,
>>> -Betul
>>>
>>>> "Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>>>>>> +'``llvm.instrprof_instrument_target``' Intrinsic
>>>>>>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>>> +
>>>>>>> +Syntax:
>>>>>>> +"""""""
>>>>>>> +
>>>>>>> +::
>>>>>>> +
>>>>>>> +      declare void @llvm.instrprof_instrument_target(i8* <name>, i64 <hash>,
>>>>>>> +                                                     i32 <value_kind>,
>>>>>>> +                                                     i64 <value>, i32 <index>)
>>>>>>> +Overview:
>>>>>>> +"""""""""
>>>>>>> +
>>>>>>> +The '``llvm.instrprof_instrument_target``' intrinsic can be emitted by a
>>>>>>> +frontend for use with instrumentation based profiling. This will be
>>>>>>> +lowered by the ``-instrprof`` pass to find out the target values that
>>>>>>> +instrumented expressions take in a program at runtime.
>>>>>>> +
>>>>>>> +Arguments:
>>>>>>> +""""""""""
>>>>>>> +
>>>>>>> +The first argument is a pointer to a global variable containing the
>>>>>>> +name of the entity being instrumented. ``name`` should generally be the
>>>>>>> +(mangled) function name for a set of counters.
>>>>>>> +
>>>>>>> +The second argument is a hash value that can be used by the consumer
>>>>>>> +of the profile data to detect changes to the instrumented source. It
>>>>>>> +is an error if ``hash`` differs between two instances of
>>>>>>> +``instrprof_instrument_target`` that refer to the same name.
>>>>>>> +
>>>>>>> +The third argument represents the kind of the value profiling that is done.
>>>>>>> +The fourth argument is the value of the expression being profiled. The last
>>>>>>> +argument is the index of the instrumented expression within ``name``.
>>>>>>> +It should be >= 0.
>>>>>>
>>>>>> These arguments need further explanation.
>>>>>
>>>>> Not addressed in the latest patch.
>>>>
>>>> Please do explain these.
>>>>
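
To make the last three arguments concrete, here is what they would carry at
an indirect call site (the example function and the values are illustrative
only):

  // Illustrative source being instrumented:
  void dispatch(void (*Callback)(int)) {
    Callback(42);  // an indirect call -- value site 0 within "dispatch"
  }
  // For this site the intrinsic's operands would conceptually be:
  //   value_kind -> the indirect-call-target kind of value profiling
  //   value      -> the callee address read from Callback, widened to i64
  //   index      -> 0, the first value site within "dispatch"
  // A second indirect call in the same function would use index 1, and so on.
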
>>>>>>> +  data_type ReadDataV3(StringRef K, const unsigned char *D, offset_type N) {
>>>>>>>      using namespace support;
>>>>>>> -    // We just treat the data as opaque here. It's simpler to handle in
>>>>>>> -    // IndexedInstrProfReader.
>>>>>>> -    unsigned NumEntries = N / sizeof(uint64_t);
>>>>>>> -    DataBuffer.reserve(NumEntries);
>>>>>>> -    for (unsigned I = 0; I < NumEntries; ++I)
>>>>>>> -      DataBuffer.push_back(endian::readNext<uint64_t, little, unaligned>(D));
>>>>>>> -    return data_type(K, DataBuffer);
>>>>>>> +    const unsigned char *Start = D;
>>>>>>> +    if (D - Start + sizeof (uint64_t) > N)
>>>>>>> +      return data_type();
>>>>>>> +    uint64_t DataBufferSize = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>> +    DataBuffer.clear();
>>>>>>> +    DataBuffer.reserve(DataBufferSize);
>>>>>>> +    std::vector<uint64_t> CounterBuffer;
>>>>>>> +    std::vector<InstrProfValueRecord> ValueBuffer;
>>>>>>> +    for (unsigned I = 0; I < DataBufferSize; ++I) {
>>>>>>> +      if (D - Start + sizeof (uint64_t) + sizeof(uint32_t) >= N)
>>>>>>> +        return data_type();
>>>>>>> +      uint64_t Hash = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>> +      uint32_t CountsSize = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>> +      if (D - Start + CountsSize * sizeof(uint64_t) >= N)
>>>>>>> +        return data_type();
>>>>>>> +      CounterBuffer.clear();
>>>>>>> +      for (unsigned I = 0; I < CountsSize; ++I)
>>>>>>> +        CounterBuffer.push_back(endian::readNext<uint64_t, little, unaligned>(D));
>>>>>>> +
>>>>>>> +      DataBuffer.push_back(InstrProfRecord(K, Hash, CounterBuffer));
>>>>>>> +
>>>>>>> +      for (uint32_t Kind = instr_value_prof_kind::first;
>>>>>>> +           Kind < instr_value_prof_kind::last; ++Kind) {
>>>>>>> +        if (D - Start + 2 * sizeof(uint32_t) > N)
>>>>>>> +          return data_type();
>>>>>>> +        uint32_t ValueKind = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>> +        if (ValueKind != Kind)
>>>>>>> +          return data_type();
>>>>>>> +        uint32_t ValuesSize = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>> +        ValueBuffer.clear();
>>>>>>> +        for (unsigned I = 0; I < ValuesSize; ++I) {
>>>>>>> +          if (D - Start + sizeof(uint32_t) >= N) return data_type();
>>>>>>> +          uint32_t NameSize = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>> +          if (D - Start + NameSize * sizeof(char)>= N)
>>>>>>> +            return data_type();
>>>>>>> +
>>>>>>> +          std::string Name((const char *)D, NameSize);
>>>>>>> +          D += NameSize;
>>>>>>> +
>>>>>>> +          if (D - Start + 2 * sizeof(uint64_t) >= N)
>>>>>>> +            return data_type();
>>>>>>> +          uint64_t Value = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>> +          uint64_t NumTaken = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>> +
>>>>>>> +          if (D - Start + sizeof(uint32_t) > N)
>>>>>>> +            return data_type();
>>>>>>> +          uint32_t CounterIndex = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>> +          ValueBuffer.push_back(
>>>>>>> +            InstrProfValueRecord(Name, Value, NumTaken, CounterIndex));
>>>>>>> +        }
>>>>>>> +        DataBuffer[DataBuffer.size()-1].Values[Kind] = ValueBuffer;
>>>>>>> +     }
>>>>>>
>>>>>> I guess this'll be simpler without having to deal with the kinds, but
>>>>>> there's also a lot of "D - Start + ..." math in here that I think would
>>>>>> be easier to understand with a few more named variables.
>>>>>
>>>>> Not addressed in the latest patch.
>>>>
>>>> I guess this is waiting until we figure out how we're going forward with
>>>> the issue about the kinds, but I think you can make the math clearer in
>>>> the meantime regardless.
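
For reference, one way to name the math (sketch only; this slots into
ReadDataV3 where "using namespace support;" is in scope, and the helper name
is made up) would be to track the bytes remaining instead of recomputing
"D - Start + ...":

  const unsigned char *const End = D + N;
  auto BytesLeft = [&] { return (size_t)(End - D); };

  if (BytesLeft() < sizeof(uint64_t))
    return data_type();
  uint64_t DataBufferSize = endian::readNext<uint64_t, little, unaligned>(D);

  for (unsigned I = 0; I < DataBufferSize; ++I) {
    if (BytesLeft() < sizeof(uint64_t) + sizeof(uint32_t))
      return data_type();
    uint64_t Hash = endian::readNext<uint64_t, little, unaligned>(D);
    uint32_t CountsSize = endian::readNext<uint32_t, little, unaligned>(D);
    if (BytesLeft() < CountsSize * sizeof(uint64_t))
      return data_type();
    // ... remaining reads follow the same BytesLeft() pattern.
  }
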
>>>>
>>>>>>> +    }
>>>>>>> +    return DataBuffer;
>>>>>>>    }
>>>>>>>  };
>>>>>>> +
>>>>>>>  typedef OnDiskIterableChainedHashTable<InstrProfLookupTrait>
>>>>>>>      InstrProfReaderIndex;
>>>>>>> @@ -267,8 +409,6 @@
>>>>>>>    std::unique_ptr<InstrProfReaderIndex> Index;
>>>>>>>    /// Iterator over the profile data.
>>>>>>>    InstrProfReaderIndex::data_iterator RecordIterator;
>>>>>>
>>>>>> This name is a bit confusing now - it iterates over arrays of records,
>>>>>> not individual ones.
>>>>>
>>>>> Not addressed in the latest patch.
>>>>
>>>> Why not?
>>>>
>>>>>>> -  /// Offset into our current data set.
>>>>>>> -  size_t CurrentOffset;
>>>>>>>    /// The file format version of the profile data.
>>>>>>>    uint64_t FormatVersion;
>>>>>>>    /// The maximal execution count among all functions.
>>>>>>> @@ -278,7 +418,7 @@
>>>>>>>    IndexedInstrProfReader &operator=(const IndexedInstrProfReader &) = delete;
>>>>>>>  public:
>>>>>>>    IndexedInstrProfReader(std::unique_ptr<MemoryBuffer> DataBuffer)
>>>>>>> -      : DataBuffer(std::move(DataBuffer)), Index(nullptr), CurrentOffset(0) {}
>>>>>>> +      : DataBuffer(std::move(DataBuffer)), Index(nullptr) {}
>>>>>>>    /// Return true if the given buffer is in an indexed instrprof format.
>>>>>>>    static bool hasFormat(const MemoryBuffer &DataBuffer);
>>>>>>> @@ -288,9 +428,16 @@
>>>>>>>    /// Read a single record.
>>>>>>>    std::error_code readNextRecord(InstrProfRecord &Record) override;
>>>>>>> -  /// Fill Counts with the profile data for the given function name.
>>>>>>> +  /// Return profile data for the given function name and hash
>>>>>>>    std::error_code getFunctionCounts(StringRef FuncName, uint64_t FuncHash,
>>>>>>> -                                    std::vector<uint64_t> &Counts);
>>>>>>> +    std::vector<uint64_t> &Counts);
>>>>>>
>>>>>> It really does fill "Counts". It doesn't return anything, so the old doc
>>>>>> comment was better.
>>>>>
>>>>> Done.
>>>>>
>>>>>> Also the formatting is funny here and for the function below -
>>>>>> clang-format will fix that for you.
>>>>>
>>>>> Not addressed in the latest patch.
>>>>
>>>> Why not?
>>>>
>>>>>>> +  /// Count the number of instrumented value sites for the function.
>>>>>>> +  void computeNumValueSiteCounts(InstrProfInstrumentTargetInst *Ins);
>>>>>>> +
>>>>>>> +  /// Replace instrprof_instrument_target with a call to runtime library.
>>>>>>> +  void lowerInstrumentTargetInst(InstrProfInstrumentTargetInst *Ins);
>>>>>>> +
>>>>>>>    /// Replace instrprof_increment with an increment of the appropriate value.
>>>>>>>    void lowerIncrement(InstrProfIncrementInst *Inc);
>>>>>>> @@ -117,20 +132,39 @@
>>>>>>>    bool MadeChange = false;
>>>>>>>    this->M = &M;
>>>>>>> -  RegionCounters.clear();
>>>>>>> +  ProfileDataMap.clear();
>>>>>>>    UsedVars.clear();
>>>>>>> +  // We did not know how many value sites there would be inside
>>>>>>> +  // the instrumented function. This is counting the number of instrumented
>>>>>>> +  // target value sites to enter it as field in the profile data variable.
>>>>>>> +  for (Function &F : M)
>>>>>>> +    for (BasicBlock &BB : F)
>>>>>>> +      for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>> +        if (auto *Ind = dyn_cast<InstrProfInstrumentTargetInst>(I++))
>>>>>>> +          computeNumValueSiteCounts(Ind);
>>>>>>
>>>>>> The frontend should pass the number of value sites in the function in
>>>>>> the arguments to the intrinsic, like we do for the number of counters in
>>>>>> the increment intrinsic. In addition to being simpler, it's more robust
>>>>>> against things like dead code elimination which could remove the call to
>>>>>> an intrinsic completely.
>>>>>
>>>>> Unfortunately, this is not straightforward to do. We do not know the
>>>>> number of target expressions we'd like to profile prior to inserting the
>>>>> intrinsics. This is because AST level expressions are not one-to-one
>>>>> mappable to the generated LLVM IR level expressions. For indirect calls,
>>>>> for instance, the number of calls generated for a single CXXDestructor
>>>>> could be more than one.
>>>>>
>>>>> Instead, the current implementation assigns an index by incrementing the
>>>>> NumValueSites counter at each point where a target expression is detected.
>>>>> The index value is passed along to InstrProfiling.cpp as an argument of
>>>>> the intrinsic. NumValueSites is then determined by taking the max of all
>>>>> indices recorded for a function + 1.
>>>>
>>>> Getting the indices by incrementing NumValueSites as you go leads to a
>>>> few problems - this is actually how I initially implemented the counter
>>>> indices, but changed it after finding that it doesn't actually work.
>>>>
>>>> The problems occur if a call site is eliminated before we get to the
>>>> lowering, which can happen today from certain types of dead code
>>>> elimination and may become more common as we make improvements to how
>>>> lowering and optimizations interact. Numbering in the backend means that
>>>> the counter indices and even the number of counters may change depending
>>>> on optimization and essentially disassociates the results from the PGO
>>>> hash and the frontend locations. This loses any advantages of doing
>>>> instrumentation in the frontend, and can easily lead to unintentional
>>>> pessimization when we try to use the data since it's no longer mapped
>>>> correctly.
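
For context, the "max index + 1" scheme I described above boils down to
roughly this in the lowering pass (the accessors mirror
InstrProfIncrementInst and the field names are illustrative):

  void InstrProfiling::computeNumValueSiteCounts(
      InstrProfInstrumentTargetInst *Ins) {
    GlobalVariable *Name = Ins->getName();          // name operand
    uint32_t Index = Ins->getIndex()->getZExtValue();
    auto &PD = ProfileDataMap[Name];                // per-function entry
    // Number of value sites = largest site index seen so far + 1.
    if (Index >= PD.NumValueSites)
      PD.NumValueSites = Index + 1;
  }

If dead code elimination deletes the call carrying the largest index before
lowering, the derived count shrinks and the surviving indices no longer line
up with what the frontend hashed -- the fragility described above. Passing
the site count from the frontend, as instrprof_increment does with its
num-counters operand, avoids that.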
>>>>
>>>>>>> +
>>>>>>>    for (Function &F : M)
>>>>>>>      for (BasicBlock &BB : F)
>>>>>>>        for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>>          if (auto *Inc = dyn_cast<InstrProfIncrementInst>(I++)) {
>>>>>>>            lowerIncrement(Inc);
>>>>>>>            MadeChange = true;
>>>>>>>          }
>>>>>>> +
>>>>>>> +  for (Function &F : M)
>>>>>>> +    for (BasicBlock &BB : F)
>>>>>>> +      for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>> +        if (auto *Ind = dyn_cast<InstrProfInstrumentTargetInst>(I++)) {
>>>>>>> +          lowerInstrumentTargetInst(Ind);
>>>>>>> +          MadeChange = true;
>>>>>>> +        }
>>>>>>
>>>>>> It seems like we could do both of these in one loop, no?
>>>>>
>>>>> Not addressed in this patch.
>>>>
>>>> Why not?
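
For what it's worth, folding the two lowering walks into one would look
roughly like this (sketch):

  for (Function &F : M)
    for (BasicBlock &BB : F)
      for (auto I = BB.begin(), E = BB.end(); I != E;) {
        Instruction *Instr = &*I++;
        if (auto *Inc = dyn_cast<InstrProfIncrementInst>(Instr)) {
          lowerIncrement(Inc);
          MadeChange = true;
        } else if (auto *Ind = dyn_cast<InstrProfInstrumentTargetInst>(Instr)) {
          lowerInstrumentTargetInst(Ind);
          MadeChange = true;
        }
      }
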
>>>>
>>
>
>
>




