[PATCH] LLVM changes for indirect call target profiling support

Wed Jun 17 20:29:16 PDT 2015

> "Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>>> "Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>>>> Hi Justin,
>>>>>
>>>>> Since value profiling changes are very large in size and complexity
>>>>> wise
>>>>> required multiple design iterations (i.e. IC-profiling -> value
>>>> profiling
>>>>> -> raw file size reduction), I did not reply to much of the minor
>>>>> commentary since the design changes usually resulted in those fixes
>>>> being
>>>>> changed again anyway. So I saved any of those typo, naming, wording,
>>>>> doc
>>>>> extending etc related comments addressing for merge time. Today I've
>>>>> posted first of my merge intended CL's for review where all of your
>>>>> concerns are addressed.
>>>>
>>>> I'm not sure why we need yet another thread for that. In any case, two
>>>> things:
>>>
>>> I thought splitting the LLVM change into smaller CL's was what you
>>> proposed when you suggested (quoting from your email):
>>>
>>> "Finally, the LLVM patch is quite large. We can probably make it a bit
>>> easier to stage and review by splitting out a couple of things and
> doing
>>> them first:
>>>
>>> 1. Changing the __llvm_profile_data variables so they're no longer
>>>    const.
>>>
>>> 2. Adding the InstrProfRecord type and updating the unittests. Also,
>>>    this type should just go in the InstrProf.h header, no need for its
>>>    own.
>>>
>>> 3. Arguably, we could update the raw format and implement the
>>>    RawInstrProfReader changes (but ignoring the new data) first. If you
>>>    think this is worthwhile go ahead, but if it'll lead to too much
>>>    extra busywork it probably isn't worth it.
>>> "
>>>
>>> So I came up with another plan which I posted later in the same thread.
> Do
>>> you have any objections to the newer plan?
>
> Splitting up the patches I approve of - it just seems like there are ten
> different email threads for the 3 or 4 patches we've already been
> discussing ;) Sorry if I sound like I'm contradicting myself.
>
>> Below list is what I had proposed as the merge. I'm eager to iron out
> all
>> the problems and merge these changes soon. I think merging the
> intrinsics
>> alone is not a problem because by the third patch the lowering and the
>> tests will all come together. The goal is to share manageable size CL's
>> for final code review.
>>
>> 1. Intrinsic instruction for value profiling.
>> 2. Making changes to the data structures used by the readers/writers,
> and
>> updating of the readers to preserve the current formats.
>> 3. Indexed reader/writer changes for version 3 i.e. the one that's going
> to
>> be used by the value profiling as well
>> 4. Lowering of profiling intrinsic and generation of Data w/ value
> profiling
>> entries. Raw reader updates for reading version 2 data.
>
> There isn't much value in doing (1) on its own, so I'd probably do it
> with (4) if we're doing things in this order. See below though.
>
>> 5. compiler-rt changes for value profiling.
>>
>> (4 & 5) changes should go in together not to break the builds.
>>
>> 6. Clang changes for generating instrumentation and attaching of new
> profile
>> metadata.
>>
>> 7. Clang flag controls for value profiling types (enabling/disabling
>> instrumentation/pgo use).
>
> Note that (5) is hard to test without the changes in (7), due to how the
> tests in compiler-rt are generally integration-style "run the whole
> compiler" type tests today. As such, maybe an order like this works
> (I'll use letters so I can refer to your numbering from above):
>
> A) Do (2) and change the reader writer data structures. This is
>    basically NFC refactoring, so it's relatively independent.

I've the CL up for this in: http://reviews.llvm.org/D10493
Please take a look. Once merged, I'm going to post the CL for (3) i.e.
indexed reader/writer changes for version 3.

Thanks,
-Betul

> B) Implement the indexed changes in (3). These should be independently
>    testable, probably with unit tests.
>
> C) Do (1) in LLVM and (6) in clang. The intrinsic won't do anything yet,
>    but we can write clang tests that check that it shows up in the right
>    places, so it isn't useless.
>
> D) Finish off with (7), (4), and (5). You need the changes in (7) so you
>    can add tests in (5), and the changes in (4) so that (7) actually
>    does anything.
>
> Seven patches, four steps.
>
> There might be one problem with this - there isn't currently a way to
> turn off the instrprof lowering pass, so the tests in (C) will break
> when we do (D). We should fix this so that clang can just test that it
> emits the intrinsics correctly by disabling the lowering.
>
> What do you think?
>
>>
>> Thanks,
>> -Betul
>>
>>> -Betul
>>>
>>>>
>>>> 1. There isn't much point in checking in the addition of the intrinsic
>>>>    separate from the lowering of the intrinsic, since the first patch
>>>>    doesn't actually do anything in that case.
>>>> 2. My comment about precalculating the indices below seems to still
> need
>>>>    discussion.
>>>>
>>>>> Thanks,
>>>>> -Betul
>>>>>
>>>>>> "Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>>>>>>>> +'``llvm.instrprof_instrument_target``' Intrinsic
>>>>>>>>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>>>>> +
>>>>>>>>> +Syntax:
>>>>>>>>> +"""""""
>>>>>>>>> +
>>>>>>>>> +::
>>>>>>>>> +
>>>>>>>>> +      declare void @llvm.instrprof_instrument_target(i8* <name>,
>>>> i64
>>>>>> <hash>,
>>>>>>>>> +                                                     i32
>>>>>> <value_kind>,
>>>>>>>>> +                                                     i64
> <value>,
>>>> i32
>>>>>> <index>)
>>>>>>>>> +Overview:
>>>>>>>>> +"""""""""
>>>>>>>>> +
>>>>>>>>> +The '``llvm.instrprof_instrument_target``' intrinsic can be
>>>>>>>>> emitted
>>>>>> by a
>>>>>>>>> +frontend for use with instrumentation based profiling. This will
>>>>>>>>> be
>>>>>>>>> +lowered by the ``-instrprof`` pass to find out the target
> values,
>>>>>>>>> +instrumented expressions take in a program at runtime.
>>>>>>>>> +
>>>>>>>>> +Arguments:
>>>>>>>>> +""""""""""
>>>>>>>>> +
>>>>>>>>> +The first argument is a pointer to a global variable containing
>>>>>>>>> the
>>>>>>>>> +name of the entity being instrumented. ``name`` should generally
>>>>>>>>> be
>>>>>> the
>>>>>>>>> +(mangled) function name for a set of counters.
>>>>>>>>> +
>>>>>>>>> +The second argument is a hash value that can be used by the
>>>> consumer
>>>>>>>>> +of the profile data to detect changes to the instrumented
> source.
>>>> It
>>>>>>>>> +is an error if ``hash`` differs between two instances of
>>>>>>>>> +``instrprof_instrument_target`` that refer to the same name.
>>>>>>>>> +
>>>>>>>>> +The third argument represents the kind of the value profiling
> that
>>>> is
>>>>>> done.
>>>>>>>>> +The fourth argument is the value of the expression being
> profiled.
>>>>>> The last
>>>>>>>>> +argument is the index of the instrumented expression within
>>>> ``name``.
>>>>>>>>> +It should be >= 0.
>>>>>>>>
>>>>>>>> These arguments need further explanation.
>>>>>>>
>>>>>>> Not addressed in the latest patch.
>>>>>>
>>>>>> Please do explain these.
>>>>>>
>>>>>>>>> +  data_type ReadDataV3(StringRef K, const unsigned char *D,
>>>>>> offset_type N) {
>>>>>>>>>      using namespace support;
>>>>>>>>> -    // We just treat the data as opaque here. It's simpler to
>>>> handle
>>>>>> in
>>>>>>>>> -    // IndexedInstrProfReader.
>>>>>>>>> -    unsigned NumEntries = N / sizeof(uint64_t);
>>>>>>>>> -    DataBuffer.reserve(NumEntries);
>>>>>>>>> -    for (unsigned I = 0; I < NumEntries; ++I)
>>>>>>>>> -      DataBuffer.push_back(endian::readNext<uint64_t, little,
>>>>>> unaligned>(D));
>>>>>>>>> -    return data_type(K, DataBuffer);
>>>>>>>>> +    const unsigned char *Start = D;
>>>>>>>>> +    if (D - Start + sizeof (uint64_t) > N)
>>>>>>>>> +      return data_type();
>>>>>>>>> +    uint64_t DataBufferSize = endian::readNext<uint64_t, little,
>>>>>> unaligned>(D);
>>>>>>>>> +    DataBuffer.clear();
>>>>>>>>> +    DataBuffer.reserve(DataBufferSize);
>>>>>>>>> +    std::vector<uint64_t> CounterBuffer;
>>>>>>>>> +    std::vector<InstrProfValueRecord> ValueBuffer;
>>>>>>>>> +    for (unsigned I = 0; I < DataBufferSize; ++I) {
>>>>>>>>> +      if (D - Start + sizeof (uint64_t) + sizeof(uint32_t) >= N)
>>>>>>>>> +        return data_type();
>>>>>>>>> +      uint64_t Hash = endian::readNext<uint64_t, little,
>>>>>> unaligned>(D);
>>>>>>>>> +      uint32_t CountsSize = endian::readNext<uint32_t, little,
>>>>>> unaligned>(D);
>>>>>>>>> +      if (D - Start + CountsSize * sizeof(uint64_t) >= N)
>>>>>>>>> +        return data_type();
>>>>>>>>> +      CounterBuffer.clear();
>>>>>>>>> +      for (unsigned I = 0; I < CountsSize; ++I)
>>>>>>>>> +        CounterBuffer.push_back(endian::readNext<uint64_t,
> little,
>>>>>> unaligned>(D));
>>>>>>>>> +
>>>>>>>>> +      DataBuffer.push_back(InstrProfRecord(K, Hash,
>>>> CounterBuffer));
>>>>>> +
>>>>>>>>> +      for (uint32_t Kind = instr_value_prof_kind::first;
>>>>>>>>> +           Kind < instr_value_prof_kind::last; ++Kind) {
>>>>>>>>> +        if (D - Start + 2 * sizeof(uint32_t) > N)
>>>>>>>>> +          return data_type();
>>>>>>>>> +        uint32_t ValueKind = endian::readNext<uint32_t, little,
>>>>>> unaligned>(D);
>>>>>>>>> +        if (ValueKind != Kind)
>>>>>>>>> +          return data_type();
>>>>>>>>> +        uint32_t ValuesSize = endian::readNext<uint32_t, little,
>>>>>> unaligned>(D);
>>>>>>>>> +        ValueBuffer.clear();
>>>>>>>>> +        for (unsigned I = 0; I < ValuesSize; ++I) {
>>>>>>>>> +          if (D - Start + sizeof(uint32_t) >= N) return
>>>> data_type();
>>>>>>>>> +          uint32_t NameSize = endian::readNext<uint32_t, little,
>>>>>> unaligned>(D);
>>>>>>>>> +          if (D - Start + NameSize * sizeof(char)>= N)
>>>>>>>>> +            return data_type();
>>>>>>>>> +
>>>>>>>>> +          std::string Name((const char *)D, NameSize);
>>>>>>>>> +          D += NameSize;
>>>>>>>>> +
>>>>>>>>> +          if (D - Start + 2 * sizeof(uint64_t) >= N)
>>>>>>>>> +            return data_type();
>>>>>>>>> +          uint64_t Value = endian::readNext<uint64_t, little,
>>>>>> unaligned>(D);
>>>>>>>>> +          uint64_t NumTaken = endian::readNext<uint64_t, little,
>>>>>> unaligned>(D);
>>>>>>>>> +
>>>>>>>>> +          if (D - Start + sizeof(uint32_t) > N)
>>>>>>>>> +            return data_type();
>>>>>>>>> +          uint32_t CounterIndex = endian::readNext<uint32_t,
>>>> little,
>>>>>> unaligned>(D);
>>>>>>>>> +          ValueBuffer.push_back(
>>>>>>>>> +            InstrProfValueRecord(Name, Value, NumTaken,
>>>>>> CounterIndex));
>>>>>>>>> +        }
>>>>>>>>> +        DataBuffer[DataBuffer.size()-1].Values[Kind] =
>>>>>>>>> ValueBuffer;
>>>>>>>>> +     }
>>>>>>>>
>>>>>>>> I guess this'll be simpler without having to deal with the kinds,
>>>>>>>> but
>>>>>>>> there's also a lot of "D - Start + ..." math in here that I think
>>>> would
>>>>>> be
>>>>>>>> easier to understand with a few more named variables.
>>>>>>>
>>>>>>> Not addressed in the latest patch.
>>>>>>
>>>>>> I guess this is waiting until we figure out how we're going forward
>>>> with
>>>>>> the issue about the kinds, but I think you can make the math clearer
>>>>>> in
>>>>>> the meantime regardless.
>>>>>>
>>>>>>>>> +    }
>>>>>>>>> +    return DataBuffer;
>>>>>>>>>    }
>>>>>>>>>  };
>>>>>>>>> +
>>>>>>>>>  typedef OnDiskIterableChainedHashTable<InstrProfLookupTrait>
>>>>>>>>>      InstrProfReaderIndex;
>>>>>>>>> @@ -267,8 +409,6 @@
>>>>>>>>>    std::unique_ptr<InstrProfReaderIndex> Index;
>>>>>>>>>    /// Iterator over the profile data.
>>>>>>>>>    InstrProfReaderIndex::data_iterator RecordIterator;
>>>>>>>>
>>>>>>>> This name is a bit confusing now - it iterates over arrays of
>>>> records,
>>>>>>>> not individual ones.
>>>>>>>
>>>>>>> Not addressed in the latest patch.
>>>>>>
>>>>>> Why not?
>>>>>>
>>>>>>>>> -  /// Offset into our current data set.
>>>>>>>>> -  size_t CurrentOffset;
>>>>>>>>>    /// The file format version of the profile data.
>>>>>>>>>    uint64_t FormatVersion;
>>>>>>>>>    /// The maximal execution count among all functions.
>>>>>>>>> @@ -278,7 +418,7 @@
>>>>>>>>>    IndexedInstrProfReader &operator=(const IndexedInstrProfReader
>>>>>>>>> &)
>>>> =
>>>>>>>>> delete;
>>>>>>>>>  public:
>>>>>>>>>    IndexedInstrProfReader(std::unique_ptr<MemoryBuffer>
> DataBuffer)
>>>>>>>>> -      : DataBuffer(std::move(DataBuffer)), Index(nullptr),
>>>>>>>>> CurrentOffset(0) {}
>>>>>>>>> +      : DataBuffer(std::move(DataBuffer)), Index(nullptr) {}
>>>>>>>>>    /// Return true if the given buffer is in an indexed instrprof
>>>>>>>>> format.
>>>>>>>>>    static bool hasFormat(const MemoryBuffer &DataBuffer);
>>>>>>>>> @@ -288,9 +428,16 @@
>>>>>>>>>    /// Read a single record.
>>>>>>>>>    std::error_code readNextRecord(InstrProfRecord &Record)
>>>>>>>>> override;
>>>>>>>>> -  /// Fill Counts with the profile data for the given function
>>>> name.
>>>>>>>>> +  /// Return profile data for the given function name and hash
>>>>>>>>>    std::error_code getFunctionCounts(StringRef FuncName, uint64_t
>>>>>> FuncHash,
>>>>>>>>> -                                    std::vector<uint64_t>
>>>>>>>>> &Counts);
>>>>>>>>> +    std::vector<uint64_t> &Counts);
>>>>>>>>
>>>>>>>> It really does fill "Counts". It doesn't return anything, so the
> old
>>>>>> doc
>>>>>>>> comment was better.
>>>>>>>
>>>>>>> Done.
>>>>>>>
>>>>>>>> Also the formatting is funny here and for the function below -
>>>>>>>> clang-format will fix that for you.
>>>>>>>
>>>>>>> Not addressed in the latest patch.
>>>>>>
>>>>>> Why not?
>>>>>>
>>>>>>>>> +  /// Count the number of instrumented value sites for the
>>>> function.
>>>>>>>>> +  void computeNumValueSiteCounts(InstrProfInstrumentTargetInst
>>>> *Ins);
>>>>>> +
>>>>>>>>> +  /// Replace instrprof_instrument_target with a call to runtime
>>>>>> library.
>>>>>>>>> +  void lowerInstrumentTargetInst(InstrProfInstrumentTargetInst
>>>> *Ins);
>>>>>>>>> +
>>>>>>>>>    /// Replace instrprof_increment with an increment of the
>>>>>> appropriate value.
>>>>>>>>>    void lowerIncrement(InstrProfIncrementInst *Inc);
>>>>>>>>> @@ -117,20 +132,39 @@
>>>>>>>>>    bool MadeChange = false;
>>>>>>>>>    this->M = &M;
>>>>>>>>> -  RegionCounters.clear();
>>>>>>>>> +  ProfileDataMap.clear();
>>>>>>>>>    UsedVars.clear();
>>>>>>>>> +  // We did not know how many value sites there would be inside
>>>>>>>>> +  // the instrumented function. This is counting the number of
>>>>>> instrumented
>>>>>>>>> +  // target value sites to enter it as field in the profile data
>>>>>> variable.
>>>>>>>>> +  for (Function &F : M)
>>>>>>>>> +    for (BasicBlock &BB : F)
>>>>>>>>> +      for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>>>> +        if (auto *Ind =
>>>> dyn_cast<InstrProfInstrumentTargetInst>(I++))
>>>>>>>>> +          computeNumValueSiteCounts(Ind);
>>>>>>>>
>>>>>>>> The frontend should pass the number of value sites in the function
>>>>>>>> in
>>>>>>>> the arguments to the intrinsic, like we do for the number of
>>>>>>>> counters
>>>>>> in
>>>>>>>> the increment intrinsic. In addition to being simpler, it's more
>>>> robust
>>>>>>>> against things like dead code elimination which could remove the
>>>>>>>> call
>>>>>> to
>>>>>>>> an intrinsic completely.
>>>>>>>
>>>>>>> Unfortunately, this is not straightforwardly doable. We do not know
>>>> the
>>>>>>> number of target expressions we'd like to profile prior to
> inserting
>>>> the
>>>>>>> intrinsics. This is due to the fact that AST level expressions are
>>>>>>> not
>>>>>> one
>>>>>>> to one mappable to the generated LLVM IR level expressions. In the
>>>> case
>>>>>> of
>>>>>>> indirect calls for instance the number of calls generated per a
>>>>>>> single
>>>>>>> CXXDestructor could be more than one.
>>>>>>>
>>>>>>> Instead the current implementation assigns an index by incrementing
>>>> the
>>>>>>> NumValueSites counter at each point where a target expression is
>>>>>> detected.
>>>>>>> The index value is passed along to the InstrProfiling.cpp as an
>>>> argument
>>>>>>> of the intrinsic. Then NumValueSites is determined by taking the
> max
>>>> of
>>>>>>> all indices recorded for a function + 1.
>>>>>>
>>>>>> Getting the indices by incrementing NumValueSites as you go leads to
> a
>>>>>> few problems - this is actually how I initially implemented the
>>>>>> counter
>>>>>> indices, but changed it after finding that it doesn't actually work.
>>>>>>
>>>>>> The problems occur if a call site is eliminated before we get to the
>>>>>> lowering, which can happen today from certain types of dead code
>>>>>> elimination and may become more common as we make improvements to
> how
>>>>>> lowering and optimizations interact. Numbering in the backend means
>>>> that
>>>>>> the counter indices and even the number of counters may change
>>>> depending
>>>>>> on optimization and essentially disassociates the results from the
> PGO
>>>>>> hash and the frontend locations. This loses any advantages of doing
>>>>>> instrumentation in the frontend, and can easily lead to
> unintentional
>>>>>> pessimization when we try to use the data since it's no longer
> mapped
>>>>>> correctly.
>>>>>>
>>>>>>>>> +
>>>>>>>>>    for (Function &F : M)
>>>>>>>>>      for (BasicBlock &BB : F)
>>>>>>>>>        for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>>>>          if (auto *Inc = dyn_cast<InstrProfIncrementInst>(I++)) {
>>>>>>>>>            lowerIncrement(Inc);
>>>>>>>>>            MadeChange = true;
>>>>>>>>>          }
>>>>>>>>> +
>>>>>>>>> +  for (Function &F : M)
>>>>>>>>> +    for (BasicBlock &BB : F)
>>>>>>>>> +      for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>>>> +        if (auto *Ind =
>>>> dyn_cast<InstrProfInstrumentTargetInst>(I++))
>>>>>> {
>>>>>>>>> +          lowerInstrumentTargetInst(Ind);
>>>>>>>>> +          MadeChange = true;
>>>>>>>>> +        }
>>>>>>>>
>>>>>>>> It seems like we could do both of these in one loop, no?
>>>>>>>
>>>>>>> Not addressed in this patch.
>>>>>>
>>>>>> Why not?
>>>>>>
>>>>
>>>
>>>
>>>
>