[PATCH] LLVM changes for indirect call target profiling support
Betul Buyukkurt
betulb at codeaurora.org
Mon Jun 15 22:06:45 PDT 2015
> "Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>>> "Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>>>> Hi Justin,
>>>>>
>>>>> Since the value profiling changes are very large in size and complexity
>>>>> and required multiple design iterations (i.e. IC-profiling -> value
>>>>> profiling -> raw file size reduction), I did not reply to much of the
>>>>> minor commentary, since the design changes usually resulted in those
>>>>> fixes being changed again anyway. So I saved addressing the typo, naming,
>>>>> wording, and doc-extension related comments for merge time. Today I've
>>>>> posted the first of my merge-intended CLs for review, where all of your
>>>>> concerns are addressed.
>>>>
>>>> I'm not sure why we need yet another thread for that. In any case, two
>>>> things:
>>>
>>> I thought splitting the LLVM change into smaller CL's was what you
>>> proposed when you suggested (quoting from your email):
>>>
>>> "Finally, the LLVM patch is quite large. We can probably make it a bit
>>> easier to stage and review by splitting out a couple of things and
>>> doing
>>> them first:
>>>
>>> 1. Changing the __llvm_profile_data variables so they're no longer
>>> const.
>>>
>>> 2. Adding the InstrProfRecord type and updating the unittests. Also,
>>> this type should just go in the InstrProf.h header, no need for its
>>> own.
>>>
>>> 3. Arguably, we could update the raw format and implement the
>>> RawInstrProfReader changes (but ignoring the new data) first. If you
>>> think this is worthwhile go ahead, but if it'll lead to too much
>>> extra busywork it probably isn't worth it.
>>> "
>>>
>>> So I came up with another plan which I posted later in the same thread.
>>> Do you have any objections to the newer plan?
>
> Splitting up the patches I approve of - it just seems like there are ten
> different email threads for the 3 or 4 patches we've already been
> discussing ;) Sorry if I sound like I'm contradicting myself.
>
>> The list below is what I had proposed as the merge plan. I'm eager to iron
>> out all the problems and merge these changes soon. I think merging the
>> intrinsics alone is not a problem, because by the third patch the lowering
>> and the tests will all come together. The goal is to share manageable-size
>> CLs for final code review.
>>
>> 1. Intrinsic instruction for value profiling.
>> 2. Making changes to the data structures used by the readers/writers, and
>>    updating of the readers to preserve the current formats.
>> 3. Indexed reader/writer changes for version 3, i.e. the one that's going to
>>    be used by the value profiling as well.
>> 4. Lowering of profiling intrinsic and generation of Data w/ value profiling
>>    entries. Raw reader updates for reading version 2 data.
>
> There isn't much value in doing (1) on its own, so I'd probably do it
> with (4) if we're doing things in this order. See below though.
>
>> 5. compiler-rt changes for value profiling.
>>
>> Changes (4) and (5) should go in together so as not to break the builds.
>>
>> 6. Clang changes for generating instrumentation and attaching of new profile
>>    metadata.
>>
>> 7. Clang flag controls for value profiling types (enabling/disabling
>> instrumentation/pgo use).
>
> Note that (5) is hard to test without the changes in (7), due to how the
> tests in compiler-rt are generally integration-style "run the whole
> compiler" type tests today. As such, maybe an order like this works
> (I'll use letters so I can refer to your numbering from above):
>
> A) Do (2) and change the reader writer data structures. This is
> basically NFC refactoring, so it's relatively independent.
>
> B) Implement the indexed changes in (3). These should be independently
> testable, probably with unit tests.
>
> C) Do (1) in LLVM and (6) in clang. The intrinsic won't do anything yet,
> but we can write clang tests that check that it shows up in the right
> places, so it isn't useless.
>
> D) Finish off with (7), (4), and (5). You need the changes in (7) so you
> can add tests in (5), and the changes in (4) so that (7) actually
> does anything.
>
> Seven patches, four steps.
>
> There might be one problem with this - there isn't currently a way to
> turn off the instrprof lowering pass, so the tests in (C) will break
> when we do (D). We should fix this so that clang can just test that it
> emits the intrinsics correctly by disabling the lowering.
>
> What do you think?
i) Let's resolve the value_kind problem. That'd cause a major change in
the overall design and run-time.
ii) Let's still review (1). I think it's good to go.
-Betul
>>
>> Thanks,
>> -Betul
>>
>>> -Betul
>>>
>>>>
>>>> 1. There isn't much point in checking in the addition of the intrinsic
>>>> separate from the lowering of the intrinsic, since the first patch
>>>> doesn't actually do anything in that case.
>>>> 2. My comment about precalculating the indices below seems to still need
>>>>    discussion.
>>>>
>>>>> Thanks,
>>>>> -Betul
>>>>>
>>>>>> "Betul Buyukkurt" <betulb at codeaurora.org> writes:
>>>>>>>>> +'``llvm.instrprof_instrument_target``' Intrinsic
>>>>>>>>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>>>>> +
>>>>>>>>> +Syntax:
>>>>>>>>> +"""""""
>>>>>>>>> +
>>>>>>>>> +::
>>>>>>>>> +
>>>>>>>>> +      declare void @llvm.instrprof_instrument_target(i8* <name>, i64 <hash>,
>>>>>>>>> +                                                     i32 <value_kind>,
>>>>>>>>> +                                                     i64 <value>, i32 <index>)
>>>>>>>>> +
>>>>>>>>> +Overview:
>>>>>>>>> +"""""""""
>>>>>>>>> +
>>>>>>>>> +The '``llvm.instrprof_instrument_target``' intrinsic can be emitted by a
>>>>>>>>> +frontend for use with instrumentation-based profiling. This will be
>>>>>>>>> +lowered by the ``-instrprof`` pass to find out the target values that
>>>>>>>>> +instrumented expressions take in a program at runtime.
>>>>>>>>> +
>>>>>>>>> +Arguments:
>>>>>>>>> +""""""""""
>>>>>>>>> +
>>>>>>>>> +The first argument is a pointer to a global variable containing the
>>>>>>>>> +name of the entity being instrumented. ``name`` should generally be the
>>>>>>>>> +(mangled) function name for a set of counters.
>>>>>>>>> +
>>>>>>>>> +The second argument is a hash value that can be used by the consumer
>>>>>>>>> +of the profile data to detect changes to the instrumented source. It
>>>>>>>>> +is an error if ``hash`` differs between two instances of
>>>>>>>>> +``instrprof_instrument_target`` that refer to the same name.
>>>>>>>>> +
>>>>>>>>> +The third argument represents the kind of value profiling that is done.
>>>>>>>>> +The fourth argument is the value of the expression being profiled. The
>>>>>>>>> +last argument is the index of the instrumented expression within
>>>>>>>>> +``name``. It should be >= 0.
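As an illustration only (the name, hash, kind encoding and index below are
placeholders, not values mandated by the patch), a single instrumented
indirect call site might be emitted roughly like this:

      ; %name points at the global holding the enclosing function's (mangled)
      ; name, and %callee is the pointer about to be called indirectly.
      %target = ptrtoint void ()* %callee to i64
      call void @llvm.instrprof_instrument_target(i8* %name, i64 1234,
                                                  i32 0, i64 %target, i32 0)

Here the ``i64`` operand carries the runtime value being profiled (the callee
address), and the trailing ``i32`` identifies which instrumented site within
``name`` the value belongs to.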
>>>>>>>>
>>>>>>>> These arguments need further explanation.
>>>>>>>
>>>>>>> Not addressed in the latest patch.
>>>>>>
>>>>>> Please do explain these.
>>>>>>
>>>>>>>>> +  data_type ReadDataV3(StringRef K, const unsigned char *D, offset_type N) {
>>>>>>>>>      using namespace support;
>>>>>>>>> -    // We just treat the data as opaque here. It's simpler to handle in
>>>>>>>>> -    // IndexedInstrProfReader.
>>>>>>>>> -    unsigned NumEntries = N / sizeof(uint64_t);
>>>>>>>>> -    DataBuffer.reserve(NumEntries);
>>>>>>>>> -    for (unsigned I = 0; I < NumEntries; ++I)
>>>>>>>>> -      DataBuffer.push_back(endian::readNext<uint64_t, little, unaligned>(D));
>>>>>>>>> -    return data_type(K, DataBuffer);
>>>>>>>>> +    const unsigned char *Start = D;
>>>>>>>>> +    if (D - Start + sizeof(uint64_t) > N)
>>>>>>>>> +      return data_type();
>>>>>>>>> +    uint64_t DataBufferSize = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>>>> +    DataBuffer.clear();
>>>>>>>>> +    DataBuffer.reserve(DataBufferSize);
>>>>>>>>> +    std::vector<uint64_t> CounterBuffer;
>>>>>>>>> +    std::vector<InstrProfValueRecord> ValueBuffer;
>>>>>>>>> +    for (unsigned I = 0; I < DataBufferSize; ++I) {
>>>>>>>>> +      if (D - Start + sizeof(uint64_t) + sizeof(uint32_t) >= N)
>>>>>>>>> +        return data_type();
>>>>>>>>> +      uint64_t Hash = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>>>> +      uint32_t CountsSize = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>>>> +      if (D - Start + CountsSize * sizeof(uint64_t) >= N)
>>>>>>>>> +        return data_type();
>>>>>>>>> +      CounterBuffer.clear();
>>>>>>>>> +      for (unsigned I = 0; I < CountsSize; ++I)
>>>>>>>>> +        CounterBuffer.push_back(endian::readNext<uint64_t, little, unaligned>(D));
>>>>>>>>> +
>>>>>>>>> +      DataBuffer.push_back(InstrProfRecord(K, Hash, CounterBuffer));
>>>>>>>>> +
>>>>>>>>> +      for (uint32_t Kind = instr_value_prof_kind::first;
>>>>>>>>> +           Kind < instr_value_prof_kind::last; ++Kind) {
>>>>>>>>> +        if (D - Start + 2 * sizeof(uint32_t) > N)
>>>>>>>>> +          return data_type();
>>>>>>>>> +        uint32_t ValueKind = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>>>> +        if (ValueKind != Kind)
>>>>>>>>> +          return data_type();
>>>>>>>>> +        uint32_t ValuesSize = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>>>> +        ValueBuffer.clear();
>>>>>>>>> +        for (unsigned I = 0; I < ValuesSize; ++I) {
>>>>>>>>> +          if (D - Start + sizeof(uint32_t) >= N) return data_type();
>>>>>>>>> +          uint32_t NameSize = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>>>> +          if (D - Start + NameSize * sizeof(char) >= N)
>>>>>>>>> +            return data_type();
>>>>>>>>> +
>>>>>>>>> +          std::string Name((const char *)D, NameSize);
>>>>>>>>> +          D += NameSize;
>>>>>>>>> +
>>>>>>>>> +          if (D - Start + 2 * sizeof(uint64_t) >= N)
>>>>>>>>> +            return data_type();
>>>>>>>>> +          uint64_t Value = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>>>> +          uint64_t NumTaken = endian::readNext<uint64_t, little, unaligned>(D);
>>>>>>>>> +
>>>>>>>>> +          if (D - Start + sizeof(uint32_t) > N)
>>>>>>>>> +            return data_type();
>>>>>>>>> +          uint32_t CounterIndex = endian::readNext<uint32_t, little, unaligned>(D);
>>>>>>>>> +          ValueBuffer.push_back(
>>>>>>>>> +              InstrProfValueRecord(Name, Value, NumTaken, CounterIndex));
>>>>>>>>> +        }
>>>>>>>>> +        DataBuffer[DataBuffer.size()-1].Values[Kind] = ValueBuffer;
>>>>>>>>> +      }
>>>>>>>>
>>>>>>>> I guess this'll be simpler without having to deal with the kinds, but
>>>>>>>> there's also a lot of "D - Start + ..." math in here that I think would
>>>>>>>> be easier to understand with a few more named variables.
>>>>>>>
>>>>>>> Not addressed in the latest patch.
>>>>>>
>>>>>> I guess this is waiting until we figure out how we're going forward with
>>>>>> the issue about the kinds, but I think you can make the math clearer in
>>>>>> the meantime regardless.
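A minimal sketch of the kind of cleanup being suggested (the helper and
variable names are illustrative, not from the patch):

      // Track the bytes remaining once, instead of repeating "D - Start + ..."
      const unsigned char *Start = D;
      auto BytesLeft = [&] { return N - static_cast<offset_type>(D - Start); };

      if (BytesLeft() < sizeof(uint64_t))
        return data_type();
      uint64_t DataBufferSize = endian::readNext<uint64_t, little, unaligned>(D);

      const offset_type RecordHeaderSize = sizeof(uint64_t) + sizeof(uint32_t);
      if (BytesLeft() < RecordHeaderSize)
        return data_type();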
>>>>>>
>>>>>>>>> +    }
>>>>>>>>> +    return DataBuffer;
>>>>>>>>>    }
>>>>>>>>>  };
>>>>>>>>> +
>>>>>>>>>  typedef OnDiskIterableChainedHashTable<InstrProfLookupTrait>
>>>>>>>>>      InstrProfReaderIndex;
>>>>>>>>> @@ -267,8 +409,6 @@
>>>>>>>>> std::unique_ptr<InstrProfReaderIndex> Index;
>>>>>>>>> /// Iterator over the profile data.
>>>>>>>>> InstrProfReaderIndex::data_iterator RecordIterator;
>>>>>>>>
>>>>>>>> This name is a bit confusing now - it iterates over arrays of records,
>>>>>>>> not individual ones.
>>>>>>>
>>>>>>> Not addressed in the latest patch.
>>>>>>
>>>>>> Why not?
>>>>>>
>>>>>>>>> - /// Offset into our current data set.
>>>>>>>>> - size_t CurrentOffset;
>>>>>>>>> /// The file format version of the profile data.
>>>>>>>>> uint64_t FormatVersion;
>>>>>>>>> /// The maximal execution count among all functions.
>>>>>>>>> @@ -278,7 +418,7 @@
>>>>>>>>>    IndexedInstrProfReader &operator=(const IndexedInstrProfReader &) = delete;
>>>>>>>>>  public:
>>>>>>>>>    IndexedInstrProfReader(std::unique_ptr<MemoryBuffer> DataBuffer)
>>>>>>>>> -      : DataBuffer(std::move(DataBuffer)), Index(nullptr), CurrentOffset(0) {}
>>>>>>>>> +      : DataBuffer(std::move(DataBuffer)), Index(nullptr) {}
>>>>>>>>>    /// Return true if the given buffer is in an indexed instrprof format.
>>>>>>>>>    static bool hasFormat(const MemoryBuffer &DataBuffer);
>>>>>>>>> @@ -288,9 +428,16 @@
>>>>>>>>>    /// Read a single record.
>>>>>>>>>    std::error_code readNextRecord(InstrProfRecord &Record) override;
>>>>>>>>> -  /// Fill Counts with the profile data for the given function name.
>>>>>>>>> +  /// Return profile data for the given function name and hash
>>>>>>>>>    std::error_code getFunctionCounts(StringRef FuncName, uint64_t FuncHash,
>>>>>>>>> -                                    std::vector<uint64_t> &Counts);
>>>>>>>>> +                              std::vector<uint64_t> &Counts);
>>>>>>>>
>>>>>>>> It really does fill "Counts". It doesn't return anything, so the old
>>>>>>>> doc comment was better.
>>>>>>>
>>>>>>> Done.
>>>>>>>
>>>>>>>> Also the formatting is funny here and for the function below -
>>>>>>>> clang-format will fix that for you.
>>>>>>>
>>>>>>> Not addressed in the latest patch.
>>>>>>
>>>>>> Why not?
>>>>>>
>>>>>>>>> +  /// Count the number of instrumented value sites for the function.
>>>>>>>>> +  void computeNumValueSiteCounts(InstrProfInstrumentTargetInst *Ins);
>>>>>>>>> +
>>>>>>>>> +  /// Replace instrprof_instrument_target with a call to runtime library.
>>>>>>>>> +  void lowerInstrumentTargetInst(InstrProfInstrumentTargetInst *Ins);
>>>>>>>>> +
>>>>>>>>>    /// Replace instrprof_increment with an increment of the appropriate value.
>>>>>>>>>    void lowerIncrement(InstrProfIncrementInst *Inc);
>>>>>>>>> @@ -117,20 +132,39 @@
>>>>>>>>>    bool MadeChange = false;
>>>>>>>>>    this->M = &M;
>>>>>>>>> -  RegionCounters.clear();
>>>>>>>>> +  ProfileDataMap.clear();
>>>>>>>>>    UsedVars.clear();
>>>>>>>>> +  // We did not know how many value sites there would be inside
>>>>>>>>> +  // the instrumented function. This is counting the number of instrumented
>>>>>>>>> +  // target value sites to enter it as field in the profile data variable.
>>>>>>>>> +  for (Function &F : M)
>>>>>>>>> +    for (BasicBlock &BB : F)
>>>>>>>>> +      for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>>>> +        if (auto *Ind = dyn_cast<InstrProfInstrumentTargetInst>(I++))
>>>>>>>>> +          computeNumValueSiteCounts(Ind);
>>>>>>>>
>>>>>>>> The frontend should pass the number of value sites in the function in
>>>>>>>> the arguments to the intrinsic, like we do for the number of counters
>>>>>>>> in the increment intrinsic. In addition to being simpler, it's more
>>>>>>>> robust against things like dead code elimination which could remove
>>>>>>>> the call to an intrinsic completely.
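For reference, the existing increment intrinsic already carries such a count
as its third operand:

      declare void @llvm.instrprof_increment(i8* <name>, i64 <hash>,
                                             i32 <num-counters>, i32 <index>)

so the suggestion amounts to giving the new intrinsic an analogous
total-number-of-value-sites operand; that extra operand is only a sketch of
the idea, not something the posted patch contains.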
>>>>>>>
>>>>>>> Unfortunately, this is not straightforwardly doable. We do not know the
>>>>>>> number of target expressions we'd like to profile prior to inserting
>>>>>>> the intrinsics. This is because AST-level expressions do not map one to
>>>>>>> one onto the generated LLVM IR level expressions. In the case of
>>>>>>> indirect calls, for instance, the number of calls generated for a
>>>>>>> single CXXDestructor could be more than one.
>>>>>>>
>>>>>>> Instead, the current implementation assigns an index by incrementing
>>>>>>> the NumValueSites counter at each point where a target expression is
>>>>>>> detected. The index value is passed along to InstrProfiling.cpp as an
>>>>>>> argument of the intrinsic. NumValueSites is then determined by taking
>>>>>>> the max of all indices recorded for a function + 1.
>>>>>>
>>>>>> Getting the indices by incrementing NumValueSites as you go leads to a
>>>>>> few problems - this is actually how I initially implemented the counter
>>>>>> indices, but changed it after finding that it doesn't actually work.
>>>>>>
>>>>>> The problems occur if a call site is eliminated before we get to the
>>>>>> lowering, which can happen today from certain types of dead code
>>>>>> elimination and may become more common as we make improvements to how
>>>>>> lowering and optimizations interact. Numbering in the backend means that
>>>>>> the counter indices and even the number of counters may change depending
>>>>>> on optimization and essentially disassociates the results from the PGO
>>>>>> hash and the frontend locations. This loses any advantages of doing
>>>>>> instrumentation in the frontend, and can easily lead to unintentional
>>>>>> pessimization when we try to use the data since it's no longer mapped
>>>>>> correctly.
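To make that hazard concrete, suppose the frontend emitted three value sites
in a function and the middle one is later removed as dead code before lowering
(operands abbreviated, indices as the frontend assigned them):

      call void @llvm.instrprof_instrument_target(i8* %name, i64 %hash, i32 0, i64 %t0, i32 0)
      call void @llvm.instrprof_instrument_target(i8* %name, i64 %hash, i32 0, i64 %t1, i32 1) ; removed
      call void @llvm.instrprof_instrument_target(i8* %name, i64 %hash, i32 0, i64 %t2, i32 2)

If the indices were instead assigned in the backend after that removal, the
surviving sites would be renumbered 0 and 1 and NumValueSites would become 2,
so the collected data would no longer line up with the site numbering the
frontend and the PGO hash describe.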
>>>>>>
>>>>>>>>> +
>>>>>>>>>    for (Function &F : M)
>>>>>>>>>      for (BasicBlock &BB : F)
>>>>>>>>>        for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>>>>          if (auto *Inc = dyn_cast<InstrProfIncrementInst>(I++)) {
>>>>>>>>>            lowerIncrement(Inc);
>>>>>>>>>            MadeChange = true;
>>>>>>>>>          }
>>>>>>>>> +
>>>>>>>>> +  for (Function &F : M)
>>>>>>>>> +    for (BasicBlock &BB : F)
>>>>>>>>> +      for (auto I = BB.begin(), E = BB.end(); I != E;)
>>>>>>>>> +        if (auto *Ind = dyn_cast<InstrProfInstrumentTargetInst>(I++)) {
>>>>>>>>> +          lowerInstrumentTargetInst(Ind);
>>>>>>>>> +          MadeChange = true;
>>>>>>>>> +        }
>>>>>>>>
>>>>>>>> It seems like we could do both of these in one loop, no?
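For what it's worth, a sketch of that combined loop, reusing the two lowering
helpers from the patch:

      for (Function &F : M)
        for (BasicBlock &BB : F)
          for (auto I = BB.begin(), E = BB.end(); I != E;) {
            auto *Instr = &*I++;
            if (auto *Inc = dyn_cast<InstrProfIncrementInst>(Instr)) {
              lowerIncrement(Inc);
              MadeChange = true;
            } else if (auto *Ind = dyn_cast<InstrProfInstrumentTargetInst>(Instr)) {
              lowerInstrumentTargetInst(Ind);
              MadeChange = true;
            }
          }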
>>>>>>>
>>>>>>> Not addressed in this patch.
>>>>>>
>>>>>> Why not?
>>>>>>
>>>>
>>>
>>>
>>>
>