[PATCH] D27642: DebugInfo: Added support for Checksum debug info feature (LLVM IR part)

Thu Dec 15 14:24:52 PST 2016

> On 2016-Dec-15, at 11:03, Mehdi Amini <mehdi.amini at apple.com> wrote:
> 
>> 
>> On Dec 13, 2016, at 3:19 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>> 
>>> 
>>> On 2016-Dec-13, at 10:02, Reid Kleckner via Phabricator <reviews at reviews.llvm.org> wrote:
>>> 
>>> rnk added inline comments.
>>> 
>>> 
>>> ================
>>> Comment at: docs/LangRef.rst:4018
>>> +                     data: "000102030405060708090a0b0c0d0e0f")
>>> +
>>> .. _DIBasicType:
>>> ----------------
>>> aprantl wrote:
>>>> mehdi_amini wrote:
>>>>> rnk wrote:
>>>>>> aprantl wrote:
>>>>>>> aaboud wrote:
>>>>>>>> mehdi_amini wrote:
>>>>>>>>> aaboud wrote:
>>>>>>>>>> mehdi_amini wrote:
>>>>>>>>>>> aaboud wrote:
>>>>>>>>>>>> aprantl wrote:
>>>>>>>>>>>>> aaboud wrote:
>>>>>>>>>>>>>> mehdi_amini wrote:
>>>>>>>>>>>>>>> rnk wrote:
>>>>>>>>>>>>>>>> aaboud wrote:
>>>>>>>>>>>>>>>>> mehdi_amini wrote:
>>>>>>>>>>>>>>>>>> What are the valid list of "type"? What type is the "type" field? Why the prefix `ChecksumType_` and not just `MD5`?
>>>>>>>>>>>>>>>>>> Why not a simpler approach where the type is an MDString? (That already how you parse it in DIChecksum::getChecksumType anyway)
>>>>>>>>>>>>>>>>> Right now the valid list of types are:
>>>>>>>>>>>>>>>>> ChecksumType_MD5
>>>>>>>>>>>>>>>>> ChecksumType_SHA1
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I will add this to the document.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Answering your other questions:
>>>>>>>>>>>>>>>>> 1. The current solution does not handle the type as string, but as an enumeration with a known token (see LLexer.cpp)
>>>>>>>>>>>>>>>>> 2. Omitting the prefix ChecksumType_ will not allow us to recognize the token.
>>>>>>>>>>>>>>>>> 3. This is similar to what is done with DIFlag...
>>>>>>>>>>>>>>>>> 4. Allowing the user to write the type as string "MD5" is easier to implement, however, checking invalid types will not be elegant as this solution.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Bottom line, I am ready to go with either solutions.
>>>>>>>>>>>>>>>>> Let's hear what others think.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think we could fold DIChecksum into DIFile. It would look like:
>>>>>>>>>>>>>>>> !DIFile(filename: ..., checksumkind: ChecksumType_MD5, checksum: !"deadbeef")
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This adds a dead integer to DIFile when checksums aren't being used, but it's probably more efficient and less work than having a separate node.
>>>>>>>>>>>>>>> I rather have a much simpler implementation, the validation can be done in the verifier easily and it would handle similarly in-memory validation, bitcode and textual IR (the first two are the most important by the way, textual IR is not what we should optimize for).  
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> With the change that @rnk suggested that would make it quite a smaller patch I believe.
>>>>>>>>>>>>>> Fine with me.
>>>>>>>>>>>>> I'm in favor of rolling it into DIFile. As Paul noted, DWARF 5 also allows for an MD5 checksum on each file entry in the line table section so I'm expecting this to become the normal case eventually.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What's the problem with recognizing MD5 as a context-sensitive token in the parser?
>>>>>>>>>>>> It is not only MD5, right? should the LLexer recognize all possible types of checksum, i.e. MD5, SHA1, etc.?
>>>>>>>>>>>> Or like it is implemented in this patch, it only needs to recognize the checksum type prefix (whatever it will be).
>>>>>>>>>>>> 
>>>>>>>>>>>> So, how do you prefer to implement this?
>>>>>>>>>>>> 1. Prefix - (if yes, what should be the prefix?)
>>>>>>>>>>>> 2. Recognize all types
>>>>>>>>>>>> 3. Have a wild string
>>>>>>>>>>> Alternative representation:
>>>>>>>>>>> 
>>>>>>>>>>> `!DIFile(filename: ..., checksum: !"MD5:deadbeef")`
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> Once again, I do not mind how we implement it, I just need to hear an agreement from most of the participants in this review.
>>>>>>>>>> however, I think that with this approach it might complicate the code in CodeViewDebug, as it needs to parse the checksum string to get the type.
>>>>>>>>>> 
>>>>>>>>>> I would like to hear Reid opinion on this suggestion before I go and implement it.
>>>>>>>>> Is there already a patch up that shows the `CodeViewDebug`, so that I can see how it gets more complicated there?
>>>>>>>>> I'd expect it to be a `StringSwitch` on the prefix in the string instead of a switch on the enum value and that's it.
>>>>>>>> Sorry, I do not have this patch, and I am not planing to.
>>>>>>>> I hope that Reid can implement this in CodeViewDebug once I commit this patch.
>>>>>>>> This is why I would like to hear his agreement to your proposal before I go this direction.
>>>>>>> I was skeptical at first about using a string to represent the checksum, but I came around to agree that this format is likely the most straight-forward way to add this to the IR assembler language.
>>>>>>> 
>>>>>>> `!DIFile(filename: ..., checksum: !"MD5:deadbeef") // with checksum being an optional nullable field.`
>>>>>>> 
>>>>>>> We probably still want a DIChecksum wrapper around the string that converts the hex string into an llvm::MD5, etc, and/or stores it as a byte string for a more efficient representation.
>>>>>> I don't think we want a DIChecksum node. If we're concerned about efficiency, the real gain is eliminating the extra allocation.
>>>>>> 
>>>>>> I'd rather separate the checksum kind out and store it in DIFile::SubclassData32. It's consistent with what we do for DW_VIRTUALITY_*. It only adds minor complexity to the AsmParser, which is not on any critical path, and avoids a StringSwitch in actual code that will run on every compilation.
>>>>> Are you really motivated by the efficiency of a single StringSwitch per DIFile at Codegen time?
>>>>> 
>>>>> The complexity is not much in the AsmParser than in the bitcode (backward compatibility, etc.), the explicit "None" instead of just ignoring the field (null), and the validation in the verifier.
>>>> @rnk: To make sure I understand where you're going with your comment: You said you would like to separate out the storage of the checksum kind, but how do you propose to store the checksum data?
>>> @aprantl I'd store it in an MDString. Different checksums have different lengths, so we'll need a separate allocation for that.
>>> 
>>> @mehdi_amini What I really care about is that the code to generate this stuff and move it through LLVM is simple. Remember when Duncan encoded as much debug info as strings as possible before moving to the current DI* class hierarchy? Doing stringly-typed things like !"MD5:deadbeef" feels like that, and given a choice, I'd rather avoid that.
>>> 
>>> Anyway, @aaboud asked for my opinion, so I figured I should spell it out. If you both feel that !"MD5:deadbeef" is better, I don't feel strongly about this and can definitely live with the checksum kind in the string.
>> 
>> Separating the 'kind' of checksum from the 'content' makes sense to me.  And my intuition is to use an enum (in C++ and serialized IR) for the kind, unless there's a concrete reason not to.  Then llvm-as/opt/llc can trivially diagnose errors in the textual IR.  I'd make the 0-enum value "None" or something, skip both fields when writing textual IR if "None", and add a verifier check that "None" implies a checksum of 0.
>> 
>> Regarding my work two years ago, my main goals were to (1) improve performance (compile-time CPU/memory with LTO) and (2) add type safety the debug info IR.  I started with (1), and implemented one or two string-ifying proposals that predated me... but the lack of type safety made it really hard to get things done robustly.  I switched directions and incorporated (2).
> 
> IMO it is a tradeoff between the benefit it provides and the cost of the added boilerplate and complexity everywhere (BitcodeReader/Writer, LLParser, Verifier, etc.). 
> While it makes sense to me to have the “core" of the debug info metadata graph being well structured and typed, I don’t the benefit for leafs such as here (at least far from outweighing the cost). 
> But I’m alone on this, so feel free to proceed!

I think the right fix there is to design away some of the complexity.  Something David Blaikie suggested when I was committing (but I considered out of scope) was building a tablegen description, and moving as much as possible to that.  It could easily handle:
- textual IR reading/writing
- C++ classes
and less trivially could handle:
- bitcode writing
and maybe could even help a bit with:
- bitcode reading
(but the upgrade paths at least would need hand-coding I assume).  With that stuff in place, leaves like this would be less of a burden.