[PATCH] Add warning capabilities in LLVM (backend part)

Sat Nov 16 09:05:26 PST 2013

----- Original Message -----
> From: "Quentin Colombet" <qcolombet at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: reviews+D2190+public+af649f7838db3d64 at llvm-reviews.chandlerc.com, "llvm commits" <llvm-commits at cs.uiuc.edu>,
> "David Blaikie" <dblaikie at gmail.com>, "Eric Christopher" <echristo at gmail.com>, "renato golin"
> <renato.golin at linaro.org>, "Chandler Carruth" <chandlerc at gmail.com>
> Sent: Friday, November 15, 2013 7:51:46 PM
> Subject: Re: [PATCH] Add warning capabilities in LLVM (backend part)
> 
> Hi Hal,
> 
> Thanks for the feedbacks.
> 
> I have replied within your email, but here is the overall approach I
> am tempted to pursue after looking at all your comments.
> 
> A quick reminder first, as discussed in the RFC, the idea was that
> "StringData+Kind" can be used to format something nice in clang, but
> that "msg" fully covers it for clients that don't know the Kind
> enum.
> 
> I would like to keep something similar but, I think it would make
> sense to provide something more structured for StringData,
> especially for statically known case.
> 
> Basically, I would go for a DiagnosticInfo class, whose subclasses
> would provide the expected fields easily to avoid boxing/unboxing
> overhead. The based class would provide one virtual method: toStr(),
> which would implement the default message in each subclass (possibly
> an argument of the subclass constructor).
> 
> Then, based on the Kind+Severity the front-end can statically cast
> this DiagnosticInfoClass to the right type.
> In particular, in the case of a InlineAsmDiagnosticInfo, this class
> would hold a message and a location and the front-end would do what
> it was doing with the InlineAsmDiag.

I think that you're on the right track with this, and I feel that this is essentially the right solution. You should set this up just like other class hierarchies in LLVM, so that we can use the cast/isa/dynamic_cast templates with them. A good (simple) example of how to do this in SCEV: llvm/Analysis/ScalarEvolution.h defines SCEV, and then llvm/Analysis/ScalarEvolutionExpressions.h defines SCEV's subclasses. There is a enum that defines the type IDs, which is stored in the base class, and then each subclass overrides classof as appropriate.

Doing it this way, the 'Kind' gets replaced by the type ID, and so you don't explicitly need it in the interface. You might also want to make Severity a member of this diagnostic class as well. At this point, the interface is very simple, is type safe, and does not involve the frontend parsing of string data (which is something I'd really like to avoid).

Regarding the messages, let me explain what I'd like to avoid:

 1. Defining a new string-list serialization format: First, the first thing we'd end up doing was defining a function in the frontend to parse this out into an array (or hash table or whatever), so we might as well pass it that way. Second, the parsing would end up being non-trivial: We'd need a way to escape the separator, and then we'd need to define a way of handling binary data (maybe hex encode it?). All of this gets complicated, and so we'd need more utility functions on the encoding side as well. Not to mention that, with this design, the frontend is always limited to exactly the information the backend decides to serialize (and I'm not sure that's the best setup). In any case, I think that with your proposal above, this can be fortunately irrelevant: all of this data can be passes as class member of the diagnostic subclass, and we don't need this serialization at all.

 2. Practical duplication of similar code in the frontend and backend. Thinking through how this would work in practice, for every new diagnostic, I'd first add code to the backend to produce the default message and store that same data somehow:

    std::string DataStorage;
    raw_string_ostream Data(DataStorage);
    Data << Something << "#" << SomethingElse << "#" << hexencode(AThirdThing);

    std::string MsgStorage("This is not quite what you has in mind (");
    raw_string_ostream Msg(MsgStorage);
    Msg << Something << " which calls " << SomethingElse << " with a buffer of size " <<
           AThirdThing.size() << ") in " << Fn.getName();
    F->getContext().report(..., Data.str(), LLVMContext::RS_Warning, Msg.str());

Then I'd need to go into Clang and add code somewhere that looks like this:

  SmallVector<StringRef> DataStrings;
  unparse_diag_data(Data, DataStrings);

  S.Diag(...) << "This is not quite what you has in mind (" << DataStrings[0] <<
                 " which calls " << demangle(DataStrings[1]) << " with a buffer of size " <<
                   DataStrings[2].size() << ")";

[It would not quite be like this, the fixed string parts of the message would be assigned a message id with %1, %2, etc. inside, but you get the idea.]

For a diagnostic generated from multiple places in the backend, we'd end up defining utility functions to generate the text and the data string. We'd still need to keep the frontend and the backend in sync, which would be a pain.

So what I'd like is for the diagnostic class itself to have a virtual print() method that uses a printer class to do something. This method takes as an argument a printer class (which has virtual overloads of operator << for the basic types and for Value *, SCEV *, MachineInst *, etc.). There is a default printer class (which just uses raw_string_ostream), but the frontend can provide one that does other things. The idea is that, in Clang, we can construct one of these printer implementations that forwards things to Clang's DiagnosticBuilder (in include/clang/Basic/Diagnostic.h). For Clang, for example, the overload that prints a Value* will check to see if it is a Function*, for example, and then print the demangled function name, or if it is an Instruction *, might do something special if it has srcloc metadata, etc. Can the frontend check for a particular diagnostic class, dyn_cast it, and then directly use its data? Sure, but I doubt that will be the common case.

I think that, if we do it this way, we can have an easy-to-maintain system that lets us add backend diagnostics without touching the frontend each time, and, at the same time, let's us produce backend diagnostics that *look like* frontend diagnostics (which as many of the niceties of frontend diagnostics as possible).

Thanks again,
Hal

> 
> In the “Other” case, the subclass can return an array of Value* and a
> formatted string. The default implementation of
> OtherDiagnosticInfo::toString would walk over this array and dump
> the value at the right place.
> 
> However, I do not want to push the array of Value* and the formatted
> string into the generic class. My goal is to keep the
> formatting/localization burden for known cases in the front-end,
> like we already do.
> 
> ** New Proposal **
> 
> - Define a class DiagnosticInfo that has one virtual method toStr().
> - Have a subclass of DiagnosticInfo per Kind+Severity.
> - Have the front-end knows the formatting of each Kind+Severity and
> use the right fields of the DiagnosticInfo subclass accordingly.
> - For “Other" cases, the front-end either use
> OtherDiagnosticInfo::toStr or use the pair ArrayRef<Value*>,
> StringRef Format.
> 
> Finally the reporting API would be:
> void LLVMContext::report(DiagnosticInfo *, Kind, Severity)
> The default message would be given by toStr().
> 
> Thoughts?
> 
> On Nov 15, 2013, at 12:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> > ----- Original Message -----
> >> From: "Quentin Colombet" <qcolombet at apple.com>
> >> To: dblaikie at gmail.com, echristo at gmail.com, "renato golin"
> >> <renato.golin at linaro.org>, chandlerc at gmail.com,
> >> qcolombet at apple.com
> >> Cc: llvm-commits at cs.uiuc.edu, hfinkel at anl.gov
> >> Sent: Friday, November 15, 2013 1:18:22 PM
> >> Subject: Re: [PATCH] Add warning capabilities in LLVM (backend
> >> part)
> >> 
> >> 
> >>  Hi Hal,
> >> 
> >>  Thanks for the feedback.
> >>  I suggest that you read the RFC regarding the choices for the
> >>  interface.
> >>  What you are proposing has IIRC already been discussed.
> > 
> > In part, yes, but not exactly, and I think it makes sense to have a
> > more-targeted discussion now that we have an implementation to
> > play with.
> Fair enough :).
> 
> > 
> >> 
> >>  Let me try to summarize why we did not choose what you are
> >>  proposing:
> >>  1. srcloc metadata.
> >>  We cannot anticipate what would be reported and in particular, a
> >>  source location may not be adapted. Assume you want to report why
> >>  the vectorizer failed to optimize a given loop. Having only one
> >>  source location will not be enough. So why should we be bound to
> >>  one source location?
> > 
> > No, my point is:
> > 
> > 1. We need the srcloc data to deal with inline asm messages (this
> > is an existing requirement).
> I know, but it does not mean we need an explicit argument for that,
> al least not for the front-end call-back.
> Indeed, we could provide in the “reporting API’ an additional
> prototype with the srcloc (unsigned, not MDNode*) or the
> Instruction, like for LLVMContext::emitError, that maps to the
> proposed LLVMContext::report encoding the srcloc in the data
> argument.
> 
> That way, we are not bound to the one srcloc requirement, and can
> still support several lines requirement.
> 
> > 
> > 2. Because the mechanism is completely general (although currently
> > used only for inline asm), it makes sense for it to be a
> > first-class part of the interface.
> The mechanism is not supposed to be used by inline asm only. We use
> it for the internal stack-size warning, which does not report a
> srcloc (although this may be a mistake, I am the one to blame at the
> moment :)) and there are several spots where we call
> LLVMContext::emitError without a srcloc.
> Therefore, I am not sure the srcloc must be part of the interface or
> of the generic DiagnosticInfo I am proposing.
> 
> > Might we want to extend this to deal with more general situation
> > later? Maybe. But we need at least one srcloc for inline asm
> > messages, and we can discuss how to deal with multiple-location
> > messages as we have concrete use cases (although if we have a
> > value array, we can also put them in there).
> I am a bit reluctant to go in that direction: i.e., specializing the
> API for InlineAsm, then thinking generalization if needed.
> In particular, I do not like the “if needed” part.
> 
> To me, we may already not be aware of the generalization needs, I am
> thinking plug-ins. That said, I prefer to have the general API, even
> if it means spending more, useful :), time discussing, than
> introducing a new API that we will have to support for an unknown
> amount of time.
> 
> If you look at what we currently have, the existing InlineAsmDiag API
> works well for our purpose but has been stuck to that purpose. I
> would like to directly address the general case, otherwise, why do
> not we just extend the InlineAsmDiag?
> 
> Regarding your point for Values, I agree that if we want to pass
> Values the current API is not great.
> I think what I am proposing with the DiagnosticInfo could get the
> consensus here.
> 
> > 
> >> 
> >>  2. StringRef vs. ArrayRef<Value*>
> >>  First I will quote David Blaikie and Chris Lattner on StringRef
> >>  vs.
> >>  ArrayRef<StringRef>:
> >>  David: “Presumably there are diagnostics that are going to have
> >>  more than one parameter, though. ArrayRef<StringRef>?”
> >>  Chris: “It's possible, but the full generality of any diagnostic
> >>  can be folded (at some loss of generality) into a single string.
> >>   To balance complexity and generality vs simplicity, "1" is a
> >>  pretty decent number.”
> > 
> > Fair enough, and in terms of ArrayRef<StringRef> I completely
> > agree, but:
> > 
> > 1. We already have a message string (and I don't think that we need
> > more).
> They have different purpose:
> - One carries a default printing message.
> - The other carries the information to build a user report.
> 
> Let me try to illustrate the concept.
> Let us assume we are reporting a loop that cannot be optimized by the
> vectorizer.
> The default message could be: “Cannot optimize loop <llvm internal
> label> because of weird data dependencies”.
> The StringData could be:
> “<Function>##<SrcLocOfLoopHeader>##<dataDepencyKind>#<SrcLoc1>#<SrcLoc2>#<SrcLoc3>##<dataDepencyKind>#<SrcLoc1>#<SrcLoc3>##”.
> 
> The front-end would then produce a much more user friendly message,
> or point-out the interesting part in an IDE.
> 
> The introduction of the DIagnosticInfo intends to avoid the
> serialization/deserialization overhead.
> 
> > 
> > 2. Value* also gives us constants, metadata, and essentially
> > everything else. Even the backend often has values from things in
> > the MMO structure.
> Agree, but do we want to pay the boxing/unboxing cost?

To be honest, I think this is irrelevant. Error reporting is not a hot-path operation: the goal is to make the messages as useful as possible to the user.

> 
> > 
> > 3. The backend is the wrong place to do string formatting for user
> > consumption. Should it demangle function names too? ;)
> No, of course not :).
> The point of the default string was not really for user consumption
> but as a *default* message if the front-end is not able or does not
> want to produce something from the StringData.
> 
> > 
> >> 
> >>  Now, why using StringRef instead of Value, I see two reasons for
> >>  that:
> >>  - I believe we do not want the diagnostic system of the front-end
> >>  to have to deal with the IR.
> > 
> > - The IR is the interface between the frontend and the backend.
> > Will the frontend know what to do with some arbitrary Value*?
> > Probably not. That's not the point. That Value* can be a
> > ConstantInt, or a MDNode (maybe a srcloc, maybe a loop id, some
> > debug metadata, etc.), a global variable, and many other things.
> > Only the frontend knows how to format those things reasonably in
> > the context of the source language.
> Using my running example, you have to use tricks in the array of
> Value* (like in the StringData, BTW) to explain to the front-end
> that the first 3 values are for the first dependency and the next
> two other for the second dependency.
> My concern here is that an array may not be structured enough for
> nice exposure with variable length arguments for instance.
> We can argue that if we feature a formatting string this is not a
> problem, but I do not like that because I think we rather have the
> formatting string into the front-end for known report.
> 
> I understand that my example is kind of hand waving, but my goal is
> to expose these problematics so that we can come up with the best
> solution. I certainly do not pretend to have the right design in
> mind :).
> 
> 
> > 
> >>  - What happens for a report coming for the MC level (e.g.,
> >>  AsmParser), IIRC, there is no Value involved here.\
> > 
> > - The backends often do have values (from MMO), and can certainly
> > use IR interfaces to create constants, metadata, etc. as
> > necessary. I'd rather pass constants as constants, than having to
> > parse them again later out of a string.
> You made a point :).
> 
> > 
> >> 
> >>  3. Default message printer.
> >>  The idea of the default message printer is to provide a directly
> >>  usable message.
> > 
> > - I think that we should provide a utility function that will take
> > the string + the value array and produce a default serialization
> > of the message. Any frontend that does not really care too much
> > (or opt, for example), can use that.
> Assuming everything fits into a Value, that could work.
> However, I am not a huge fan of that, because the producer of the
> report has to do some formatting/localization. (See my example where
> the default message is more like an assert message with no
> parameters).
> > 
> >> I believe we do not what to push in every
> >>  front-end a parsing of the default message.
> > 
> > - I agree, but we really should give them that option.
> Well, like I said, I am not a huge fan, but if we can reach a
> consensus on that, why not.
> 
> > 
> > Moreover, like I said,
> >>  I think we may not be able to produce Value objects every time.
> > 
> > - I disagree, we can always produce value objects -- value objects
> > can be metadata strings, and are strictly more expressive than
> > strings.
> Fair enough, but again, I am not sure we want to pay the boxing cost
> and that it is structured enough for known cases.
> 
> > 
> >> 
> >>  What do you think?
> > 
> > There you have it ;)
> Thanks :).
> 
> -Quentin
> 
> > 
> > Thanks again,
> > Hal
> > 
> >> 
> >>  Cheers,
> >>  Quentin
> >> 
> >> http://llvm-reviews.chandlerc.com/D2190
> >> 
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory