[cfe-dev] How to emit diagnostics?

Chris Lattner clattner at apple.com
Fri Jun 12 22:15:01 PDT 2009


On Jun 12, 2009, at 4:03 PM, AlisdairM(public) wrote:
> Finally found time to hack a little code, starting with better
> detection of source encoding.

Ok.

> My plan is to detect a BOM in ContentCache::getBuffer() and initially
> flag an "unsupported encoding" error if a BOM is detected.

I think this is probably too low-level of a place to put this check.   
The clients of getBuffer() should do the check.

> A follow-up patch would then transcode from the detected encoding
> into a UTF-8 buffer before returning from the function, and flag an
> appropriate error if the file then turns out not to be correctly
> encoded after all.

Can you sketch out your implementation approach?  Will you *just*
transcode to UTF-8, or would you change UCNs as well?  In my (surely
naive) view, we'd want to just transcode the file encoding but leave
UCNs in the file, so that diagnostics print the high characters if
present and UCNs if present, without conflating them.  OTOH, the
identifier logic would canonicalize both to an internal form when
looking up the identifier for a token.

Is this workable?
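To make the split concrete: transcoding only rewrites the *bytes* of the file into UTF-8, so the six source characters of a UCN like \u00E9 stay literal in the buffer, and only identifier lookup folds the two spellings together. A toy sketch of that canonicalization step, under my assumptions (BMP-only, well-formed \uXXXX escapes, made-up function name; not Clang's real identifier code):

```cpp
#include <string>

// Hypothetical sketch: fold \uXXXX escapes into the UTF-8 bytes the
// character would occupy if written directly, so both spellings of an
// identifier map to one internal form.  The source buffer itself is
// left untouched; this runs only at identifier-lookup time.
static std::string canonicalizeUCNs(const std::string &Text) {
  std::string Out;
  for (size_t I = 0; I < Text.size();) {
    if (I + 5 < Text.size() && Text[I] == '\\' && Text[I + 1] == 'u') {
      // Assumes the four digits are valid hex (no error handling here).
      unsigned CP = std::stoul(Text.substr(I + 2, 4), nullptr, 16);
      // Encode the code point as UTF-8 (BMP-only for brevity).
      if (CP < 0x80) {
        Out += char(CP);
      } else if (CP < 0x800) {
        Out += char(0xC0 | (CP >> 6));
        Out += char(0x80 | (CP & 0x3F));
      } else {
        Out += char(0xE0 | (CP >> 12));
        Out += char(0x80 | ((CP >> 6) & 0x3F));
        Out += char(0x80 | (CP & 0x3F));
      }
      I += 6;
    } else {
      Out += Text[I++];
    }
  }
  return Out;
}
```

With this shape, `caf\u00E9` and a directly-written `café` canonicalize to the same byte string, while a diagnostic quoting the source still shows whichever spelling the user actually wrote.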

> I can see how the DiagnosticBuilder API works in other code, so I'm
> happy with the idea of re-using an existing "not supported" flag or
> creating a new one.  My problem is that I can't see how to hook a
> DiagnosticBuilder in the first place - there does not seem to be one
> available from within a member function of ContentCache and I'm not
> sure how else to obtain one.

Is the idea that SourceManager would always return buffers in UTF-8
form?  If the check has to be at this low a level, I think it would
be best to make SourceManager return an error code, and then have the
preprocessor (and other clients) emit the diagnostic as appropriate.
Emitting the diagnostic from SM directly would be problematic because
it doesn't know *why* the file was being read (e.g. in response to a
#include etc.), so it wouldn't be able to produce good diagnostic
info.
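The shape of that split might look something like the following (all names here are illustrative, not Clang's actual API): the low-level getter only reports *that* the read failed, and the caller, which knows it was servicing a #include, turns that into a diagnostic with real context.

```cpp
#include <string>

// Hypothetical sketch of the low-level/high-level split.  The buffer
// getter reports failure through an out-parameter instead of emitting
// a diagnostic itself.
enum class BufferError { None, UnsupportedEncoding };

static std::string getBufferData(const std::string &Raw, BufferError &Err) {
  // Pretend any UTF-16 BOM marks an encoding we can't handle yet.
  if (Raw.size() >= 2 && (Raw.compare(0, 2, "\xFE\xFF") == 0 ||
                          Raw.compare(0, 2, "\xFF\xFE") == 0)) {
    Err = BufferError::UnsupportedEncoding;
    return std::string();
  }
  Err = BufferError::None;
  return Raw;
}

// Caller side: the "preprocessor" knows *why* the file was read, so it
// can name the #include that triggered the failure.
static std::string enterIncludedFile(const std::string &Name,
                                     const std::string &Raw) {
  BufferError Err;
  std::string Data = getBufferData(Raw, Err);
  if (Err != BufferError::None)
    return "error: unsupported source encoding in file included as \"" +
           Name + "\"";
  return Data;
}
```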

> BTW, I have largely turned my UTF-x support plans on their head, and
> will start with diagnosing bad encodings,

Excellent, building from the bottom up sounds great.

> then support extended characters in identifiers (which also means
> correcting column numbers in diagnostics) and UCNs, before adding
> support for C++0x Unicode types and literals, and then (finally!)
> raw string literals.


If I have a file with:

  \u1234 x

I would expect "x" to have a column number of 9, not of 4.  However,
if the character was written directly in the source as a single high
character, I would expect it to have a column number of 4.  Do you
agree?
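That counting rule can be sketched as code: each source character advances the column by one, so a UCN spelled \u1234 is worth six columns while the directly-written character is worth one. A toy version (my own helper, not Clang's; assumes 0-based columns as in the example above, with the buffer already in UTF-8 so continuation bytes are skipped):

```cpp
#include <string>

// Hypothetical column computation: count source characters up to a
// byte offset, treating a multi-byte UTF-8 sequence as one character.
static unsigned columnOf(const std::string &Line, size_t ByteOffset) {
  unsigned Col = 0;
  for (size_t I = 0; I < ByteOffset && I < Line.size(); ++I) {
    // UTF-8 continuation bytes have the bit pattern 10xxxxxx; skip
    // them so a multi-byte character advances the column only once.
    if (((unsigned char)Line[I] & 0xC0) != 0x80)
      ++Col;
  }
  return Col;
}
```

On the two spellings of the line above (with its two-space indent), the "x" lands at column 9 when the UCN is spelled out and column 4 when U+1234 appears directly as the bytes E1 88 B4.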

Thanks for tackling this Alisdair!

-Chris


