[cfe-dev] Fixits with multibyte chars

Mon Jul 16 12:19:30 PDT 2012

One problem with column numbers is that different consumers find different definitions useful. The two main use cases for column numbers are for visual display and for indexing into an array containing some representation of the source in order to do something with that location.

The first case is intimately tied to the specific renderer displaying the source. Depending on the rendering, column numbers could take into account combining characters, wide characters, characters whose width varies (tabs) or differs between fonts, etc. The second case needs different column numbers depending on how the source is encoded; a 'column number' could index a UTF-8 code unit, a UTF-16 code unit, a code point/UTF-32 code unit, or a code unit for some other encoding.
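
To make the encoding dependence concrete, here is a minimal sketch in C of how many code units a single code point occupies in each Unicode encoding form. The function names are made up for illustration and are not anything in Clang. For the ∆ in the quoted example (assuming it is U+2206), this gives three UTF-8 code units but only one UTF-16 or UTF-32 code unit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: number of code units needed to encode one
   code point. Assumes cp is a valid Unicode scalar value. */

/* UTF-8 uses 1 to 4 bytes depending on the magnitude of the code point. */
size_t utf8_units(uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* UTF-16 uses one unit inside the BMP and a surrogate pair outside it. */
size_t utf16_units(uint32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

/* UTF-32 always uses exactly one unit per code point. */
size_t utf32_units(uint32_t cp) {
    (void)cp;
    return 1;
}
```

A column number that counts UTF-8 code units therefore drifts from one that counts UTF-16 code units or code points as soon as a non-ASCII character appears earlier on the line.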

Most of these complexities aren't currently taken into account. Almost everywhere, 'column number' is an index into the source buffer (though one-indexed rather than zero-indexed), which means the value is in UTF-8 code units. I did some work on the console diagnostics display for Unicode, where I try to convert from source-index column numbers to console display column numbers using wcswidth. These console display column numbers aren't used anywhere except to get the range highlighting and fixits to display with the correct alignment. That change was mostly in r154980 / git commit 6749dd50869281.

We need to provide column numbers that are reasonably useful to any consumer. Text renderers will have their own method of calculating column numbers, so we just need to provide indexes into the source that anyone can use. Providing column numbers in terms of code units for a particular encoding means a client that uses that encoding can directly index into their source line buffer. That also forces clients using other encodings to go through contortions to use that column number. Alternatively, column numbers in terms of code points should be relatively straightforward for a client using any Unicode encoding to convert into a column number it can use.
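
As a rough illustration of the client side of the code point option, the sketch below shows how a client holding its line as UTF-8 might turn a one-indexed code point column into a one-indexed byte column. The function name is hypothetical, and the code assumes the line is valid, NUL-terminated UTF-8.

```c
#include <stddef.h>

/* Hypothetical client-side helper: map a 1-indexed code point column
   to a 1-indexed UTF-8 byte column. Assumes `line` is valid UTF-8. */
size_t codepoint_col_to_byte_col(const char *line, size_t cp_col) {
    size_t i = 0;      /* byte index of the current character's lead byte */
    size_t seen = 1;   /* code point column of the character at index i */
    while (line[i] != '\0' && seen < cp_col) {
        ++i;
        /* Skip continuation bytes (10xxxxxx); they belong to the
           character we just stepped over. */
        while (((unsigned char)line[i] & 0xC0) == 0x80) ++i;
        ++seen;
    }
    return i + 1;
}
```

For the line ` printf("∆: %d", 1L);` from the quoted example (with its leading space), the `%` is at code point column 13 and this maps it to byte column 15, since the ∆ before it takes three bytes. A UTF-16 client would need a similar loop that counts two units for each surrogate pair.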

Another option would be to provide column numbers in terms of code units but let the client choose the encoding, so that clients using a particular representation could request column numbers directly relevant to them.
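
One way this last option could look, purely as a sketch: the producer keeps its internal UTF-8 byte column and converts on request. The enum and function below are invented for illustration and do not correspond to any existing Clang interface. The code assumes valid UTF-8 and a byte column that points at the start of a character.

```c
#include <stddef.h>

/* Hypothetical: the unit a client wants its column numbers expressed in. */
enum column_unit { COL_UTF8, COL_UTF16, COL_UTF32 };

/* Convert a 1-indexed UTF-8 byte column into a 1-indexed column counted
   in the requested code units. Assumes `line` is valid UTF-8. */
size_t convert_column(const char *line, size_t byte_col, enum column_unit unit) {
    size_t col = 1;
    if (unit == COL_UTF8) return byte_col;  /* already in UTF-8 code units */
    for (size_t i = 0; i + 1 < byte_col && line[i] != '\0'; ++i) {
        unsigned char b = (unsigned char)line[i];
        if ((b & 0xC0) == 0x80) continue;   /* continuation byte: no new unit */
        /* A 4-byte lead byte (11110xxx) encodes a code point above U+FFFF,
           which needs a surrogate pair (two units) in UTF-16. */
        col += (unit == COL_UTF16 && b >= 0xF0) ? 2 : 1;
    }
    return col;
}
```

With this shape, a UTF-16 based editor and a UTF-8 based tool could both index their own buffers directly, at the cost of the producer carrying the conversion logic.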

- Seth

On Jul 16, 2012, at 1:32 PM, Jordan Rose wrote:

> Hi, everyone. We recently hit an assertion when trying to output a fixit with Unicode characters in it; it reduces down to this:
> 
> void test() {
>  printf("∆: %d", 1L);
> }
> 
> I could of course just disable fixits when there are Unicode characters involved, but I'd like to fix this the right way. The trouble is -fdiagnostics-parseable-fixits, which is supposed to be machine-readable output: in this case, is a three-byte UTF-8 character three columns or one column? I think one column is the right way to go, but I wanted to get some other opinions before I start working on a patch.
> 
> I'd be getting around to this soon anyway; it's blocking PR13178 (fixit for smart quotes).
> 
> Thanks,
> Jordan