[PATCH] D34793: [lit] Fix some convoluted logic around Unicode encoding, and de-duplicate across modules that used it.

David Jones via llvm-commits llvm-commits at lists.llvm.org
Thu Jun 29 00:15:40 PDT 2017


On Wed, Jun 28, 2017 at 11:16 PM, Zachary Turner <zturner at google.com> wrote:

> I'm curious why you think file names shouldn't be Unicode. Seems pretty
> reasonable to me?


Well, it's certainly not unreasonable, but the underlying encoding is
platform-specific, so treating filenames as Unicode is not always convenient.

For example, on POSIX systems it's perfectly possible to have filenames
which are not valid UTF-8. In that case, you can use lone surrogates to
represent the "invalid" bytes (as Py3 does, via the surrogateescape error
handler). Superficially, that just means you need to jump through more
hoops to represent them; but now you have to figure out what to do with
your surrogate-holding string:

>>> os.listdir('/tmp/badfiles')[1]
'file-\udcff'
>>> print(os.listdir('/tmp/badfiles')[1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 5: surrogates not allowed


And I gather that being able to print the filename in the terminal is a
pretty valuable goal. It's all sort of a mess, frankly. :-/ The lowest
common denominator remains the POSIX portable filename character set
(A-Z, a-z, 0-9, '.', '_' and '-').
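
For what it's worth, here is a minimal sketch of one way to cope with such a
name: escape the lone surrogates before printing, while keeping the original
string around so the raw bytes can still be recovered. The helper name
`printable` is hypothetical (it is not part of lit or of the patch), and the
sketch assumes a UTF-8 filesystem encoding rather than asking the platform.

```python
def printable(name):
    """Return a copy of `name` that can always be encoded for display."""
    # Lone surrogates produced by the surrogateescape handler cannot be
    # encoded as UTF-8, so turn them into visible backslash escapes instead.
    return name.encode("utf-8", "backslashreplace").decode("utf-8")

# Simulate what os.listdir() hands back for the raw bytes b"file-\xff"
# when the filesystem encoding is UTF-8 (an assumption of this sketch).
raw = b"file-\xff"
name = raw.decode("utf-8", "surrogateescape")

# Safe to print: the bad byte shows up as a literal \udcff escape.
print(printable(name))

# The original bytes are still recoverable from the unescaped string.
recovered = name.encode("utf-8", "surrogateescape")
```

In real code, os.fsdecode() and os.fsencode() do the same round trip using
whatever encoding and error handler the platform actually uses, which is
probably what you want when talking to the filesystem.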

(One interesting tidbit I ran across while double-checking my change was
the Python test for Unicode filename handling:

https://github.com/python/cpython/blob/6f0eb93183519024cb360162bdd81b9faec97ba6/Lib/test/test_unicode_file_functions.py#L10

Especially the Darwin block... 4 different decompositions of each filename!
I mean, it's good to cover your bases... there are just so many of them.)
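
To make the "so many of them" point concrete, here is a small sketch,
assuming the decompositions in question correspond to the four standard
Unicode normalization forms (NFC, NFD, NFKC, NFKD). The sample name is made
up for illustration and is not taken from the CPython test.

```python
import unicodedata

# Hypothetical sample name: U+00E9 (e with acute accent) followed by
# U+FB01 (the "fi" ligature).
name = "\u00e9\ufb01"

# Normalize the same text under each of the four standard forms.
forms = {form: unicodedata.normalize(form, name)
         for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, text in forms.items():
    # ascii() shows the code points unambiguously, whatever the terminal does.
    print(form, ascii(text))

# Count how many distinct strings the one "name" turned into.
distinct = len(set(forms.values()))
```

All four results compare unequal as Python strings, so a filesystem that
normalizes names on its own can hand back a different spelling than the one
you created the file with.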