[PATCH] D56754: Add Support for Creating and Deleting Unicode Files and Directories in Lit

Jason Mittertreiner via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Feb 12 13:15:12 PST 2019

jmittert marked an inline comment as done.
jmittert added inline comments.

Comment at: utils/lit/lit/util.py:444
+    # convert it to ascii
+    return text if isinstance(text, bytes) else text.encode('utf-8')
compnerd wrote:
> jmittert wrote:
> > serge-sans-paille wrote:
> > > jmittert wrote:
> > > > serge-sans-paille wrote:
> > > > > Bad news: I've been discussing with a Python Core Dev and he acknowledge this function to be ultra dangerous. We should find a way to have everything right without this tricky call.
> > > > > 
> > > > > My take would be to first have the whole script work in unicode mode (i.e. in Python3) and then do the minimal work to have it work on Python2. Does that makes sense to you?
> > > > Let me check that I understand correctly. Right now, lit uses strings on python3 which are unicode aware, and strings/bytes on python2 which are not unicode aware which causes issues when utf8 characters are passed to os.path on Windows.
> > > > 
> > > > What about alternatively adding a to_unicode a la the to_string and to_bytes in util.py which would return str on python3, and unicode on python2?
> > > > 
> > > > Then adding something like
> > > > ```
> > > > s = some_string
> > > > if windows:
> > > >     s = to_unicode(s)
> > > > os.path.some_func(s)
> > > > ```
> > > > before each call to os.path.some_func?
> > > Your first statement is correct. However I don't understand why there's anything windows specific in there. Let think about it this way: in Python2, strings where used to represent two different kind of objects : raw bytes, and unicode string. They are different types in Python3. So my advice would be:
> > > 
> > > 1. have the code work in Python3, inserting bytes to string conversion whenever needed. These conversion should not be platform specific.
> > > 
> > > 2. try to run the same code under Python2 and see where it fails.
> > > 
> > > Does that look like a reasonable path to you?
> > Hmm, I was under the impression that that was exactly what I've done here.
> > 
> > Lit works as of now with python3 on both Windows and Unix. The "convertToLocalEncoding" is essentially the fix up for when I ran it in python2 and saw where it failed. In order for the os.path functions to handle unicode properly on python 2, they need to take a `unicode` string on Windows. Otherwise, the os.path functions create/open a file with the literal utf-8 bytes which causes a broken filename for every other application that expects utf-16 on Windows. On Unix, utf-8 means things work fine in python2 as is. Passing a `unicode` type to os.path on unix/python2 gives an error about not being able to decode to ascii,  but just passing bytes works. 
> > 
> > Right now, with python 2 lit passes utf-8 bytes around (the python2 string/byte type) which works for Unix, but needs converting on Windows. What I have right now shouldn't result in any functional changes except for python2 with Windows.
> @serge-sans-paille this is a nifty workaround for Windows.  When on the python side, a unicode object is passed, the C side will use the `W` versions of the file APIs rather than `A` versions which is required to access any non-ASCII file path.  Additionally, the use of the `W` variant of the APIs permits the use of NT style paths (`\\?\` prefixed paths) which bypass the Win32 layer and go right to the kernel to avoid the 261 character limit.

@compnerd mentioned to me that you wanted to have this function changed to be called something along the lines of "convertToUnicode". I should mention though that this only converts to Unicode on Windows, On non Windows platforms it converts to bytes, not the unicode type. Think that convertToPlatformEncoding is maybe more descriptive instead?




More information about the llvm-commits mailing list