[PATCH] D45550: Use GetArgumentVector to retrieve the utf-8 encoded arguments on all platforms

Stella Stamenova via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 12 09:13:59 PDT 2018


stella.stamenova added a comment.

There are a couple of things going on here:

1. The test itself is verifying that each of the tools can properly handle non-ASCII characters in the path. This indicates that each of the tools *should* handle non-ASCII characters in their inputs correctly
2. The test only passes on Windows with Python 2 because the lit code for Python 2 uses a byte array to pass the input thus breaking the one Unicode pound character into two characters. In Python 3, the pound character is passed correctly as a single character causing the test to fail on Windows. This is the same behavior as the command line in Windows - the tools fail to decode non-ASCII characters, so while we caught this in a test, this is a bug in the tools assuming they should handle non-ASCII characters correctly.

The reason the test fails on Windows without either the change that Aaron is undoing or the change to use GetArgumentVector is exactly because the inputs are not UTF-8 encoded, but rather encoded with the windows default code page. This means that we cannot treat the input as always UTF-8 encoded.

The original fix that we submitted was to modify the Windows Path.inc to try to decode the input as UTF-8 first (which is the Unix default encoding) and then try the windows default code page. This impacts the way any tool that uses Path.inc on Windows behaves and made *all* of the tools (llvm-mc, ld.lld, llvm-objdump, etc.) handle non-ASCII characters correctly whether they were encoded as UTF-8 or with the windows default code page. The drawback of this, as Eli pointed out, is that now the tools are *guessing* what the right encoding is. This happens to work a lot more frequently than assuming UTF-8 but it is not entirely correct.

The fix the Eli suggested was to use GetArgumentVector. On Windows, this uses a combination of GetCommandLineW and CommandLineToArgW <https://msdn.microsoft.com/en-us/library/windows/desktop/bb776391(v=vs.85).aspx> to return the arguments are proper UTF-8. Now we are no longer guessing what the encoding is and the tools again handle non-ASCII characters correctly. The drawback here is that every tool needs to be updated to use GetArgumentVector in order to properly handle non-ASCII characters on Windows. In this case, to get this one test to work, we have to touch three tools - ld.lld, llvm-mc and llvm-objdump. I briefly looked through the other tools and there's at least another half dozen that will need the same change to work correctly on Windows.

At the end of the day, using GetArgumentVector is the *correct* change which needs to be applied to all tools, but using the windows default code page is the change that works almost always for all the tools. In order to keep the code cleaner, we've opted for the correct change with the understanding that we had to update at least the three tools to keep the test passing on Windows with Python 3 and that eventually the rest of the tools will need the same update.


Repository:
  rLLD LLVM Linker

https://reviews.llvm.org/D45550





More information about the llvm-commits mailing list