[PATCH] D45550: Use GetArgumentVector to retrieve the utf-8 encoded arguments on all platforms

Thu Apr 12 12:54:34 PDT 2018

stella.stamenova added a comment.

Speaking of the *test* itself, the fact that Python 2 uses CreateProcessA on Windows is just one of the problems.

This function (from utils\lit\lit\util.py) is the reason the test passes on Windows without any of the changes when run in Python 2. What this ends up doing is calling `return b.encode('utf-8')` on the inputs in the test (since they are Unicode and not simple strings), so the path being passed to llvm-mc, ld.lld and llvm-objdump in the test no longer contains the pound sign (£), but rather contains two separate bytes representing pound. Neither of the bytes is really a proper ASCII character, so in a way, the test is still verifying that the tools will accept non-ASCII characters. I think this defeats the purpose of the test, however, since we cannot confirm through it that *any* non-ASCII character will be treated correctly and the result doesn't contain the character we started with in the first place. Python 3 (through its addition of str as Unicode) does pass the character correctly as pound to the tools.

This is a limitation of the current lit infrastructure and for the test to really be valid, lit would need some additional changes. I do think that as long as the tools are expected to support non-ASCII characters, this test and others like it should exist to verify that it is the case. In fact, all tools that are expected to support non-ASCII characters should have at least one test.  (BTW, I am not sure whether there are other tests currently that are similarly impacted by the `to_string` function since not all tests run on Windows and we use Python 3 only for our Windows testing.)

  def to_string(b):
      """Return the parameter as type 'str', possibly encoding it.

      In Python2, the 'str' type is the same as 'bytes'. In Python3, the
      'str' type is (essentially) Python2's 'unicode' type, and 'bytes' is
      distinct.

      """
      if isinstance(b, str):
          # In Python2, this branch is taken for types 'str' and 'bytes'.
          # In Python3, this branch is taken only for 'str'.
          return b
      if isinstance(b, bytes):
          # In Python2, this branch is never taken ('bytes' is handled as 'str').
          # In Python3, this is true only for 'bytes'.
          try:
              return b.decode('utf-8')
          except UnicodeDecodeError:
              # If the value is not valid Unicode, return the default
              # repr-line encoding.
              return str(b)

      # By this point, here's what we *don't* have:
      #
      #  - In Python2:
      #    - 'str' or 'bytes' (1st branch above)
      #  - In Python3:
      #    - 'str' (1st branch above)
      #    - 'bytes' (2nd branch above)
      #
      # The last type we might expect is the Python2 'unicode' type. There is no
      # 'unicode' type in Python3 (all the Python3 cases were already handled). In
      # order to get a 'str' object, we need to encode the 'unicode' object.
      try:
          return b.encode('utf-8')
      except AttributeError:
          raise TypeError('not sure how to convert %s to %s' % (type(b), str))

Repository:
  rLLD LLVM Linker

https://reviews.llvm.org/D45550