[PATCH] D43165: [lit] Fix problem in how Python versions open files with different encodings

Wed Feb 21 13:14:31 PST 2018

stella.stamenova added a comment.

In https://reviews.llvm.org/D43165#1012857, @MatzeB wrote:

> Reading python docu and playing a bit with a prototype I believe this is the way to do things:
>
>   import sys
>   import difflib
>   f = open(sys.argv[1], 'rb').readlines()
>   f2 = open(sys.argv[2], 'rb').readlines()
>   if hasattr(difflib, 'diff_bytes'):
>       # python 3.5 or newer
>       gen = difflib.diff_bytes(difflib.unified_diff, f, f2)
>       sys.stdout.buffer.writelines(gen)
>   else:
>       # python 2.7
>       gen = difflib.unified_diff(f, f2)
>       sys.stdout.writelines(gen)
>
>
> (python 3.0-3.4 difflib before diff_bytes appears to be broken indeed for inconsistent/invalid encoded inputs, but I hope we can just ignore old 3.x versions...)

This actually has another issue (other than python 3.0 - 3.4). After we read the lines from the file, the code assumes that the lines are all strings for the purpose of ignoring whitespace.

So we have three options, none of which are 100% safe if we want to keep the change localized to here (options are below).

There's a fourth option which is to re-write the tests to explicitly compare binary files (I think somebody already suggested that). This would mean adding a function here which compares binary files (or at least an option to the existing function) and changing the tests. Some of the tests right now compare all the files in two folders, rather than individual files, so the tests would need changes to compare all files EXCEPT the binary files and then explicitly the binary files. This is not a trivial change either and we would have to make sure that we find all such tests in LLVM, LLD, Clang and LLDB.

Here are the options if we want to make a change just to the TestRunner. I've not run ALL the tests across LLVM, LLDB, Clang and LLD with any of these changes yet.

1. Open as binary, then decode, do whitespace operations, encode again and do the comparison with diff_bytes or unified_diff. We are assuming we can safely remove whitespace characters from binary files and we have an issue with python 3.0 - 3.4. The decode is using the same encoding as diff_bytes.

  with open(file, 'rb') as f:
      filelines.append(f.readlines())
  for idx, lines in enumerate(filelines):
      filelines[idx]= [line.decode('ascii', 'surrogateescape') for line in lines]

  # do white space operations

  for idx, lines in enumerate(filelines):
      filelines[idx]= [line.encode('ascii', 'surrogateescape') for line in lines]

  if hasattr(difflib, 'diff_bytes'):
      # python 3.5 or newer
      diffs = difflib.diff_bytes(difflib.unified_diff, filelines[0], filelines[1], filepaths[0].encode(), filepaths[1].encode())
      diffs = [diff.decode() for diff in diffs]
  else:
      # python 2.7
      func = difflib.unified_diff if unified_diff else difflib.context_diff
      diffs = func(filelines[0], filelines[1], filepaths[0], filepaths[1])

2. Open as binary, then decode and proceed as usual. We are assuming we can safely remove whitespace characters from binary files and that comparing the decoded strings will produce the desired results. The decode is using the same encoding as diff_bytes.

  with open(file, 'rb') as f:
      filelines.append(f.readlines())
  for idx, lines in enumerate(filelines):
      filelines[idx]= [line.decode('ascii', 'surrogateescape') for line in lines]

3. Try to open the file assuming it's text and if we fail, open as binary. Then compare as binary or string depending on how the file was opened. This is more complicated and we still have an issue for python versions between 3.0 and 3.4 BUT it preserves the original logic as much as possible and we're not removing whitespace characters from binary files.

  compare_bytes = False
  for file in filepaths:
      if compare_bytes is False:
          # Assume every file is a text file
          try:
              with open(file, 'r') as f:
                  filelines.append(f.readlines())
          except UnicodeDecodeError:
              with open(file, 'rb') as f:
                  filelines.append(f.readlines())
                  compare_bytes = True
      else:
          # Since we are opening two files to compare against each other, ensure that both files are opened as binary if one is
          with open(file, 'rb') as f:
              filelines.append(f.readlines())

  if compare_bytes is False:
      # perform logic to ignore whitespace

  if compare_bytes is True:
      if hasattr(difflib, 'diff_bytes'):
          # python 3.5 or newer, we will use diff_bytes
          compare_bytes = True
      else:
          # python 2.7, we can compare the bytes as strings
          compare_bytes = False

  if compare_bytes:
      # python 3.5 or newer
      diffs = difflib.diff_bytes(difflib.unified_diff, filelines[0], filelines[1], filepaths[0].encode(), filepaths[1].encode())
      # To be able to print the diffs below, we need to decode them to strings
      diffs = [diff.decode() for diff in diffs]
  else:
      # python 2.7
      func = difflib.unified_diff if unified_diff else difflib.context_diff
      diffs = func(filelines[0], filelines[1], filepaths[0], filepaths[1]) 

Repository:
  rL LLVM

https://reviews.llvm.org/D43165