[PATCH] D30773: Make git-clang-format python 3 compatible

Sat Mar 11 03:10:25 PST 2017

mgorny added inline comments.

================
Comment at: tools/clang-format/git-clang-format:306
+    try:
+        return to_string(bytes.decode('utf-8'))
+    except AttributeError: # 'str' object has no attribute 'decode'.
----------------
EricWF wrote:
> mgorny wrote:
> > EricWF wrote:
> > > mgorny wrote:
> > > > This logic looks really weird to me. What is the purpose of having both `to_string()` and `convert_string()`? Why do `to_bytes()` and `to_string()` use `isinstance()` to recognize types, and here you rely on exceptions? Why is `to_string()` called after decoding?
> > > `to_string` is called after decoding because in python2 the result of decoding is a `unicode` type, and we need to encode it a `str` type. Hense to_string.
> > No offense intended but this sounds really horrible. In modern Python, everything is either `bytes` or `unicode`. The difference basically is that `str` in py2 was pretty much `bytes` (except that `bytes` explicitly removes some operations that are unsuitable for bytestrings), and that `str` in py3 is equivalent to `unicode` before.
> > 
> > So if you are specifically converting to `str`, it means that you want to have two distinct types in py2/py3. Which really sounds like you're doing something wrong.
> No offence taken. I had to do way to much work to answer your questions accurately which means the code is way too complicated or non-obvious.
> 
> However there are a couple of things you got wrong. In python2 everything is normally `bytes`, `str`, or `unicode`. The `to_string` method converts both `unicode` and `bytes` to the type `str`. 
> 
> > So if you are specifically converting to str, it means that you want to have two distinct types in py2/py3. Which really sounds like you're doing something wrong.
> 
> I don't think having the Python library return different types in py2/py3 means something is wrong. In fact that's exactly what's happening and  that's exactly what `convert_string` is trying to fix. In python 2 `stdout, stderr, and stdin` return strings, but in python 3 they return `bytes`. `convert_string` is meant to transform these different types into `str`.
> 
> Regardless I'll remove the call to `to_bytes` from inside `to_string` because it's too confusing.
In Python 2, `bytes` and `str` is the same type, e.g.:
```
In [13]: bytes('foo').__class__
Out[13]: str
```

In other words, in each version of Python there are two kinds of strings: binary strings (`std::string` in C++) and Unicode strings (alike `std::u32string`). In Python 2, `str` is binary string (`bytes` is also accepted for compatibility) and `unicode` is the Unicode string. In Python 3, `bytes` is binary string and `str` is Unicode string.

---

Now, let's consider subprocess streams. In Python 2 they use `str` -- which means binary strings. In Python 3 they use `bytes` -- i.e. also binary strings. So there's really no change here, except that Python 3 is more strict in handling different types.

So if you plan to use the data as binary data, you can just use the data directly. If you need to use Unicode strings, you can reliably use `.decode()` and `.encode()` to convert. If you really believe you need to use `str` (i.e. two distinct types at the same time), it just means that the code is most likely wrong and partially relies on characteristic of one type, and partially on the other.

---

For completeness, I should also mention that string types used by standard system streams (i.e. `sys.std*`) *have* changed between Python 2 and 3. Python 2 is operating them in `bytes` mode by default, while Python 3 is operating them (transcoding) in `str` (i.e. unicode) mode. In the latter case, you can use the `.buffer` attribute to access the underlying `bytes` stream -- e.g. `sys.stdin.buffer` will yield bytes.

https://reviews.llvm.org/D30773