<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/63584>63584</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            clang/tools/scan-build-py: inconsistent encoding for open()
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          vladimir-ivanov-ncbi
      </td>
    </tr>
</table>

<pre>
    We have a problem with the Python version of the static analyzer. We use LLVM 16.0.0, but as far as I can see the Python scripts have not changed in the git repo since then.

**$ /usr/local/llvm/16.0.0/bin/analyze-build -o . -vvvv**
```
analyze-build: DEBUG: document: count crashes and bugs
analyze-build: WARNING: report_directory: Run 'scan-view scan-build-2023-06-27-13-49-37-750428-xwteiddx' to examine bug reports.
analyze-build: ERROR: wrapper: Internal error.
Traceback (most recent call last):
  File "/usr/local/llvm/16.0.0/lib/libscanbuild/__init__.py", line 125, in wrapper
    return function(*args, **kwargs)
  File "/usr/local/llvm/16.0.0/lib/libscanbuild/analyze.py", line 89, in analyze_build
    number_of_bugs = document(args)
  File "/usr/local/llvm/16.0.0/lib/libscanbuild/report.py", line 35, in document
    for bug in read_bugs(args.output, html_reports_available):
  File "/usr/local/llvm/16.0.0/lib/libscanbuild/report.py", line 282, in read_bugs
    for bug in parser(bug_file):
  File "/usr/local/llvm/16.0.0/lib/libscanbuild/report.py", line 421, in parse_bug_html
    for line in handler.readlines():
  File "/opt/python-3.9/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 1223: invalid continuation byte
analyze-build: ERROR: wrapper: Please report this bug and attach the output to the bug report
```
```
$ python --version
Python 3.9.17
```

It fails while generating the final `index.html` because it cannot read some of the generated `report-*.html` files. The problem is the UTF-8 encoding added in this commit: https://github.com/llvm/llvm-project/commit/70b06fe8a1862f5b9a0ef4e5e9098c1e457cf275

Python's `open()` uses "encoding=None" by default:
> The default encoding is platform dependent (whatever locale.getencoding() returns), but any text encoding supported by Python can be used.
https://docs.python.org/3/library/functions.html#open

So the generated reports are written using one encoding and read back as UTF-8, which is not correct and leads to errors like the one above. The same encoding should be used everywhere.
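
The mismatch described above can be reproduced in isolation. The following is only a minimal sketch, not code from libscanbuild: the `roundtrip` helper and the sample string are hypothetical, and `latin-1` is just a stand-in for whatever encoding actually produced the bytes in the report files.

```python
import os
import tempfile


def roundtrip(text, write_encoding, read_encoding, errors="strict"):
    """Write text with one encoding and read it back with another.

    Hypothetical helper for illustration; not part of libscanbuild.
    """
    fd, path = tempfile.mkstemp(suffix=".html")
    os.close(fd)
    try:
        with open(path, "w", encoding=write_encoding) as handle:
            handle.write(text)
        with open(path, "r", encoding=read_encoding, errors=errors) as handle:
            return handle.read()
    finally:
        os.remove(path)


# U+00CA is the single byte 0xCA in latin-1, which is not valid UTF-8 here.
sample = "na\u00cave"

# Same encoding on both sides: the text survives unchanged.
assert roundtrip(sample, "latin-1", "latin-1") == sample

# Different encodings: strict decoding raises UnicodeDecodeError.
try:
    roundtrip(sample, "latin-1", "utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# A lenient error handler avoids the crash, at the cost of losing the byte.
print(roundtrip(sample, "latin-1", "utf-8", errors="replace"))
```

Passing the same explicit `encoding=` to every `open()` call is one way to keep both sides consistent. Using `errors="replace"` on the reading side is a possible fallback when the file contents cannot be controlled.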

I tried running Python with UTF-8 mode enabled, but it didn't help.
https://peps.python.org/pep-0540/

```
$ locale charmap
ANSI_X3.4-1968

$ locale 
LANG=en_US.UTF-8
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=POSIX
```
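
To check which encoding `open()` would pick by default in an environment like the one above, a small diagnostic along these lines could be run with the same interpreter. This is a sketch only; the `describe_text_encoding` helper is a hypothetical name, and the output will differ between systems and Python versions.

```python
import locale
import sys


def describe_text_encoding():
    """Collect the settings that influence the default encoding of open().

    Hypothetical helper for illustration; not part of libscanbuild.
    """
    return {
        # True when Python's UTF-8 mode (PEP 540) is active.
        "utf8_mode": bool(sys.flags.utf8_mode),
        # Encoding used by open() when encoding=None.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for key, value in describe_text_encoding().items():
        print(f"{key}: {value}")
```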

</pre>