<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - non-ascii source files cannot be shown in scan-view"
href="https://bugs.llvm.org/show_bug.cgi?id=40765">40765</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>non-ascii source files cannot be shown in scan-view
</td>
</tr>
<tr>
<th>Product</th>
<td>clang
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Static Analyzer
</td>
</tr>
<tr>
<th>Assignee</th>
<td>dcoughlin@apple.com
</td>
</tr>
<tr>
<th>Reporter</th>
<td>johannes@sipsolutions.net
</td>
</tr>
<tr>
<th>CC</th>
<td>dcoughlin@apple.com, llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>Opening a non-ASCII source file in scan-view just produces:
INTERNAL ERROR
Traceback (most recent call last):
  File "/usr/share/clang/scan-view-9/share/ScanView.py", line 232, in do_GET
    SimpleHTTPRequestHandler.do_GET(self)
  File "/usr/lib/python2.7/SimpleHTTPServer.py", line 45, in do_GET
    f = self.send_head()
  File "/usr/share/clang/scan-view-9/share/ScanView.py", line 712, in send_head
    return self.send_path(path)
  File "/usr/share/clang/scan-view-9/share/ScanView.py", line 727, in send_path
    return self.send_patched_file(path, ctype)
  File "/usr/share/clang/scan-view-9/share/ScanView.py", line 774, in send_patched_file
    return self.send_string(data, ctype, mtime=fs.st_mtime)
  File "/usr/share/clang/scan-view-9/share/ScanView.py", line 747, in send_string
    encoded_s = s.encode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 111162: ordinal not in range(128)
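ScanView.py line 747 reads:
    encoded_s = s.encode()
Under Python 2.7, s is already a byte string here, so .encode() first decodes
it implicitly as ASCII, which fails on the 0x92 byte. Changing the line to just
    encoded_s = s
appears to work around the problem.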
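
A slightly more defensive variant of that workaround could look like the
sketch below. It assumes send_string may be handed either text or raw bytes;
to_bytes is a hypothetical helper, not something that exists in ScanView.py,
and the UTF-8 default is likewise an assumption.

```python
def to_bytes(s, encoding="utf-8"):
    """Return s as bytes: encode text, pass existing bytes through untouched."""
    if isinstance(s, bytes):
        # Already raw bytes (e.g. file contents read in binary mode).
        # Calling .encode() on them is what triggers the implicit ASCII
        # decode on Python 2, so they are returned unchanged.
        return s
    return s.encode(encoding)


# Hypothetical sample line containing the 0x92 byte from the traceback.
raw = b"/* it\x92s a comment */"
assert to_bytes(raw) == raw
assert to_bytes(u"caf\xe9") == b"caf\xc3\xa9"
```

With something like this, line 747 would become encoded_s = to_bytes(s), which
keeps working if a caller ever passes a text string.
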
It's not clear what _should_ be done about this, though. C source files can be
in any encoding, particularly in comments, and we can't really know which one
a given file uses. Most of our files are UTF-8, but some older ones are
ISO-8859-1 or similar, depending on what the author used. Ideally, I guess, the
bytes are just passed through more or less unchanged; in the worst case some
text shows up as garbage in the browser, which is still better than crashing.</pre>
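<pre>If the file contents ever do need to be handled as text rather than passed
through, one possible approach is to try the likely encodings in order. This is
only a sketch under that assumption: decode_lenient is a hypothetical name, and
the choice and order of encodings simply mirror the UTF-8 and ISO-8859-1 files
mentioned above.

```python
def decode_lenient(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode raw bytes with the first encoding that accepts them."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Not reached with the defaults, since ISO-8859-1 maps every byte.
    # Kept so that a custom encoding list still degrades to replacement
    # characters instead of raising.
    return raw.decode("utf-8", errors="replace")


assert decode_lenient(b"caf\xc3\xa9") == u"caf\xe9"   # valid UTF-8
assert decode_lenient(b"it\x92s") == u"it\x92s"       # falls back to ISO-8859-1
```

A wrong guess only yields odd characters in the browser, which matches the
"garbage is better than crashing" goal.
</pre>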
</div>
</p>
</body>
</html>