<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - MSAN read-past-string-end in CXString"

   href="https://bugs.llvm.org/show_bug.cgi?id=35896">35896</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>MSAN read-past-string-end in CXString

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>clang

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>libclang

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedclangbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>steve@obrien.cc

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>klimek@google.com, llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Class `CXString` has a method `createRef(llvm::StringRef)` that tries to

reference the bytes of an existing string, without copying, if possible.  (We

can assume the pre-existing string bytes' memory remains unchanged, allocated,

and otherwise "good".)

A `StringRef` represents a run of sequential chars in memory; whereas a

`CXString` always points to a C-like string, i.e., there must be an array

somewhere of bytes, terminated by a NUL character.

`StringRef` has no such NUL-terminator requirement; so `createRef`, which

wants to recycle existing memory, is dealing either with a NUL-terminated string

(which it can reuse) or with non-NUL-terminated bytes that it has to copy into

a new array, with one extra byte for that terminator.

The trouble is this: `CXString` checks the byte at `str[stringLength]`, which

is technically out-of-bounds for the string.  If that byte is 0 then it's a

NUL-terminated C string and it can be reused (otherwise it has to be copied).

Since that access is one past the bounds of the string, this raises an MSAN

error.
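
To illustrate, here is a rough Python model of that check (the real code is
C++, and this is not it).  Everything named here is an assumption made for the
sketch: `memory` is a `bytes` buffer standing in for process memory, a
`StringRef` is modelled as an (offset, length) pair into it, and `create_ref`
with its ("borrowed"/"owned", data) return value is hypothetical.

```python
# Illustrative model only: `memory` stands in for process memory, and a
# StringRef is modelled as an (offset, length) pair into it.

def create_ref(memory, offset, length):
    """Reuse the bytes if a NUL follows them, otherwise copy them."""
    end = offset + length
    # This is the problematic read: index `end` is one past the StringRef.
    # It may belong to another object, be uninitialised, or not exist at all.
    if memory[end] == 0:
        return ("borrowed", memory[offset:end])
    # No NUL after the string: copy into a new array with room for one.
    return ("owned", memory[offset:end] + b"\x00")


memory = b"hello\x00worldXYZ"
print(create_ref(memory, 0, 5))    # "hello" is followed by NUL: reused
print(create_ref(memory, 6, 5))    # "world" is followed by "X": copied
try:
    create_ref(memory, 11, 3)      # "XYZ" ends exactly at the buffer's end
except IndexError as exc:
    print("read past the end:", exc)
```

In this model the third call fails with an `IndexError`; in C++ the same read
just returns whatever byte happens to be there, which is the access MSAN
complains about.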

One easy fix is to always copy the string data and never attempt to reuse bytes

from a `StringRef`.  I fear that increased byte-copies will waste both memory

and CPU.  (As correct as this approach is, it's inefficient.)
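
A minimal sketch of the always-copy approach, in the same kind of Python model
(a `bytes` buffer standing in for memory, a `StringRef` as offset plus length).
The name `create_dup` is hypothetical and only used for illustration.

```python
# Illustrative model only: never inspects the byte after the StringRef.

def create_dup(memory, offset, length):
    """Always copy the bytes and append the NUL terminator ourselves."""
    end = offset + length
    return memory[offset:end] + b"\x00"


memory = b"hello\x00worldXYZ"
copy = create_dup(memory, 11, 3)   # "XYZ" ends at the buffer's end: still fine
print(copy)                        # b'XYZ\x00'
print(len(copy))                   # 4: three bytes plus the terminator
```

Every call allocates `length + 1` bytes, even when the source was already
NUL-terminated, which is the memory and CPU cost mentioned above.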

Another is to make `CXString`s look more like `StringRef`s by including a

length or end-of-string pointer, which avoids the NUL requirement.  But as this

library is used primarily from another language (via the `cindex` Python

bindings), I'm not sure whether this is feasible.</pre>
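<pre>
As a rough feasibility check on the Python side, the sketch below reads a
pointer-plus-length string with `ctypes`, using `ctypes.string_at(address,
size)`.  The `SizedCXString` layout is hypothetical, invented for this example;
it is not libclang's actual `CXString` struct, and the real bindings would need
matching changes.

```python
import ctypes


class SizedCXString(ctypes.Structure):
    # Hypothetical layout for illustration; not libclang's actual CXString.
    _fields_ = [("data", ctypes.c_void_p), ("length", ctypes.c_size_t)]


def to_bytes(s):
    # Read exactly `length` bytes; no byte past the end is ever inspected.
    return ctypes.string_at(s.data, s.length)


# An 8-byte buffer with no NUL terminator anywhere.
buf = ctypes.create_string_buffer(b"worldXYZ", 8)
base = ctypes.addressof(buf)

world = SizedCXString(base, 5)
tail = SizedCXString(base + 5, 3)   # ends exactly at the end of the buffer
print(to_bytes(world))
print(to_bytes(tail))
```

With an explicit length, neither read needs a terminator, so the one-past-the-
end check disappears entirely.
</pre>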

        </div>

      </p>


    </body>

</html>