[lldb-dev] [llvm-dev] Adding DWARF5 accelerator table support to llvm

Tue Jan 30 08:20:29 PST 2018

> On Jan 30, 2018, at 7:49 AM, Pavel Labath <labath at google.com> wrote:
> 
> On 30 January 2018 at 15:41, Adrian Prantl <aprantl at apple.com> wrote:
>> 
>> 
>>> On Jan 30, 2018, at 7:35 AM, Pavel Labath <labath at google.com> wrote:
>>> 
>>> Hello all,
>>> 
>>> I am looking for feedback regarding implementation of the case folding
>>> algorithm for .debug_names hashes.
>>> 
>>> Unlike the apple tables, the .debug_names hashes are computed from
>>> case-folded names (to enable case-insensitive lookups for languages
>>> where that makes sense). The DWARF5 document specifies that the case
>>> folding should be done according to the "Caseless matching" Section
>>> of the Unicode standard (whose implementation is basically a long list
>>> of special cases). While certainly possible, implementing this would
>>> be much more complicated (and would probably make the code a bit
>>> slower) than a simple tolower(3) call. And the benefits of this are
>>> not really clear to me.
>> 
>> Assuming a UTF-8 encoding, will tolower(3) destroy any non-ASCII characters in the process? In Swift, for example, we allow a wide range of unicode characters in identifiers and I want to make sure that this doesn't cause any problems.
>> 
> 
> I'm not sure what it will do out-of-the-box, but I could certainly
> implement it such that it does not touch the fancy characters.
> 
> However, if we already have unicode characters in the input, then it
> may make sense to go all the way and implement the full folding
> algorithm. Because, once we start producing hashes like this, it will
> be hard to switch to being fully standard-compliant (as that would
> invalidate the existing hashes).
> 
> But the question then is: can I assume the input names will be unicode
> (w/utf8 encoding)?

We can make that happen and encode it explicitly in each compile unit:

> 3.1.1 Full and Partial Compilation Unit Entries
> ...
> A DW_AT_use_UTF8 attribute, which is a flag whose presence indicates that all strings (such as the names of declared entities in the source program, or filenames in the line number table) are represented using the UTF-8 representation. 

-- adrian
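
For reference, the simpler option discussed above (lowercase ASCII only and
leave the "fancy" characters alone) could look roughly like the sketch below.
It assumes the names are UTF-8 and that the folded bytes are fed to the
DJB-style hash (start at 5381, multiply by 33, add each byte). The function
names foldAsciiOnly and caseFoldedDjbHash are made up for illustration and are
not existing LLVM APIs. This is not the Unicode caseless matching the DWARF5
document asks for, so hashes produced this way would differ from
standard-compliant ones for names containing non-ASCII uppercase letters.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: lowercase only ASCII 'A'..'Z'. Every byte of a UTF-8
// multi-byte sequence is >= 0x80, so those bytes never match the range check
// and are copied through unchanged.
std::string foldAsciiOnly(std::string_view Name) {
  std::string Folded(Name);
  for (char &C : Folded)
    if (C >= 'A' && C <= 'Z')
      C = static_cast<char>(C - 'A' + 'a');
  return Folded;
}

// Hypothetical helper: DJB-style hash over the ASCII-folded bytes.
// Unsigned 32-bit arithmetic wraps around on overflow, which is intended.
std::uint32_t caseFoldedDjbHash(std::string_view Name) {
  std::uint32_t H = 5381;
  for (unsigned char C : foldAsciiOnly(Name))
    H = H * 33 + C;
  return H;
}
```

With this, "FOO" and "foo" hash to the same value, while a name such as "É"
keeps its original bytes and is not matched against "é". That gap is exactly
what the full folding algorithm would close.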