[clang] [Clang] Improve EmitClangAttrSpellingListIndex (PR #114899)

Thu Nov 7 06:35:16 PST 2024

erichkeane wrote:

> Hi @erichkeane, thanks for your detailed review and comments.
> 
> I had some thoughts about your optimized approach which I would like to share.
> 
> I agree that falling back to full string comparisons is "safe" but non-optimal. However, performing "first different character" comparison might get complicated if you have more than 2 different spelling names with the same size.
> 
> An example case is `AT_NoSanitizeSpecific` with - "no_sanitize_thread" and "no_sanitize_memory" (size 18). In this case it is easy to check for the differing character by using `std::mismatch`. However, a case where there are 3 such strings (like "aaaba", "aaacb", "aaabc") would involve generation of a much more complex conditional. Although rare, I'm not sure if we can ignore this situation. We may sacrifice readability of the emitter code / debuggability of the `.inc` file at the cost of minimal optimizations (full string comparisons now happen in only 6 cases).
> 
> I'm happy to add the FIXME note, but would like to hear what you think.

So the conditional actually doesn't have to be all that difficult.  You need N-1 comparisons to differentiate (though you m ight end up with duplicates).  For example, in your 3 strings example if differentiating the first one, my algorithm would generate Str.size() == 5 && Str[3] == 'b' && Str[3] == 'b' (since that is the difference character for each).  It is simple enough join with an 'and' and a loop, as long as we don't care about duplicates.

As far as debugability/readability of emitted code: it isn't really a concern here.  Once we get the tablegen 'right' (which traditionally doesn't take much effort/testing), no one ever gets into here again.  The spelling-identification one uses a trie, and it is a giant mess of undebugability :) 

> 
> To address a previous comment -
> 
> > This seems to have missed a bunch, see alloc_size and allocating and alloc_align, all of which are the same spellings, but you're checking the name anyway (in the .inc file you linked).
> 
> I'm confident that my earlier approach did not perform string comparisons for "alloc_size", "allocating" and "alloc_align". The 2 tests that were failing were because I assumed that spellings with the same name would have consecutive indices. This failed for `AT_unused`. But I concede that the earlier loop did not leverage the `.size()` for a more optimal comparison :) This way is definitely better.
> 

Got it!  It wasn't clear based on the old implementation what was happening, just from the GIST that something wasn't working.

> [Gist](https://gist.github.com/chinmaydd/1733a9a84932a45678c47f31be4d64d7) contains updated output.

That GIST looks about right.  I see as you mentioned, ~6 different comparisons, which isn't too bad.  Definitely an improvement.

https://github.com/llvm/llvm-project/pull/114899