[llvm] [llvm-rc] Concatenate consecutive string tokens (PR #68685)

Tue Oct 10 09:21:13 PDT 2023

mstorsjo wrote:

> > Is memory allocation in the tokenizer a no-go?
> 
> Practically speaking - it would most probably be fine; the amounts of data processed by llvm-rc are minuscule anyway. But that would probably be a much bigger refactoring, something I didn't really want to get into for this somewhat small feature.

After thinking more about ways to make this better: you’re right. To make this solution robust, we would probably need to re-invoke the original tokenizer when splitting the string at the end, and that’s not very elegant.

I considered whether we could extend `RCToken` to contain a vector of `StringRef`s, but I think that would become quite messy for everything that isn’t a string token. Or what do you think?

I also considered concatenating only the payloads of the string literals when appending them to the existing token. However, that’s problematic: we currently need to keep the `L""` or `""` around the strings until the end, since we need to know whether the data was defined as a long string or not.
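To illustrate why the quoting has to survive, here is a minimal standalone sketch of how a consumer could tell a long string from a regular one. `LiteralInfo` and `classifyLiteral` are hypothetical names for illustration, not existing llvm-rc code, and the sketch uses `std::string_view` in place of `StringRef`. It only looks for an uppercase `L` prefix, which is a simplifying assumption.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper type, not part of llvm-rc.
struct LiteralInfo {
  bool IsLong;              // true when the literal was written as L"..."
  std::string_view Payload; // text between the quotes, escapes left as-is
};

// Inspect one fully quoted literal. Once the L prefix and the quotes are
// dropped, the long/regular distinction can no longer be recovered.
LiteralInfo classifyLiteral(std::string_view Tok) {
  bool IsLong = !Tok.empty() && Tok.front() == 'L';
  if (IsLong)
    Tok.remove_prefix(1);
  assert(Tok.size() >= 2 && Tok.front() == '"' && Tok.back() == '"');
  return {IsLong, Tok.substr(1, Tok.size() - 2)};
}
```

If only the payloads were concatenated up front, `IsLong` would have to be stored somewhere else for each piece.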

One compromise would be to keep the current split logic at the end, but instead of creating a `StringRef` that spans the input file across all the sections, allocate an extra string buffer and append the fully quoted string literals there. It’s essentially the same solution as the current PR, but with only controlled whitespace between the string literals. It still requires some parsing at the end to split them, but it should be at least marginally less brittle than what I have right now.
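A rough standalone sketch of that compromise, to make the trade-off concrete. `appendLiteral` and `splitLiterals` are hypothetical names, not code from the PR, and `std::string` / `std::string_view` stand in for the LLVM types. The sketch assumes that a doubled quote (`""`) inside a literal is the only escape that needs handling when splitting, and that exactly one space separates literals in the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Append one fully quoted literal (e.g. "abc" or L"abc") to Buffer,
// separated from the previous literal by exactly one space.
void appendLiteral(std::string &Buffer, std::string_view Quoted) {
  if (!Buffer.empty())
    Buffer.push_back(' ');
  Buffer.append(Quoted);
}

// Split Buffer back into the individual quoted literals. The returned
// views point into Buffer, so Buffer must outlive them.
std::vector<std::string_view> splitLiterals(std::string_view Buffer) {
  std::vector<std::string_view> Parts;
  std::size_t Pos = 0;
  while (Pos < Buffer.size()) {
    std::size_t Start = Pos;
    if (Buffer[Pos] == 'L') // optional long-string prefix
      ++Pos;
    assert(Pos < Buffer.size() && Buffer[Pos] == '"');
    ++Pos; // skip the opening quote
    while (Pos < Buffer.size()) {
      if (Buffer[Pos] == '"') {
        // Assumption: "" inside a literal is an escaped quote.
        if (Pos + 1 < Buffer.size() && Buffer[Pos + 1] == '"') {
          Pos += 2;
          continue;
        }
        break; // closing quote
      }
      ++Pos;
    }
    assert(Pos < Buffer.size() && "unterminated literal");
    ++Pos; // skip the closing quote
    Parts.push_back(Buffer.substr(Start, Pos - Start));
    if (Pos < Buffer.size()) {
      assert(Buffer[Pos] == ' '); // the single controlled separator
      ++Pos;
    }
  }
  return Parts;
}
```

Because the separator is always a single space written by `appendLiteral`, the splitter never has to deal with comments, newlines, or other tokens between literals. It still duplicates a small part of the tokenizer's quote handling.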

WDYT about the alternatives above?

https://github.com/llvm/llvm-project/pull/68685