[clang] [llvm] [HLSL][RootSignature] Implement parsing of `RootFlags` (PR #121799)

Tue Jan 7 13:49:09 PST 2025

llvm-beanz wrote:

> I didn't implement the tokenizer because I found that the extra level of abstraction to be redundant/not beneficial with the StringRef operations.

I disagree. In general, string operations scattered through a parser tend to muddy the parsing logic and can lead to inefficiencies.

> Looking at the DXC implementation, the usage of the Tokenizer is either GetAndMatchToken, or, getToken for an identifier with a switch on a small subset of tokens. These are effectively just StringRef::consume_front/StringSwitch with the buffer abstracted into the Tokenizer.

I don't think DXC's implementation is a good reference.

> Since we can just go through the buffer from left to right and construct the RootElements in place, then we will not reference a previous token, and so, defining/lexing an intermediate Token seems redundant.
> 
> What aspects are you referring to that would warrant it?

In general, we should lex tokens once, transform them to enums and associated state and move on. Let's take this root signature as an example:

```
DescriptorTable(CBV(b0, space=1))
```

This becomes a token stream something like:
* `DescriptorTable` - keyword
* `(` - lparen
* `CBV` - keyword
* `(` - lparen
* `b0` - register
* `,` - comma
* `space` - keyword
* `=` - equal
* `1` - number
* `)` - rparen
* `)` - rparen
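
A minimal sketch of a lexer that produces a stream like the one above, assuming C++17 and only the standard library. The names `TokenKind`, `Token`, `classifyWord`, and `lex` are hypothetical and are not taken from the PR, DXC, or Clang; the keyword set is limited to what this one example needs.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical token kinds, for this example only.
enum class TokenKind {
  kw_DescriptorTable, kw_CBV, kw_space, // keywords
  l_paren, r_paren, comma, equal,       // punctuators
  reg, number, invalid
};

struct Token {
  TokenKind Kind;
  std::string_view Text; // original spelling, kept for diagnostics
};

inline bool isDigitChar(char C) {
  return std::isdigit(static_cast<unsigned char>(C)) != 0;
}

// Classify an alphanumeric word as a keyword, a register, or invalid.
inline TokenKind classifyWord(std::string_view W) {
  if (W == "DescriptorTable") return TokenKind::kw_DescriptorTable;
  if (W == "CBV") return TokenKind::kw_CBV;
  if (W == "space") return TokenKind::kw_space;
  // Register: one of b/t/u/s followed by one or more digits.
  if (W.size() >= 2 &&
      std::string_view("btus").find(W[0]) != std::string_view::npos) {
    for (char C : W.substr(1))
      if (!isDigitChar(C))
        return TokenKind::invalid;
    return TokenKind::reg;
  }
  return TokenKind::invalid;
}

// Lex the whole buffer once; a parser would then only switch on Kind.
inline std::vector<Token> lex(std::string_view Buf) {
  std::vector<Token> Toks;
  std::size_t I = 0;
  while (I < Buf.size()) {
    unsigned char C = static_cast<unsigned char>(Buf[I]);
    if (std::isspace(C)) { ++I; continue; }
    std::size_t Start = I;
    TokenKind K = TokenKind::invalid;
    if (std::isalpha(C)) {
      while (I < Buf.size() &&
             std::isalnum(static_cast<unsigned char>(Buf[I])))
        ++I;
      K = classifyWord(Buf.substr(Start, I - Start));
    } else if (std::isdigit(C)) {
      while (I < Buf.size() && isDigitChar(Buf[I]))
        ++I;
      K = TokenKind::number;
    } else {
      ++I; // single-character punctuator (or an invalid character)
      switch (C) {
      case '(': K = TokenKind::l_paren; break;
      case ')': K = TokenKind::r_paren; break;
      case ',': K = TokenKind::comma; break;
      case '=': K = TokenKind::equal; break;
      default: break;
      }
    }
    Toks.push_back({K, Buf.substr(Start, I - Start)});
  }
  return Toks;
}
```

Calling `lex` on the example string yields one `Token` per bullet above. The string comparisons live in `classifyWord` and nowhere else, so a parser built on top of this would not need to touch the buffer.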

Having a token representation where keywords and grammar tokens are converted to enumerations prevents having string or character operations throughout the parser. This is in line with Clang's tokenizer design, and seems like something we should also match.

Having the tokenizer also be able to pre-parse numbers and register tokens into constituent parts ensures that the lexing errors are simple to emit and occur where expected.
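
As a rough illustration of pre-parsing a register token into its constituent parts, here is a sketch assuming C++17 and `std::from_chars`. `RegisterClass`, `RegisterValue`, `LexError`, and `parseRegister` are hypothetical names, and the three error cases are only an assumption about which diagnostics a lexer might want to distinguish.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical types, for illustration only.
enum class RegisterClass { B, T, U, S };

struct RegisterValue {
  RegisterClass Class;
  std::uint32_t Number;
};

enum class LexError { None, BadRegisterClass, BadNumber, Overflow };

// Split a spelling such as "b0" into a class and a number, reporting the
// failure reason at lex time so the parser never sees a malformed register.
inline std::optional<RegisterValue> parseRegister(std::string_view Text,
                                                  LexError &Err) {
  Err = LexError::None;
  if (Text.empty()) {
    Err = LexError::BadRegisterClass;
    return std::nullopt;
  }
  RegisterClass Class;
  switch (Text.front()) {
  case 'b': Class = RegisterClass::B; break;
  case 't': Class = RegisterClass::T; break;
  case 'u': Class = RegisterClass::U; break;
  case 's': Class = RegisterClass::S; break;
  default:
    Err = LexError::BadRegisterClass;
    return std::nullopt;
  }
  std::uint32_t Number = 0;
  const char *First = Text.data() + 1;
  const char *Last = Text.data() + Text.size();
  auto [Ptr, Ec] = std::from_chars(First, Last, Number);
  if (Ec == std::errc::result_out_of_range) {
    Err = LexError::Overflow; // digits do not fit in 32 bits
    return std::nullopt;
  }
  if (Ec != std::errc() || Ptr != Last) {
    Err = LexError::BadNumber; // no digits, or trailing non-digit characters
    return std::nullopt;
  }
  return RegisterValue{Class, Number};
}
```

With this shape, `parseRegister("b0", Err)` produces class `B` and number `0`, while inputs such as `"b"` or `"t4294967296"` fail with a specific `LexError` that can be mapped to a diagnostic at the token's location.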

The lexing rules for HLSL are pretty simple. I would probably write the lexer in an iterator pattern and just have a token iterator that walks token to token with a small copyable state. That would allow lookahead where necessary. You don't need to design this the way I would; Clang's model of the "Parser" preserving the current lexer state is also reasonable (that is more similar to how DXC implements this).
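
One possible shape for the iterator idea, sketched under the assumption that the whole lexer state is a `std::string_view` of the unlexed remainder plus the current token. `TokenCursor`, `Tok`, `Kind`, and `peekNext` are hypothetical names, and the token kinds are deliberately coarse to keep the example short.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Coarse, hypothetical token kinds; a real lexer would be more specific.
enum class Kind { word, number, punct, end };

struct Tok {
  Kind K;
  std::string_view Text;
};

// The entire state is two string_views and an enum, so copies are cheap.
class TokenCursor {
  std::string_view Rest; // unlexed remainder of the buffer
  Tok Cur{Kind::end, {}};

  static bool isA(int (*Pred)(int), char C) {
    return Pred(static_cast<unsigned char>(C)) != 0;
  }

public:
  explicit TokenCursor(std::string_view Buf) : Rest(Buf) { advance(); }

  const Tok &current() const { return Cur; }

  // Lex exactly one token from the front of Rest.
  void advance() {
    while (!Rest.empty() && isA(std::isspace, Rest.front()))
      Rest.remove_prefix(1);
    if (Rest.empty()) {
      Cur = {Kind::end, {}};
      return;
    }
    std::size_t N = 1;
    Kind K = Kind::punct;
    if (isA(std::isalpha, Rest.front())) {
      K = Kind::word;
      while (N < Rest.size() && isA(std::isalnum, Rest[N]))
        ++N;
    } else if (isA(std::isdigit, Rest.front())) {
      K = Kind::number;
      while (N < Rest.size() && isA(std::isdigit, Rest[N]))
        ++N;
    }
    Cur = {K, Rest.substr(0, N)};
    Rest.remove_prefix(N);
  }
};

// Lookahead: take the cursor by value, advance the copy, and leave the
// caller's cursor where it was.
inline Tok peekNext(TokenCursor C) {
  C.advance();
  return C.current();
}
```

Because the cursor is passed by value, `peekNext` cannot disturb the caller's position. Deeper lookahead would be the same copy advanced more than once.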

In either case, abstracting string and pointer manipulation is really important. If we add new root signature keywords I shouldn't need to add new logic for string comparisons. I would look at how Clang's TokenKinds.def defines keyword and punctuator tokens and I would look to define our parser similarly.
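
To make the TokenKinds.def suggestion concrete, here is a sketch of the same X-macro technique. Clang keeps its list in a separate `.def` file that is included several times under different macro definitions; to stay self-contained, this example uses a single list macro instead. `ROOTSIG_TOKENS`, `RSTokenKind`, `keywordKind`, and `punctuatorKind` are hypothetical names, not anything from the PR.

```cpp
#include <string_view>

// Single source of truth for tokens. Adding a keyword means adding one line.
#define ROOTSIG_TOKENS(KEYWORD, PUNCTUATOR)                                    \
  KEYWORD(DescriptorTable)                                                     \
  KEYWORD(CBV)                                                                 \
  KEYWORD(space)                                                               \
  PUNCTUATOR(l_paren, '(')                                                     \
  PUNCTUATOR(r_paren, ')')                                                     \
  PUNCTUATOR(comma, ',')                                                       \
  PUNCTUATOR(equal, '=')

// Expansion 1: the enumeration itself.
enum class RSTokenKind {
#define KW(Name) kw_##Name,
#define PUNCT(Name, Ch) Name,
  ROOTSIG_TOKENS(KW, PUNCT)
#undef KW
#undef PUNCT
  unknown
};

// Expansion 2: keyword spelling to enumerator. No hand-written comparisons.
inline RSTokenKind keywordKind(std::string_view Spelling) {
#define KW(Name)                                                               \
  if (Spelling == #Name)                                                       \
    return RSTokenKind::kw_##Name;
#define PUNCT(Name, Ch)
  ROOTSIG_TOKENS(KW, PUNCT)
#undef KW
#undef PUNCT
  return RSTokenKind::unknown;
}

// Expansion 3: punctuator character to enumerator.
inline RSTokenKind punctuatorKind(char C) {
#define KW(Name)
#define PUNCT(Name, Ch)                                                        \
  if (C == Ch)                                                                 \
    return RSTokenKind::Name;
  ROOTSIG_TOKENS(KW, PUNCT)
#undef KW
#undef PUNCT
  return RSTokenKind::unknown;
}
```

In this sketch, adding a hypothetical `KEYWORD(SRV)` line to the list would extend both the enum and `keywordKind` without any new comparison logic, which is the property asked for above.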

https://github.com/llvm/llvm-project/pull/121799