[PATCH] D101444: [demangler] Initial support for the new Rust mangling scheme

Tomasz Miąsko via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Apr 30 05:46:35 PDT 2021


tmiasko added a comment.

@dblaikie thanks for review!



================
Comment at: llvm/include/llvm/Demangle/RustDemangle.h:8
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_DEMANGLE_RUSTDEMANGLE_H
----------------
dblaikie wrote:
> Got a link to a spec for the mangling scheme that could go somewhere up here? (and/or in the .cpp file)
Included a link to a specification in the .cpp file.


================
Comment at: llvm/include/llvm/Demangle/RustDemangle.h:31
+  // Maximum recursion level. Used to avoid stack overflow.
+  const size_t MaxRecursionLevel;
+  // Current recursion level.
----------------
dblaikie wrote:
> Does the itanium demangler have a similar recursion limit handler? I didn't find one while quickly glancing around - would be good to understand any differences/tradeoffs here.
In Rust mangling scheme, backreferences are given as offsets in the mangled symbol. Without any limits or additional protections, it would be easy to create a symbol that would lead to a stack overflow during demangling. I would consider a recursion limit to be necessary for fuzzing.

As far as I understand this isn't a problem for itanium demangler, where substitutions cannot refer to components that are yet to be demangled. Although the implementation has some form of protection against stack overflow in the form of `Printing` field.


================
Comment at: llvm/include/llvm/Demangle/RustDemangle.h:35
+
+  // Input string that is being demangled (without "_R" prefix if any).
+  StringView Input;
----------------
dblaikie wrote:
> This comment seems incorrect - the `demangle` function is passed a string that includes the "_R" prefix, and parses that before doing anything else?
I reorganized the code so that input is assigned only after removing the prefix.

The motivation for this becomes more apparent later; backreferences start from the end of the prefix.


================
Comment at: llvm/include/llvm/Demangle/RustDemangle.h:49
+
+  bool demangle(const char *Mangled, size_t MangledLen);
+
----------------
dblaikie wrote:
> Looks like this part of the API isn't designed to be drop-in compatible with the itanium demangler API, yeah? (because the itanium one takes the buffer in the ctor rather than in a separate function, and takes the buffer as start/end rather than start/length)
> 
> If that's the case - any particular reason for this divergence and if we're diverging anyway, perhaps this could use a StringView parameter type rather than a decomposed start+length?
I expect that rustDemangle will be the entry point, so there are no compatibility concerns here.

Replaced the arguments with a StringView.


================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:59-62
+      std::free(Demangled);
+      Demangled = Buf;
+    } else {
+      std::free(Buf);
----------------
dblaikie wrote:
> Any chance of using less raw memory management here? (been some recent discussions about the complexities of memory management in the itanium demangler that, if possible, it'd be good to avoid ( https://reviews.llvm.org/D100800 ) - but if it's all wrapped up in the "wanting to conform to the same API" (ie: the API design itself is awkward, but compatibility trumps fixing that awkwardness) - such is life)
The motivation was to be compatible with itanium demangler API. I feel that interface will be easier to integrate.

If you see trade-offs differently, I am open to changing it to avoid manual memory management. Just consider how a different API would fit in in places where itaniumDemangle is used right now. 

Internally, for the demangler itself, it is of little significance if demangled output is appended to an OutputStream or say an std::string.


================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:93
+//
+// <symbol-name> = "_R" [<decimal-number>] <path> [<instantiating-crate>]
+bool Demangler::demangle(const char *Mangled, size_t MangledLength) {
----------------
dblaikie wrote:
> The decimal-number and instantiating-crate parts aren't implemented in this patch, I take it? Could you include a FIXME that mentions that?
Added a FXIME for <instantiating-crate>, and removed [<decimal-number>] which specification reserves for an encoding version. Encoding version is not in use currently, so there is no reason to parse it.


================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:140-141
+  case 'N': {
+    const char NS = consume();
+    if (!isLower(NS) && !isUpper(NS)) {
+      Error = true;
----------------
dblaikie wrote:
> Is this correct/am I understanding this correctly, that some parsed components are not rendered in the demangling? Is that intentional/valid - that multiple mangled names could demangle to the same string/be ambiguous? Or is this a case of some feature not being implemented yet? 
Yes, some components are not shown.

In the case here we have two kinds of namespaces: implementation internal namespaces using lowercase letters, and special well-known namespaces using uppercase letters. The former are not intended to be shown. The latter currently includes closures `C` and shims `S` and are shown by the existing demanglers.

This is yet to be implemented. I added a FIXME.


================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:168
+  const uint64_t Bytes = parseDecimalNumber();
+  consumeIf('_');
+
----------------
dblaikie wrote:
> This is here in case the decimal-number and initial bytes value would be ambiguous?
Exactly. "_" resolves an ambiguity in the case identifier itself starts with a digit or "_". Added a comment reflecting that.


================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:177-182
+  for (const char C : S) {
+    if (!isValid(C)) {
+      Error = true;
+      return {};
+    }
+  }
----------------
dblaikie wrote:
> 
I would find it convenient to leave this unaddressed for the time being. Currently implementation has no build time dependency on any generated headers or libraries, and using STLExtras.h introduces both.


================
Comment at: llvm/test/Demangle/rust.test:3-4
+
+CHECK: a::main
+       _RNvC1a4main
+
----------------
dblaikie wrote:
> I'd have a bit of an easier time following the test cases if there was a comment describing the parse tree - would that be worth including? 
I tried a few different notations, but even for short symbols they feel too verbose to be included in tests. For example:

```
_RNvNtCshGpAVYOtgW1_1a1b1c 

a::b::c

(symbol-name "_R"
  (path "N" (namespace "v")
    (path "N" (namespace "t")
      (path "C" (identifier (disambiguator "s" (base-62-number "hGpAVYOtgW1_")) "1a"))
      (identifier "1b"))
    (identifier "1c")))
```

I also have another parser that produces output as below, if that would be helpful I could publish it:

```
"" <symbol-name>
"_R" [<decimal-number>] <path> [<instantiating-crate>]
"_R" <path> [<instantiating-crate>]
"_R" "N" <namespace> <path> <identifier> [<instantiating-crate>]
"_RN" <namespace> <path> <identifier> [<instantiating-crate>]
"_RNv" <path> <identifier> [<instantiating-crate>]
"_RNv" "N" <namespace> <path> <identifier> <identifier> [<instantiating-crate>]
"_RNvN" <namespace> <path> <identifier> <identifier> [<instantiating-crate>]
"_RNvNt" <path> <identifier> <identifier> [<instantiating-crate>]
"_RNvNt" "C" <identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtC" <identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtC" [<disambiguator>] <undisambiguated-identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtC" ["s" <base-62-number>] <undisambiguated-identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCs" <base-62-number> <undisambiguated-identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_" <undisambiguated-identifier> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_" ["u"] <decimal-number> ["_"] <bytes> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_" <decimal-number> ["_"] <bytes> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1" ["_"] <byte> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1" <byte> <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a" <identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a" [<disambiguator>] <undisambiguated-identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a" <undisambiguated-identifier> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a" ["u"] <decimal-number> ["_"] <bytes> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a" <decimal-number> ["_"] <bytes> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1" ["_"] <byte> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1" <byte> <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b" <identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b" [<disambiguator>] <undisambiguated-identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b" <undisambiguated-identifier> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b" ["u"] <decimal-number> ["_"] <bytes> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b" <decimal-number> ["_"] <bytes> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b1" ["_"] <byte> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b1" <byte> [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b1c" [<instantiating-crate>]
"_RNvNtCshGpAVYOtgW1_1a1b1c" [<path>]
"_RNvNtCshGpAVYOtgW1_1a1b1c"
```



================
Comment at: llvm/test/Demangle/rust.test:25-31
+; Unsupported mangling version
+
+CHECK: _R0NvC1a4main
+       _R0NvC1a4main
+
+CHECK: _R1NvC1a4main
+       _R1NvC1a4main
----------------
dblaikie wrote:
> Maybe this comment could be more descriptive - currently the implementation doesn't support any mangling version? (is the plan to support this currently, or is the version field purely forward-looking/for future use?) Maybe the comment could read "Only the null mangling version is currently supported" or something like that?
> 
> I'd probably only test 0 if that's the case - doesn't seem like 1 adds especially more coverage? But don't mind.
Added a comment explaining the situation with encoding version, and removed other test cases.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D101444/new/

https://reviews.llvm.org/D101444



More information about the llvm-commits mailing list