[PATCH] D102729: [Demangle][Rust] Parse binders

Sat May 22 05:10:31 PDT 2021

tmiasko added inline comments.

================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:471-476
+    if (consumeIf('L')) {
+      if (auto Lifetime = parseBase62Number()) {
+        printLifetime(Lifetime);
+        print(' ');
+      }
+    }
----------------
dblaikie wrote:
> Might be nice to have this written once/reused (rather than duplicated here between 'R' and 'Q' for instance) - not sure if it can be generalized over the use in `demangleGenericArg` too - all of them use the <lifetime> Production, but there's no common implementation of/function for that production - maybe there could be?
I reused the implementation between `R` and `Q` productions.

================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:549
+  // Use the input size as a weak upper bound on the total number of bound
+  // lifetimes. Inputs exceeding the bound are invalid, but could generate
+  // excessive amounts of output.
----------------
dblaikie wrote:
> Judging by the phrasing, was this meant to say "Inputs exceeding the bound are /valid/" rather than invalid?
> 
> Maybe it'd be more suitable to have a fixed upper bound on the size of a demangled name in general instead of these special cases? I guess that could cover the recursion and this bound on bound lifetimes... 
> 
> That might be more predictable for users and for testing, etc.
I reworded the comment and added an explanation why those inputs are invalid.

I think we need at least two different limitation mechanisms. A recursion limit to avoid unbounded recursion once we start parsing backreferences, especially those that don't produce any output. And another limit to handle large output.

I decided to check for invalid binders specifically since they are uniquely problematic; short strings like `_RIC5crateFGAAAAA_EuE` already generate gigabytes of output. The proposed bound rejects only invalid inputs, so it shouldn't be surprising for users in practice.

A general limit on the output size would be alternative option. Although it requires answering the hard question what the limit should be.

================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:707
 // Parses optional base 62 number. The presence of a number is determined using
 // Tag. Returns 0 when tag is absent and parsed value + 1 otherwise.
 uint64_t Demangler::parseOptionalBase62Number(char Tag) {
----------------
dblaikie wrote:
> This seems like a really subtle and confusing API, especially given the existing "+ 1" behavior of the Base 62 encoding. I've been staring at the code in demangleFnSig+demangleBinder to figure out the first binder test case for a while before I realized that.
> 
> Any chance this could use uintmax for the invalid value, rather than offsetting the result in this way?
There are two layers of `+1` which makes it quite confusing, but this is intended encoding of optional base62 numbers.

================
Comment at: llvm/test/Demangle/rust.test:237-238
+
+CHECK: binders::<for<'a> fn(&'a _)>
+       _RIC7bindersFG_RL0_pEuE
+
----------------
dblaikie wrote:
> I'm having a hard time understanding this test in the context of the spec. 
> 
> The spec says that `_` encodes zero. So the `G_` should be zero, right? Is there somewhere in the spec that says that `<binder>` mangled value is then 1 larger than that? (so a binder value 0 (`_`) is actually for 1 binder)?
This looks like an omission in the specification The binder production should be marked as optional, and then the presence of `G` participates in the encoding of the value in analogous way to `s` in disambiguators:

> If the disambiguator is not present, this index is 0. If it is of the form s_ then the index is 1. If it is of the form s<base-62-digit>_ then the index is <base-62-digit> + 2.

Does this clarify the situation?

I will look into updating the specification.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D102729/new/

https://reviews.llvm.org/D102729