[PATCH] D102729: [Demangle][Rust] Parse binders

David Blaikie via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon May 24 17:09:14 PDT 2021


dblaikie added inline comments.


================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:471-476
+    if (consumeIf('L')) {
+      if (auto Lifetime = parseBase62Number()) {
+        printLifetime(Lifetime);
+        print(' ');
+      }
+    }
----------------
tmiasko wrote:
> dblaikie wrote:
> > Might be nice to have this written once/reused (rather than duplicated here between 'R' and 'Q' for instance) - not sure if it can be generalized over the use in `demangleGenericArg` too - all of them use the <lifetime> Production, but there's no common implementation of/function for that production - maybe there could be?
> I reused the implementation between `R` and `Q` productions.
any chance of having a "demangleLifetime" that can be used in "demangleGenericArg" as well as here? (nice to have a function for each production, so it's easier to find the implementation of that production in one place)


================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:549
+  // Use the input size as a weak upper bound on the total number of bound
+  // lifetimes. Inputs exceeding the bound are invalid, but could generate
+  // excessive amounts of output.
----------------
tmiasko wrote:
> dblaikie wrote:
> > Judging by the phrasing, was this meant to say "Inputs exceeding the bound are /valid/" rather than invalid?
> > 
> > Maybe it'd be more suitable to have a fixed upper bound on the size of a demangled name in general instead of these special cases? I guess that could cover the recursion and this bound on bound lifetimes... 
> > 
> > That might be more predictable for users and for testing, etc.
> I reworded the comment and added an explanation why those inputs are invalid.
> 
> I think we need at least two different limitation mechanisms. A recursion limit to avoid unbounded recursion once we start parsing backreferences, especially those that don't produce any output. And another limit to handle large output.
> 
> I decided to check for invalid binders specifically since they are uniquely problematic; short strings like `_RIC5crateFGAAAAA_EuE` already generate gigabytes of output. The proposed bound rejects only invalid inputs, so it shouldn't be surprising for users in practice.
> 
> A general limit on the output size would be alternative option. Although it requires answering the hard question what the limit should be.
Fair enough - thanks for the details!


================
Comment at: llvm/lib/Demangle/RustDemangle.cpp:707
 // Parses optional base 62 number. The presence of a number is determined using
 // Tag. Returns 0 when tag is absent and parsed value + 1 otherwise.
 uint64_t Demangler::parseOptionalBase62Number(char Tag) {
----------------
tmiasko wrote:
> dblaikie wrote:
> > This seems like a really subtle and confusing API, especially given the existing "+ 1" behavior of the Base 62 encoding. I've been staring at the code in demangleFnSig+demangleBinder to figure out the first binder test case for a while before I realized that.
> > 
> > Any chance this could use uintmax for the invalid value, rather than offsetting the result in this way?
> There are two layers of `+1` which makes it quite confusing, but this is intended encoding of optional base62 numbers.
Fair enough - maybe some more quotations from the spec to explain this would be useful?


================
Comment at: llvm/test/Demangle/rust.test:237-238
+
+CHECK: binders::<for<'a> fn(&'a _)>
+       _RIC7bindersFG_RL0_pEuE
+
----------------
tmiasko wrote:
> dblaikie wrote:
> > I'm having a hard time understanding this test in the context of the spec. 
> > 
> > The spec says that `_` encodes zero. So the `G_` should be zero, right? Is there somewhere in the spec that says that `<binder>` mangled value is then 1 larger than that? (so a binder value 0 (`_`) is actually for 1 binder)?
> This looks like an omission in the specification The binder production should be marked as optional, and then the presence of `G` participates in the encoding of the value in analogous way to `s` in disambiguators:
> 
> > If the disambiguator is not present, this index is 0. If it is of the form s_ then the index is 1. If it is of the form s<base-62-digit>_ then the index is <base-62-digit> + 2.
> 
> Does this clarify the situation?
> 
> I will look into updating the specification.
Ah, OK - yeah, spec update and some more words in the implementation (see other comment) would probably be handy. Thanks!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D102729/new/

https://reviews.llvm.org/D102729



More information about the llvm-commits mailing list