[PATCH] D78796: [Support] Refactor LEB128 encoding into an input iterator

Nicolas Guillemot via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 3 17:34:47 PST 2021


nlguillemot marked an inline comment as done.
nlguillemot added inline comments.


================
Comment at: llvm/include/llvm/Support/LEB128.h:26-27
+template <class ValueT> class LEB128InputIterator {
+  /// Denotes whether this object represents an end() iterator.
+  bool IsEnd;
+
----------------
dsanders wrote:
> We could fold this into Optional<uint8_t> CurrByte
Implement in the latest update: Removed the `IsEnd` member and turned `CurrByte` into an `Optional` instead, where `CurrByte == None` means the same thing as `IsEnd == true` did.


================
Comment at: llvm/include/llvm/Support/LEB128.h:32-33
+
+  /// Whether this iterator is outputting a signed or unsigned LEB128 data.
+  bool IsSigned;
+
----------------
dsanders wrote:
> It may be worth mentioning that this is needed for types that don't carry signedness like APInt. unsigned/int/etc. wouldn't need it
Added a mention of this in the comments.


================
Comment at: llvm/include/llvm/Support/LEB128.h:35-36
+
+  /// Whether there will be more output after the previously outputted byte.
   bool More;
+
----------------
nlguillemot wrote:
> dsanders wrote:
> > This is misleading as the iterator can produce more bytes after this becomes false. I think it's also unnecessary as each time encodeNextByte() shifts Value right by 7 it's bringing in the correct padding bits at the top. We could just keep reading from Value for the padding bytes.
> > his is misleading as the iterator can produce more bytes after this becomes false. 
> 
> For context, I named it that way because that's how it was named in the original implementation of the code. I agree it could be better.
> 
> 
Updated the name of `More` to make more sense and slightly simplified the logic surrounding it. Also the padding bits now come from the 7 least significant bits like you suggested.


================
Comment at: llvm/include/llvm/Support/LEB128.h:38-42
+  /// The output will be sext-ed/zext-ed to this number of bytes if necessary.
+  unsigned PadTo;
+
+  /// The current number of outputted bytes.
+  unsigned Count;
----------------
nlguillemot wrote:
> dsanders wrote:
> > Would it make sense to have a RemainingBytes that counts down to zero? Is there a reason to keep PadTo and Count separate?
> For context, the `Count` and `PadTo` variables are there to match the original implementation of the code. Maybe could use a refactoring.
I tried implementing `RemainingBytes` instead of `PadTo`/`Count`, but the problem is that we don't actually know how many bytes are remaining because the algorithm encodes LEB128 bytes one-by-one. `PadTo` is usually equal to 0 which results in having no padding.


================
Comment at: llvm/include/llvm/Support/LEB128.h:50
+    CurrByte = Value & 0x7f;
+    if (IsSigned) {
+      int64_t SValue = static_cast<int64_t>(Value);
----------------
nlguillemot wrote:
> mkitzan wrote:
> > dsanders wrote:
> > > Is testing this on each byte measurably slower? It seems unlikely but this is called a lot and I notice that the previous code didn't do it. If it does, it would be good if we can have the template instantiation pick one side or the other
> > `IsSigned` could easily be changed to a template parameter. Could then turn this is into `if constexpr (IsSigned)` (if you can use C++17), or just SFINAE it. Better yet, use [[ https://en.cppreference.com/w/cpp/types/is_signed | `is_signed` type trait ]] on `ValueT`.
> Last I heard, we have to support C++14, so no `if constexpr` fanciness allowed. :(
> 
> Using `is_signed` would work for types like `uint64_t` and `int64_t`, but not `APInt`, so we somehow need to figure out how to handle that case.
> 
> Seems like there's general agreement that it might as well be a template parameter though.
`IsSigned` is now a template parameter.


================
Comment at: llvm/include/llvm/Support/LEB128.h:80
+  /// Constructs the end-of-input iterator.
+  LEB128InputIterator() : IsEnd(true) {}
+
----------------
nlguillemot wrote:
> mkitzan wrote:
> > This constructor leaves other data members uninitialized which may be fine for an end iterator, but is a code smell [[ http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rc-complete | C.41 ]].
> I don't think this breaks that core guideline, since the invariants are set and the object is totally usable, though there should probably be an assert for `!IsEnd` at the start of `operator++` to enforce the API requirements.
> 
> On the other hand I agree it might be good to initialize the other members anyways, just to be on the safe side, and to avoid having a copy constructor that reads uninitialized memory. The alternative is to hand-write the copy constructor to avoid copying the other fields if `IsEnd` is set, which sounds ugly and less maintainable.
All members are now initialized to sensible values to avoid potential undefined behavior.


================
Comment at: llvm/include/llvm/Support/LEB128.h:93-95
+  explicit LEB128InputIterator(ValueT Value, bool IsSigned, unsigned PadTo)
+      : IsEnd(false), Value(std::move(Value)), IsSigned(IsSigned), PadTo(PadTo),
+        Count(0) {
----------------
mkitzan wrote:
> Would be a good idea to initialize `More` in the initializer list just in case something changes in the future and `More` is read before being initially assigned to in `encodeNextByte`. Could also apply to `CurrByte`...
All members are now initialized in all cases.


================
Comment at: llvm/include/llvm/Support/LEB128.h:121-136
+    // Pad with 0s or 1s depending on whether we want to zext or sext.
+    uint8_t PadValue;
+    if (IsSigned && static_cast<int64_t>(Value) < 0)
+      PadValue = 0x7f;
+    else
+      PadValue = 0x00;
+
----------------
dsanders wrote:
> I think this belongs inside encodeNextByte(). This operator should essentially be something like:
>   assert(CurrByte.hasValue() && "operator++() called on past-the-end LEB128InputIterator");
>   CurrByte = encodeNextByte();
>   return this;
> where CurrByte is Optional<uint8_t>. As noted in another comment, I think special-casing the padding bytes isn't really needed.
Moved the logic inside the core function and now the operator looks like what you described.


================
Comment at: llvm/include/llvm/Support/LEB128.h:132
+      // Add a continuation bit to signal that there is more padding after this.
+      if (Count < PadTo - 1)
+        CurrByte |= 0x80;
----------------
nlguillemot wrote:
> mkitzan wrote:
> > If `PadTo` is 0, may have unintentional underflow since it is `unsigned` (this may be you intention though, and if so it's worth a comment).
> > 
> > Seeing it a lot in the other code. Maybe underflow is part of the plan here?
> If `PadTo` is 0 then it's impossible for the check above of `if (Count < PadTo)` to pass, so we would never hit this line of code. This code should be safe at least. If there are other cases we should double-check them as well.
Avoided any potential underflow by switching the logic to `if (Count + 1 < PadTo)`


================
Comment at: llvm/include/llvm/Support/LEB128.h:155-157
+    // Both are end-of-input iterators, so they compare the same.
+    if (IsEnd && Other.IsEnd)
+      return true;
----------------
nlguillemot wrote:
> dsanders wrote:
> > I see no harm in allowing two past-the-end iterators from different sequences to be equal but I wonder if it's necessary/useful. With the ULEBifier object it could give you an end iterator and then this operator would be a plain equality comparison.
> The same design issue also happens with existing std input iterators like `std::istream_iterator`, so I think it's acceptable even if slightly odd.
After messing with it for a while I think it's simplest to keep it this way. Mainly because it simplifies the logic under it, since the member-by-member comparison doesn't have to worry about comparing the members of end iterators, or comparing the members of an end iterator with the members of a non-end iterator.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D78796/new/

https://reviews.llvm.org/D78796



More information about the llvm-commits mailing list