[libc-commits] [PATCH] D112176: [libc] fix strtol returning the wrong length

Thu Oct 21 21:13:55 PDT 2021

sivachandra accepted this revision.
sivachandra added inline comments.
This revision is now accepted and ready to land.

================
Comment at: libc/src/__support/str_conv_utils.h:45
 // Takes the address of the string pointer and parses the base from the start of
 // it. This will advance the string pointer.
 static inline int infer_base(const char *__restrict *__restrict src) {
----------------
Replace

```
This will advance the string pointer.
```

with

```
This function will advance |src| to the first valid digit in the inferred base.
```

================
Comment at: libc/src/__support/str_conv_utils.h:83
+    seen_digit = true;
+  }
+
----------------
michaelrj wrote:
> sivachandra wrote:
> > michaelrj wrote:
> > > sivachandra wrote:
> > > > Do we need this `if` block at all?
> > > it's to handle the case where the number is just "0" and the base is set to automatic. Without this, then that would be parsed as an octal number with no digits, as opposed to a decimal number with one digit. I've added a test to check this.
> > > The other part of the condition was unnecessary though.
> > May be `infer_base` is incorrect then? Consider this:
> > 
> > ```
> > const char *n = "08";
> > char *next;
> > long i = ::strtol(n, &next, 0);
> > ```
> > 
> > Since "08" is an invalid number, should we want the above `strtol`call  to succeed or a fail? If it is a failure, should `next` point to 8 or 0? Should it be the same if the base was explicitly specified to be 8?
> > 
> > I also think we can ask the same questions when base is 16.
> > 
> > My reading of the standard says that "08" is an invalid number so no conversion should be performed. Which means, `str_end` should point to `original_src`. Likewise, "0xZ" with base 16 is an invalid number so no conversion should be performed and `str_end` should point to `original_src`.
> `infer_base` was incorrect, and I have now fixed it so that it doesn't skip over the leading octal 0, and removed this condition.
> 
> From what I can tell, in the case of both "08" and "0xZ", when parsed with an automatic base, what we have is a solo octal 0 (and this matches what I see from other libc implementations). 
> 
> In the C standard §6.4.4.1 it specifies that "A decimal constant begins with a nonzero digit", meaning that those numbers cannot be assumed to be base 10, but that "An octal constant consists of the prefix 0 optionally followed by a sequence of the digits 0 through 7 only". This means that the longest valid interpretation of "08" is that it is an octal 0 followed by some value that is not a valid digit (equivalent to "0Z").
> 
> For "0xZ" it is similar, a hexadecimal number has to have a valid hexadecimal digit after the "0x" so this is just 0.
Ah, yes!

I am reading two different parts of the standard in isolation. 6.4.4.1 Has this text:

```
An octal constant consists of the prefix 0 optionally followed by a sequence of the
digits 0 through 7 only. A hexadecimal constant consists of the prefix 0x or 0X followed
by a sequence of the decimal digits and the letters a (or A) through f (or F) with values
10 through 15 respectively.
```

Which makes `08` an invalid literal. But, the spec for `strto<int>` functions also says:

```
decompose the input string into three parts: an initial, possibly empty, sequence of
white-space characters (as specified by the isspace function), a subject sequence
resembling an integer represented in some radix determined by the value of base, and a
final string of one or more unrecognized characters, including the terminating null
character of the input string.
```

So, in the string `"08"`, `'8'` is the first unrecognized character, so the string consists of only one valid base 8 character `"0"`. Combining the above two parts of the standard, `"08"` should be interpreted as just the octal zero. Which is what you are saying I think.

This same argument does not hold for `"0x"` or `"0xZ"` because just the prefix `0x` or the number `0xZ` are not hex numbers at all. But, the octal argument kicks in here as well: so `"0x"` is octal zero, and `"0xZ"` is also octal zero. I just checked what other libcs do and it looks like glibc atleast follows this argument. In both the cases, the `end_ptr` is updated by only 1 implying that only one valid character was read.

If you are convinced with this, can you update `infer_base` to reflect all of this. Also add comments quoting the appropriate sections in the standard.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112176/new/

https://reviews.llvm.org/D112176