[libcxx-commits] [PATCH] D99091: [locale][num_get] Improve Stage 2 of string to float conversion

Fri May 28 08:07:00 PDT 2021

tmatheson added a comment.

Ping. @ldionne? It doesn't seem like there is a lot of interest in this fix, since it's been up for over 2 months now in more or less the same form. I'm happy to keep making the suggested changes if it has a chance of landing, otherwise please let me know and I will just close it.

================
Comment at: libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_float.pass.cpp:50
+    TEST("-inF", 4, -INFINITY, ios.goodbit | ios.eofbit);
+    TEST("INFxyz", 3, INFINITY, ios.goodbit);
+
----------------
Quuxplusone wrote:
> tmatheson wrote:
> > Quuxplusone wrote:
> > > I'd still like to see the test strings `"INFINITE"` and `"INFINITY"` somewhere in these tests. Also `INAN`, now that I think of it (because it looks like it's gonna be "INF" and then switches to "NAN" in the middle) — and any other "white-box" test cases you can think of.
> > > Also every-possible-prefix-of-a-correct string: `0x`, `123e`, `123e+`, `123e-`. (Maybe these are already covered somewhere?)
> > I'll try to explain what the issue is with adding "intinite" and "infinity" tests -- besides the fact that "infinity" worked before I removed it on request :)
> > 
> > On one hand, we have the ideal behaviour, which is that `do_get` handles anything that `strtod` can handle.
> > 
> > Then we have the over-specified wording in the standard, which actually defines `do_get` in three stages, and requires us to (paraphrasing) "loop until we reach an invalid character, then stop and pass the substring to `strtod` for parsing."
> > 
> > (The [[ https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.2 |actual wording]]: "a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1. If so, it is accumulated.")
> > 
> > This wording in the standard is probably why we have this slightly bizarre coding style in the first place, with the body of the loop factored out (and the signature added to the ABI).
> > 
> > With a garden path input like "infinite" you need to be able to backtrack in stage 2, and reject all the characters that you previously considered valid (e.g. "init") until you get to a valid substring. The parser described in the standard is simply not up to the job.
> > 
> > You might think we can err on the side of caution, and accumulate more characters than we need, and let `strtod` use what it wants (e.g. halt stage 2 after accumulating "infinit" and let `strtod` just parse "inf"). But no, it must return zero and give you an error: https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.4
> > 
> > As such I can see *no way* that we can actually meet the following criteria simultaneously:
> > * Be compliant with the standard, e.g. stick to the algorithm it describes
> > * Handle all valid floating point inputs
> > * Always return the same result as `strtod` would
> > 
> > The initial problem I wanted to solve is that "1.2f" is badly parsed. This is not a valid floating point number. "1.2" is valid. The "f" should be ignored, i.e. not passed to stage 2. It was not ignored because "f" is a valid hex digit. But the existing code had no way of keeping track of whether hex digits are valid at the current point in the string. I have added this here because it is relatively simple to do without a complete refactor, and it still looks like the standard if you squint. Adding the ability to backtrack to handle garden path inputs is a significant deviation from the standard imo.
> > 
> > This change should increase the number of correctly handled cases and hopefully not break any that were handled correctly before, but it is not perfect or complete.
> I'd still like to see those tests, because otherwise we have codepaths that we're not exercising — which means they could segfault, or worse, for all we know. I see two not-mutually-exclusive alternatives:
> 
> - Figure out the "least common correct denominator" of all implementations, and add a minimal coverage test in test/std/ for it. If the least common denominator is just "we should be able to call it without crashing," then test `try { (void)thething("INFINITY"); } catch (...) {}`. But we should still //hit the codepath//.
> 
> - And/or, figure out the "current behavior of libc++," and add a regression test in test/libcxx/, so that we test //our// behavior, and then we'll know if it ever regresses by accident. Because we do care about regressions in this area, right? (That's why you're introducing the ABI flag in the first place.)
> 
> The responsible thing to do would probably be to add //both// of these approaches, honestly, now that I've said them out loud.
Sure, I can add something like this to test/std:

```
// FIXME these are all currently known to give the wrong answer
TEST("INFINITY", 8, 0.0, ios.failbit);

// FIXME it is unclear what the correct answer is in these cases
TEST("INAN", 1, 0.0, ios.failbit);
TEST("INFINITY", 3, INFINITY, ios.goodbit);
```

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99091/new/

https://reviews.llvm.org/D99091