[libcxx-commits] [PATCH] D99091: [locale][num_get] Improve Stage 2 of string to float conversion

Tomas Matheson via Phabricator via libcxx-commits libcxx-commits at lists.llvm.org
Mon May 17 11:27:41 PDT 2021

tmatheson added a comment.

Thanks @Quuxplusone. If @ldionne can't be tempted into taking a look is there anyone else appropriate?

This patch is about correctness (still a long way to go, see below) rather than speed/performance so I won't add a benchmark.

Comment at: libcxx/test/std/localization/locale.categories/category.numeric/locale.num.get/facet.num.get.members/get_float.pass.cpp:50
+    TEST("-inF", 4, -INFINITY, ios.goodbit | ios.eofbit);
+    TEST("INFxyz", 3, INFINITY, ios.goodbit);
Quuxplusone wrote:
> I'd still like to see the test strings `"INFINITE"` and `"INFINITY"` somewhere in these tests. Also `INAN`, now that I think of it (because it looks like it's gonna be "INF" and then switches to "NAN" in the middle) — and any other "white-box" test cases you can think of.
> Also every-possible-prefix-of-a-correct string: `0x`, `123e`, `123e+`, `123e-`. (Maybe these are already covered somewhere?)
I'll try to explain what the issue is with adding "intinite" and "infinity" tests -- besides the fact that "infinity" worked before I removed it on request :)

On one hand, we have the ideal behaviour, which is that `do_get` handles anything that `strtod` can handle.

Then we have the over-specified wording in the standard, which actually defines `do_get` in three stages, and requires us to (paraphrasing) "loop until we reach an invalid character, then stop and pass the substring to `strtod` for parsing."

(The [[ https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.2 |actual wording]]: "a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1. If so, it is accumulated.")

This wording in the standard is probably why we have this slightly bizarre coding style in the first place, with the body of the loop factored out (and the signature added to the ABI).

With a garden path input like "infinite" you need to be able to backtrack in stage 2, and reject all the characters that you previously considered valid (e.g. "init") until you get to a valid substring. The parser described in the standard is simply not up to the job.

You might think we can err on the side of caution, and accumulate more characters than we need, and let `strtod` use what it wants (e.g. halt stage 2 after accumulating "infinit" and let `strtod` just parse "inf"). But no, it must return zero and give you an error: https://timsong-cpp.github.io/cppwp/n4140/facet.num.get.virtuals#3.3.4

As such I can see *no way* that we can actually meet the following criteria simultaneously:
* Be compliant with the standard, e.g. stick to the algorithm it describes
* Handle all valid floating point inputs
* Always return the same result as `strtod` would

The initial problem I wanted to solve is that "1.2f" is badly parsed. This is not a valid floating point number. "1.2" is valid. The "f" should be ignored, i.e. not passed to stage 2. It was not ignored because "f" is a valid hex digit. But the existing code had no way of keeping track of whether hex digits are valid at the current point in the string. I have added this here because it is relatively simple to do without a complete refactor, and it still looks like the standard if you squint. Adding the ability to backtrack to handle garden path inputs is a significant deviation from the standard imo.

This change should increase the number of correctly handled cases and hopefully not break any that were handled correctly before, but it is not perfect or complete.

  rG LLVM Github Monorepo



More information about the libcxx-commits mailing list