[cfe-dev] libc++: max_size() of a std::vector

Matt Calabrese rivorus at gmail.com
Tue Feb 17 19:09:49 PST 2015


On Tue, Feb 17, 2015 at 4:38 PM, François Fayard <fayard.francois at icloud.com> wrote:

> Hi,
>
> I’ll try to give some arguments on why I think using an unsigned integer
> for indices is the worst idea one could have:
>
> 1) for(std::size_t k = v.size() - 1; k >= 0; --k) is a nice infinite loop
>

All you've stated is that you can accidentally create an infinite loop if
you write a loop incorrectly: k >= 0 is always true for an unsigned k, so
the condition never fails. I'd rather have a data type that properly
fits my value's constraints and doesn't artificially limit the size of a
range.
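
For illustration, here is one common way to write that reverse loop
correctly with an unsigned index (a minimal sketch; the vector contents
are placeholders):

#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3};

    // Count down with the index offset by one: k runs from v.size() down
    // to 1, and k - 1 is the element index, so k never goes below zero.
    for (std::size_t k = v.size(); k > 0; --k) {
        std::cout << v[k - 1] << '\n';
    }
}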


> 2) for(std::size_t k = 0; k < v.size() - 1; ++k) might take some time if
> v.size() == 0
>

Again, all you're doing is writing an invalid loop: when v.size() == 0,
the unsigned expression v.size() - 1 wraps around to SIZE_MAX. It doesn't
make sense to use a signed type just to support a poorly written loop.
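
A minimal sketch of the usual fix, which rewrites the condition so the
subtraction never happens (the function name is mine, for illustration):

#include <cstddef>
#include <vector>

void process_adjacent_pairs(const std::vector<int>& v) {
    // k + 1 < v.size() is equivalent to k < v.size() - 1 for nonempty v,
    // but correctly does nothing when v is empty.
    for (std::size_t k = 0; k + 1 < v.size(); ++k) {
        // ... use v[k] and v[k + 1] ...
    }
}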

In reference to both of your above examples, I could produce similar
mistakes with respect to signed types, only when size() is very close to
the largest ptrdiff_t value. They'd be rarer, but so what -- in either
case, choosing your datatype based on such an erroneous loop is choosing
the wrong solution to the wrong problem. In both cases, the *user* made a
mistake in the way that they wrote their loop, and it would be horrible to
use a datatype that doesn't actually match your range of values just
because of this. Rather than try to "do the right thing" when the user does
the wrong thing, the effort should instead be focused on making it so that
the user isn't doing the wrong thing to begin with. In other words, instead
of using a datatype that doesn't match the range, why not just encourage
proper usage of higher-order functions so that users aren't frequently
writing silly loops like this? That's not to say that you won't still
encounter manual loops, but I'd say that changing the datatype is a
poor solution to making incorrectly written manual loops "just work."
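
To make that concrete, both of the loops above disappear entirely when
written with standard algorithms or range-based for (a minimal sketch):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3};

    // Reverse traversal: no index arithmetic, no chance of wraparound.
    std::for_each(v.rbegin(), v.rend(),
                  [](int x) { std::cout << x << '\n'; });

    // Forward traversal: range-based for sidesteps indices entirely.
    for (int x : v)
        std::cout << x << '\n';
}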

> 3) It pollutes your types and when you compare signed and unsigned
> integers, you are very close to a bug
>

Yes, this is important, but I argue that consistently using signed types is
not the ideal solution. Consistently using unsigned types in this case is.
I am very concerned about this, much more so than most people, and I cringe
whenever I even see casting between signed and unsigned types. In that
respect, I feel that we are very much in agreement. In my personal
experience, though, there is nothing at all difficult about consistently
using unsigned types and I genuinely feel that we would be better off in
the long run to embrace them more so than is currently the case, and not
less. The only time I encounter problems is when *other* APIs use signed
types for everything.
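
For anyone following along, the classic comparison pitfall being referred
to looks like this (a minimal sketch; most compilers flag it with
-Wsign-compare):

#include <iostream>

int main() {
    int i = -1;
    unsigned u = 1;

    // i is converted to unsigned before the comparison, so -1 becomes
    // UINT_MAX and this prints "greater", not "less".
    if (i < u)
        std::cout << "less\n";
    else
        std::cout << "greater\n";
}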

> 4) Who needs modulo 2^n behaviour but people working with “bits”?
>

I agree that the required modulo 2^n behavior is an unfortunate
limitation for optimization, though it is also notably just a
language-imposed requirement -- we could just as easily have additional
unsigned types with UB on overflow in the language and have no potential
difference regarding efficiency when compared to signed types. If we had
that, there would be no difference in terms of optimizations, no artificial
limitation on sizes, and we'd be using a data type whose set of values more
accurately represents the set of possible sizes of our containers. IMO,
using a signed type here would just be a hack.
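
To make the optimization point concrete, here is the often-cited
addressing case (a sketch; the function names are mine):

#include <cstdint>

// Signed: overflow of i + j is UB, so on a 64-bit target the compiler
// may compute the addresses with a single 64-bit induction variable.
int64_t sum_signed(const int32_t* a, int32_t i, int32_t n) {
    int64_t s = 0;
    for (int32_t j = 0; j < n; ++j) s += a[i + j];
    return s;
}

// Unsigned: i + j is defined to wrap modulo 2^32, and the compiler must
// preserve that wraparound, which can block the same strength reduction.
int64_t sum_unsigned(const int32_t* a, uint32_t i, uint32_t n) {
    int64_t s = 0;
    for (uint32_t j = 0; j < n; ++j) s += a[i + j];
    return s;
}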

> Now, let’s get a little less technical. Watch this video
> http://www.youtube.com/watch?v=Puio5dly9N8 at time 42:38 and 1:02:50.
> You’ll find out that:
>

Yes, I'm aware, and I feel that they are misguided.


> Now, for the people defending the use of std::size_t which is unsigned,
> their arguments are often
>
> 1) Indices are nonnegative, therefore they should be unsigned. To those I
> would say :
>     - sqrt(x) does not mean anything for x < 0. Nobody invented an
> unsigned floating point number for that!
>

A couple of differences here. For one, operations with operands generally
do not map from one datatype to another, apart from promotion (or
operations on things like quantities with units attached), because they are
usually part of larger expressions where you need to be consistent. With
respect to something like "size()" returning an unsigned type, we have no
operand to be consistent with. It is effectively a terminal. Implying that
it simply *may* be used in signed integer expressions is not a compelling
argument, and in practice I find it much more commonly used in
unsigned expressions. If one were to make size() yield a signed value, then
all of a sudden any time you used it in an expression with unsigned data
you'd be back to having similar mismatching concerns. I don't see what
advocates really believe they're gaining.

As well, a simple unsigned type that has UB on overflow would take no
effort to standardize, and it would get rid of the modular-arithmetic and
optimization concerns -- this is not quite the case for a floating point
type. All you'd have to do for such an unsigned integral type is specify
in the standard that overflow is UB or implementation-defined. The only
difficult part is actually convincing people that it's worthwhile.
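
Nothing like this exists in the standard, but as a purely hypothetical
illustration, a library approximation of the intended semantics might
look like the following (the name checked_unsigned and its interface are
invented for this sketch; only a language-level type would actually give
the optimizer license to assume no wrapping):

#include <cassert>
#include <cstddef>

class checked_unsigned {
public:
    explicit checked_unsigned(std::size_t v) : value_(v) {}

    checked_unsigned& operator+=(checked_unsigned rhs) {
        // Wraparound is treated as a bug, not as modular arithmetic.
        assert(value_ + rhs.value_ >= value_ && "overflow is a bug");
        value_ += rhs.value_;
        return *this;
    }

    checked_unsigned& operator-=(checked_unsigned rhs) {
        assert(value_ >= rhs.value_ && "underflow is a bug");
        value_ -= rhs.value_;
        return *this;
    }

    std::size_t get() const { return value_; }

private:
    std::size_t value_;
};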

Finally, when writing code that deals with ranges over data structures,
which is what our examples have been alluding to here, the need for
negative values in full expressions and sets of statements that deal with
these range sizes is entirely unconvincing to me -- the examples you've
shown certainly do not imply any such need, they just show that some people
accidentally write incorrect loops. As said earlier, I've written a lot of
generic code, and the only time I ever find myself having to deal with
signed/unsigned mismatches is when other APIs use signed datatypes where
they could/should have used an unsigned type.

>     - v[k] still does not mean anything for k >= v.size()
>

I'm not sure I see your point here. With an unsigned type you can still
represent a size that uses the full range of your word, which is not
possible with a signed type. With a signed type, on the other hand, half
of your values are always completely worthless. I realize that people
frequently say "a large size is uncommon," but to me that is an extremely
weak argument and does not warrant using a type that doesn't properly
match the valid set of values that you need to represent.
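
Concretely, on a typical 64-bit platform the signed choice forfeits half
the representable range (a minimal sketch):

#include <cstddef>
#include <iostream>
#include <limits>

int main() {
    // Typically prints 18446744073709551615 for size_t versus
    // 9223372036854775807 for ptrdiff_t.
    std::cout << std::numeric_limits<std::size_t>::max() << '\n'
              << std::numeric_limits<std::ptrdiff_t>::max() << '\n';
}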

>     - An unsigned integer is not just an integer which is nonnegative. It is
> a member of the mathematical ring Z/(2^n)Z, which is another way of saying
> that it does have modulo 2^n behaviour. It is not what people want when
> they use unsigned int. And having taught arithmetic and Z/pZ for many years
> to students, I can tell you, nobody wants to mess with that ring.
>

Agreed, but *this* is what needs to be fixed, IMO. Making size() return a
signed type for this reason would not solve the *actual* problem.

> 2) You get another bit and therefore you can address more memory.
>     - Maybe, for the people who need to work with arrays of chars of more
> than 2 GB on 32 bit systems...
>

This genuinely can be a problem, and I think it's reasonable on its own as
a rationale for unsigned, though really it's the artificial shrinking of
the range of values under a signed type that is much more upsetting to me
personally, even though that concern is more ideological.

>     - But as you can see with this bug, it does fall apart.
>

Again, if users are incorrectly using the datatype, then *that* is what
needs to be addressed. By artificially changing the datatype to make a
common mistake "just work," you're creating other subtle problems,
shrinking the range of sizes that can be represented, and (in practice)
making function preconditions more complicated, because certain
constraints are no longer implied by the type's invariants.
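
A small sketch of that last point: with an unsigned parameter the
nonnegativity precondition is carried by the type itself, while the
signed variant needs an explicit runtime check (the function names are
mine, for illustration):

#include <cassert>
#include <cstddef>
#include <vector>

// The type invariant already guarantees n >= 0; nothing to check.
void resize_buffer(std::vector<int>& buf, std::size_t n) {
    buf.resize(n);
}

// The signed variant must state and enforce the precondition manually.
void resize_buffer_signed(std::vector<int>& buf, std::ptrdiff_t n) {
    assert(n >= 0 && "negative sizes are meaningless");
    buf.resize(static_cast<std::size_t>(n));
}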

> 3) It is faster to check if you are in bounds with unsigned integers
>     - No it is not. If k and n are std::size_t, the check is k < n. If k
> and n are std::ptrdiff_t, the check is static_cast<std::size_t>(k) <
> static_cast<std::size_t>(n) which gives the same assembly code.
>
> 4) Indices are useless anyway since we have iterators.
>     - I still use indices because:
>         - they don’t get invalidated when we copy arrays
>         - you need to use them when you want to use struct of arrays
> instead of array of structs.
>         - indices are richer than iterators: multiplication and
> division have no meaning for iterators.
>

Not much to say here other than that I personally wouldn't argue these.
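
For completeness, the single-comparison idiom from point 3 above looks
like this (a minimal sketch; it relies on a negative k casting to a huge
size_t that fails the < test, assuming n is nonnegative):

#include <cstddef>

// Unsigned index: one comparison covers both bounds.
bool in_bounds(std::size_t k, std::size_t n) {
    return k < n;
}

// Signed index: the cast folds the k >= 0 check into the same single
// comparison, since a negative k becomes larger than any valid n.
bool in_bounds_signed(std::ptrdiff_t k, std::ptrdiff_t n) {
    return static_cast<std::size_t>(k) < static_cast<std::size_t>(n);
}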

> I care about getting indices right because arrays are used so much in my
> field. It is too late for C++ but many people have been raised to think
> that it is a wise thing to use unsigned int for array indices.
> Unfortunately Rust seems to make the same error :-( So it would be useful
> to say these things loudly: *it was a huge mistake* and there is not a
> single benefit to it.
>

I realize that you feel it's an "error" but I strongly urge you to
reconsider. I agree with you 100% regarding modular arithmetic, and that to
me is the real problem. It is the only legitimately compelling reason to
avoid unsigned types that I ever hear. However, as a solution I'd much
rather see additional unsigned types that do not require modular behavior
on overflow than see signed types used in place of unsigned ones simply
because you don't need modular arithmetic.

-- 
-Matt Calabrese