[clang] [compiler-rt] [ubsan] Display correct runtime messages for negative _BitInt (PR #96240)

Mon Sep 2 07:12:05 PDT 2024

jakubjelinek wrote:

I'm not suggesting to encode the number of limbs anywhere, I'm suggesting encoding the bit precision of a limb somewhere.  And the limb ordering.
With little-endian limbs and little-endian ordering of limbs in the limb array, at least if the limbs are sane (their precision is a multiple of the char precision and there are no padding bits in between), the actual limb precision might seem to be irrelevant: all you care about is the N from {,{un,}signed }_BitInt(N) and whether it is unsigned or signed. So you can treat the passed pointer, say, as an array of 8-bit limbs: N / 8 limbs with a full 8 bits and, if N % 8 is nonzero, a last limb containing the remaining bits (in some ABIs those are required to be sign- or zero-extended, in other ABIs the padding bits are undefined, but on the libubsan side you can always treat them as undefined and always extend manually).
Or treat it as 16-bit limbs, or 32-bit limbs, or 64-bit limbs, or 128-bit limbs, for the larger sizes perhaps doing the limb reads with internal_memcpy, so that you don't impose an alignment requirement the target might not have.
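
To illustrate the little-endian case, here is a minimal sketch of reading the value as 8-bit limbs and extending the padding bits by hand. The function names are made up for this example and plain memcpy stands in for internal_memcpy; it assumes 8-bit chars, n_bits >= 1, and little-endian limbs with little-endian limb ordering.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy a little-endian _BitInt(n_bits) into `out` as
   ceil(n_bits / 8) bytes, ignoring whatever the ABI left in the padding
   bits of the last byte and extending them manually. Returns the number
   of bytes written. */
static size_t bitint_load_le_bytes(const void *src, unsigned n_bits,
                                   int is_signed, unsigned char *out) {
    size_t n_bytes = (n_bits + 7u) / 8u;
    unsigned top_bits = n_bits % 8u; /* value bits in the last byte, 0 = all 8 */

    memcpy(out, src, n_bytes); /* byte copy, so no alignment requirement */

    if (top_bits != 0) {
        unsigned char mask = (unsigned char)((1u << top_bits) - 1u);
        unsigned char top = out[n_bytes - 1] & mask; /* drop padding bits */
        unsigned sign = (top >> (top_bits - 1u)) & 1u;
        if (is_signed && sign)
            top |= (unsigned char)~mask;             /* sign-extend */
        out[n_bytes - 1] = top;                      /* else zero-extended */
    }
    return n_bytes;
}

/* Hypothetical helper: sign of a signed little-endian _BitInt(n_bits),
   read directly from bit n_bits - 1. */
static int bitint_is_negative_le(const void *src, unsigned n_bits) {
    const unsigned char *p = src;
    unsigned bit = n_bits - 1u;
    return (p[bit / 8u] >> (bit % 8u)) & 1u;
}
```

Nothing here depends on the real limb size, which is the point: on such targets only N and the signedness matter.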
But on big-endian, I think knowing the limb precision/size is already essential (sure, just a theory for now, GCC right now only supports _BitInt on little-endian targets because those are the only ones that have specified their ABI).
E.g. I believe _BitInt(513) on big-endian with big-endian limb ordering would be 17 limbs with 32-bit limbs, the first one containing just one bit (the most significant of the whole number) and the remaining ones 32 bits each; 9 limbs with 64-bit limbs, the first one containing just one bit and the remaining ones 64 bits each; and 5 limbs with 128-bit limbs, the first one just one bit, the remaining ones 128 bits each.  You can't decode these without knowing the limb size, the data looks different in memory.
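
The limb counts above follow from simple arithmetic, sketched here. The struct and function names are invented for the example, and it assumes no padding between limbs.

```c
/* Hypothetical description of how a _BitInt(n_bits) splits into limbs. */
struct limb_layout {
    unsigned n_limbs;       /* total number of limbs */
    unsigned top_limb_bits; /* value bits in the most significant limb */
};

static struct limb_layout bitint_limb_layout(unsigned n_bits,
                                             unsigned limb_bits) {
    struct limb_layout l;
    unsigned rem = n_bits % limb_bits;
    l.n_limbs = (n_bits + limb_bits - 1u) / limb_bits; /* round up */
    l.top_limb_bits = rem ? rem : limb_bits;
    return l;
}
```

For n_bits = 513 this gives 17, 9 and 5 limbs for 32-, 64- and 128-bit limbs respectively, each time with a single value bit in the most significant limb. With big-endian limb ordering that limb comes first in memory, so its size decides where every other bit lives.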
And then there is the possibility of big-endian limbs with little-endian ordering of the limbs in the array.
As the 15 bits of the current precision used e.g. for normal integers are clearly insufficient to express the supported BITINT_MAXWIDTH (8388608 in clang, 65535 right now in GCC), my suggestion is to use another bit for the limb ordering
(say 0 little endian, 1 big endian) and the remaining 14 bits for the limb precision (whether log2 or not doesn't matter that much).
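
A rough sketch of what such a 16-bit field could look like. The exact bit positions, the names, and storing the limb precision as a plain number rather than log2 are all assumptions made for illustration, not what the PR or libubsan currently does.

```c
#include <stdint.h>

/* Hypothetical layout of the 16-bit info field for _BitInt types:
     bit 0      signedness (1 = signed)
     bit 1      limb ordering (0 = little endian, 1 = big endian)
     bits 2-15  limb precision in bits, stored directly (not log2) */
enum { LIMB_ORDER_LITTLE = 0, LIMB_ORDER_BIG = 1 };

static uint16_t pack_bitint_info(int is_signed, int limb_order,
                                 unsigned limb_bits) {
    return (uint16_t)((is_signed ? 1u : 0u) |
                      ((unsigned)(limb_order & 1) << 1) |
                      ((limb_bits & 0x3FFFu) << 2));
}

static int info_is_signed(uint16_t info) { return info & 1u; }
static int info_limb_order(uint16_t info) { return (info >> 1) & 1u; }
static unsigned info_limb_bits(uint16_t info) { return info >> 2; }
```

Fourteen bits are far more than any realistic limb precision needs, which is why it hardly matters whether the value is stored as log2 or directly.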
As for the actual _BitInt precision after the type name, one option is what you currently implemented, i.e. always use a 32-bit integer in memory there, plus the extra '\0' termination if you really think it is needed (IMHO it is just a waste), and another
option is to use, say, uleb128 encoding of it.
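
For reference, a minimal sketch of the uleb128 option. The function names are made up for the example, and the decoder assumes well-formed input of at most five bytes (enough for a 32-bit value).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode `value` as ULEB128 into `out` (needs up to 5 bytes for 32 bits).
   Returns the number of bytes written. */
static size_t uleb128_encode(uint32_t value, unsigned char *out) {
    size_t n = 0;
    do {
        unsigned char byte = value & 0x7Fu; /* low 7 bits */
        value >>= 7;
        if (value != 0)
            byte |= 0x80u;                  /* continuation bit */
        out[n++] = byte;
    } while (value != 0);
    return n;
}

/* Decode a ULEB128 value from `in` into `*value`.
   Returns the number of bytes consumed. */
static size_t uleb128_decode(const unsigned char *in, uint32_t *value) {
    uint32_t result = 0;
    unsigned shift = 0;
    size_t n = 0;
    unsigned char byte;
    do {
        byte = in[n++];
        result |= (uint32_t)(byte & 0x7Fu) << shift;
        shift += 7;
    } while ((byte & 0x80u) && shift < 35);
    *value = result;
    return n;
}
```

With this encoding any precision up to 127 takes a single byte, 65535 takes three, and clang's maximum of 8388608 takes four, so it is never larger than the fixed 32-bit form.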

https://github.com/llvm/llvm-project/pull/96240