[llvm-dev] [RFC] Adding range metadata to array subscripts.

Thu Mar 25 10:11:41 PDT 2021

On 3/25/21 8:53 AM, Clement Courbet wrote:
> To summarize here is how each proposal would look like, along with a list
> of pros and cons for each.
>
> %struct.S = type { i32, [2 x i32], i32 }
>
> *inrange*
> define void @with_inrange(%struct.S* %s, i32 %in_array) {
>    %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
> inrange i32 %in_array
>    %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
>    ret void
> }
>
> (Note that this is not currently valid IR, we need to make `inrange` work
> on standalone GEP instructions)
> pros:
>   - compact, no extra instructions
> cons:
>   - The range information is local to the GEP instruction, and cannot be
> easily used for other purposes, e.g. SCEV (the ValueTracker is not aware of
> it)

This seems fine to me, SCEV and others can learn about this.
I assume we'll want the range `llvm.assume` anyway but that doesn't
necessarily preclude this.

> *annotation with range metadata*
> define void @with_annotation(%struct.S* %s, i32 %index) {
>    %in_array = call i32 @llvm.annotation.i32(%index), !range !0
>    %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
> i32 %in_array
>    %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
>    ret void
> }
>
> pros:
>   - a single added instruction
>   - the ValueTracker already knows about range metadata
>   - deriving the range in ValueTracker involves a single metadata lookup.
>   - The range information will be available for other purposes (e.g. SCEV).
> cons:
>   - one extra instruction
>   - llvm.annotation is not widely used

This is my least favorite. It introduces a completely new concept
for something that should be attached to the GEP or assume.

>
>
> *assume*
> define void @with_assume(%struct.S* %s, i32 %in_array) {
>    %cmp1 = icmp sge i32 %in_array, 0
>    %cmp2 = icmp slt i32 %in_array, 2
>    %cmp = and i1 %cmp1, %cmp2
>    call void @llvm.assume(i1 %cmp)
>    %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
> i32 %in_array
>    %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
>    ret void
> }
>
> pros:
>   - The range information will be available for other purposes (e.g. SCEV).
> cons:
>   - several extra instructions
>   - the ValueTracker cannot currently derive a range from this
>   - Deriving the range can be costly
> <https://github.com/llvm/llvm-project/blob/dd388ba3e0b0a5f06565d0bcb6e1aebb5daac065/llvm/lib/Analysis/ValueTracking.cpp#L638>

This is the old assume way and OK. I don't think the cost of
looking at assumptions is a good argument because we should
come up with solutions there anyway. The extra instructions
and uses are the real downside here.

>
> *assume with bundle*
> define void @with_assume_bundle(%struct.S* %s, i32 %in_array) {
>    call void @llvm.assume(i1 true) ["range"(i32 %in_array, i32 0, i32 2)]
>    %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
> i32 %in_array
>    %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
>    ret void
> }
>
> (Note that this is not currently valid IR, we need to make bundle tags work
> for non-attributes)
> pros:
>   - The range information will be available for other purposes (e.g. SCEV).
>   - ? (see below)
> cons:
>   - one extra instruction
>   - ? (see below)
>
> @johannes For the last option, I'm also not sure what becomes of the
> "range" tag in the assume bundle once we remove the assertion ? Does it
> become attached to the value (%in_array), in which case we have the
> advantages of range metadata (option 2), or is this still attached to the
> assume, with the disadvantages of option 3 ?

So, as I said before, we'll have to make bundles work for non-attribute
tags anyway, that should not be a negative point here. The extra instruction
is there but we already know how to combine multiple `llvm.assume`s into 
one.

The tag of the bundle describes the kind of assumption. In general, it could
be pretty much anything, the user needs to know how to interpret the 
arguments.

All arguments of a bundle are "attached" to the tag, think of the tag as a
function name of an uninterpreted function. The "range" one could look 
like this
```
void range(int v, int lb, int ub) {
   __builtin_assume(lb <= v && v < ub);
}
```

Later I'm confident we even want such explicit assumption functions, so 
you can
write the following, given the "range" definition above is in the module.
`call void @llvm.assume(i1 true) ["assumption_fn"(@range, i32 %in_array, 
i32 0, i32 2)]`
For more justification see: 
https://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html

I'm not sure I understand the questions properly. It basically is range 
metadata attached
to an arbitrary value for a fixed program point. Could you elaborate on 
your comments wrt.
the connection to option 2 and 3, please.

~ Johannes

> On Wed, Mar 24, 2021 at 8:32 PM Johannes Doerfert <
> johannesdoerfert at gmail.com> wrote:
>
>> On 3/24/21 12:47 PM, Florian Hahn wrote:
>>>> On Mar 24, 2021, at 15:16, Johannes Doerfert via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>> On 3/24/21 9:06 AM, Clement Courbet wrote:
>>>>> On Wed, Mar 24, 2021 at 2:20 PM Johannes Doerfert <
>>>>> johannesdoerfert at gmail.com> wrote:
>>>>>
>>>>>> I really like encoding more (range) information in the IR,
>>>>>> more thoughts inlined.
>>>>>>
>>>>>> On 3/24/21 4:14 AM, Clement Courbet via llvm-dev wrote:
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> tl;dr: I would like to teach clang to output range metadata so that
>> LLVM
>>>>>>> can do better alias analysis. I have a proposal as D99248
>>>>>>> <https://reviews.llvm.org/D99248> (clang part) and D99247
>>>>>>> <https://reviews.llvm.org/D99247> (llvm part). But there are other
>>>>>> possible
>>>>>>> options that I'm detailing below.
>>>>>>>
>>>>>>> Consider the following code, adapted from brotli
>>>>>>> <https://en.wikipedia.org/wiki/Brotli>:
>>>>>>>
>>>>>>> ```
>>>>>>>
>>>>>>> struct Histogram {
>>>>>>>
>>>>>>>      int values[256];
>>>>>>>
>>>>>>>      int total;
>>>>>>>
>>>>>>> };
>>>>>>>
>>>>>>> Histogram DoIt(const int* image, int size) {
>>>>>>>
>>>>>>>      Histogram histogram;
>>>>>>>
>>>>>>>      for (int i = 0; i < size; ++i) {
>>>>>>>
>>>>>>>        ++histogram.values[image[i]];  // (A)
>>>>>>>
>>>>>>>        ++histogram.total;             // (B)
>>>>>>>
>>>>>>>      }
>>>>>>>
>>>>>>>      return histogram;
>>>>>>>
>>>>>>> }
>>>>>>> ```
>>>>>>>
>>>>>>> In this code, the compiler does not know anything about the values of
>>>>>>> images[i], so it assumes that 256 is a possible value for it. In that
>>>>>> case,
>>>>>>> (A) would change the value of histogram.total, so (B) has to load,
>> add
>>>>>> one
>>>>>>> and store [godbolt <https://godbolt.org/z/KxE343>].
>>>>>>>
>>>>>>> Fortunately, C/C++ has a rule that it is invalid (actually, UB) to
>> use
>>>>>>> values to form a pointer to total and dereference it. What valid
>> C/C++
>>>>>> code
>>>>>>> is allowed to do with values is:
>>>>>>>     - Form any pointer in [values, values + 256].
>>>>>>>     - Form and dereference any pointer in [values, values + 256)
>>>>>>>
>>>>>>> Note that the LLVM memory model is much laxer than that of C/C++. It
>> has
>>>>>> no
>>>>>>> notion of types. In particular, given an LLVM aggregate definition:
>>>>>>>
>>>>>>> ```
>>>>>>> %struct.S = type { [42 x i32], i32, i32 }
>>>>>>> ```
>>>>>>>
>>>>>>> It is perfectly valid to use an address derived from a GEP(0,0,%i)
>> [gep
>>>>>>> reference] representing indexing into the [42 x i32] array to load
>> the
>>>>>> i32
>>>>>>> member at index 2. It is also valid for %i to be 43 (though not 44
>> if an
>>>>>>> inbound GEP is used).
>>>>>>> So clang has to give LLVM more information about the C/C++ rules.
>>>>>>>
>>>>>>> *IR representation:*
>>>>>>> LLVM has several ways of representing ranges of values:
>>>>>>>     - *!range* metadata can be attached to integer call and load
>>>>>> instructions
>>>>>>> to indicate the allowed range of values of the result. LLVM's
>>>>>> ValueTracking
>>>>>>> provides a function for querying the range for any llvm::Variable.
>>>>>>>     - The *llvm.assume* intrinsic takes a boolean condition that can
>> also
>>>>>> be
>>>>>>> used by ValueTracking to infer range of values.
>>>>>>>     - The *inrange* attribute of GEP can be used to indicate C-like
>>>>>> semantics
>>>>>>> for the structure field marked with the inrange attribute. It can
>> only be
>>>>>>> used for GEP constantexprs (ie.e. GEPs defined inline), but not for
>>>>>>> standalone GEPs defining instructions.  relevant discussion
>>>>>>> <https://reviews.llvm.org/D22793?id=65626#inline-194653>.
>>>>>>>
>>>>>>> Alternatives:
>>>>>>> *(1) *Annotate each array subscript index value with a range, e.g.:
>>>>>>> ```
>>>>>>> %i = i64 …
>>>>>>> %ri =  call i64 @llvm.annotation.i64(%index), !range !0
>>>>>>> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32
>> 0,
>>>>>> i32
>>>>>>> %ri
>>>>>>> ...
>>>>>>> !0 = !{i64 0, i64 42}
>>>>>>> ```
>>>>>>> *(2) *(variant of 1) relax the constraint that !range metadata can
>> only
>>>>>> be
>>>>>>> set on call and load instructions, and set the !range metadata on the
>>>>>> index
>>>>>>> expression. We still need annotations for function parameters though:
>>>>>>> ```
>>>>>>> %i = i64 … , !range !0
>>>>>>> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32
>> 0,
>>>>>> i32
>>>>>>> %i
>>>>>>> ...
>>>>>>> !0 = !{i64 0, i64 42}
>>>>>>> ```
>>>>>>> This is slightly more compact.
>>>>>>>
>>>>>>> *(3)* Same as (1), with llvm.assume. This feels inferior to
>> annotations.
>>>>>>> *(4)* Extend inrange to non-constantexprs GEPs. It is unclear how
>> this
>>>>>> will
>>>>>>> interfere with optimizations.
>>>>>> I would very much like not to introduce another way to encode
>>>>>> assumptions other than `llvm.assume`. If you want to avoid the extra
>>>>>> instructions, use `llvm.assume(i1 true) ["range"(%val, %lb, %ub)]`,
>>>>>> which is in line with our move towards operand bundle use.
>>>>>>
>>>>> Thanks, I did not know about that. I've just tried it but it appears
>> that
>>>>> tags have to be attribute names, and `!range` is not a valid attribute,
>>>>> it's a metadata node. Is there a way to encode this ?
>>>> We need to get rid of that assertion. There are other non-attributes
>>>> to be used in assume operand bundles in the (near) future, so the this
>>>> work has to be done anyway.
>>> +1 on trying to use assume, rather than adding another way.
>>>
>>> But are value ranges special for assumes, so that we need to handle them
>> in a bundle? Is that just so we can easier skip ‘artificial’ assume users?
>>
>> It would make users explicit and we will have non-attribute bundles anyway.
>> I find it also "conceptually nicer", would you prefer explicit
>> instructions?
>>
>> ~ Johannes
>>
>>
>>> Cheers,
>>> Florian