[llvm-dev] [RFC] Adding range metadata to array subscripts.

Wed Mar 24 11:57:45 PDT 2021

On Wed, Mar 24, 2021 at 4:20 AM Clement Courbet via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
>
> On Wed, Mar 24, 2021 at 12:00 PM Nuno Lopes <nunoplopes at sapo.pt> wrote:
>
>> In theory, inrange should be perfect for this job. If inrange was
>> extended to non-const geps, I think it would work for all cases you care
>> about, right?
>>
>
> Agreed, but inrange was not extended to non-const geps. I'm not sure
> exactly why. From what I understand of the previous discussions, there were
> non-trivial issues with supporting `inrange` for non-const geps, though
> it's not exactly clear to me what these were. Peter, what was the blocker
> for that ?
>

I don't recall any non-trivial issues. Reading the previous discussion, the
only issues were mechanical ones, i.e. we would need an IR representation
of inrange for non-constant GEPs together with printing/parsing/bitcode
support.

Peter

>
>> It’s also the most compact representation, so I would prefer that
>> solution. Better than adding a new call instruction for every single
>> pointer arithmetic instruction.
>>
>> In terms of BasicAA, supporting inrange shouldn’t be any different from
>> !range metadata I think.
>>
>
> Yes, for the purpose of BasicAA `inrange` is a special case of range,
> where the range happens to be the size of the underlying `inrange` field.
>
>
>>
>>
>> Just my 2c.
>>
>>
>>
>> Nuno
>>
>>
>>
>>
>>
>> *From:* Clement Courbet
>> *Sent:* 24 March 2021 09:15
>> *To:* llvm-dev <llvm-dev at lists.llvm.org>
>> *Subject:* [llvm-dev] [RFC] Adding range metadata to array subscripts.
>>
>>
>>
>> Hi everyone,
>>
>>
>>
>> tl;dr: I would like to teach clang to output range metadata so that LLVM
>> can do better alias analysis. I have a proposal as D99248
>> <https://reviews.llvm.org/D99248> (clang part) and D99247
>> <https://reviews.llvm.org/D99247> (llvm part). But there are other
>> possible options that I'm detailing below.
>>
>>
>>
>> Consider the following code, adapted from brotli
>> <https://en.wikipedia.org/wiki/Brotli>:
>>
>>
>>
>> ```
>>
>> struct Histogram {
>>
>>   int values[256];
>>
>>   int total;
>>
>> };
>>
>>
>>
>> Histogram DoIt(const int* image, int size) {
>>
>>   Histogram histogram;
>>
>>   for (int i = 0; i < size; ++i) {
>>
>>     ++histogram.values[image[i]];  // (A)
>>
>>     ++histogram.total;             // (B)
>>
>>   }
>>
>>   return histogram;
>>
>> }
>>
>> ```
>>
>>
>>
>> In this code, the compiler does not know anything about the values of
>> images[i], so it assumes that 256 is a possible value for it. In that case,
>> (A) would change the value of histogram.total, so (B) has to load, add one
>> and store [godbolt <https://godbolt.org/z/KxE343>].
>>
>>
>>
>> Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use
>> values to form a pointer to total and dereference it. What valid C/C++ code
>> is allowed to do with values is:
>>
>>  - Form any pointer in [values, values + 256].
>>  - Form and dereference any pointer in [values, values + 256)
>>
>>
>>
>> Note that the LLVM memory model is much laxer than that of C/C++. It has
>> no notion of types. In particular, given an LLVM aggregate definition:
>>
>> ```
>> %struct.S = type { [42 x i32], i32, i32 }
>> ```
>>
>> It is perfectly valid to use an address derived from a GEP(0,0,%i)  [gep
>> reference] representing indexing into the [42 x i32] array to load the i32
>> member at index 2. It is also valid for %i to be 43 (though not 44 if an
>> inbound GEP is used).
>>
>> So clang has to give LLVM more information about the C/C++ rules.
>>
>>
>>
>> *IR representation:*
>> LLVM has several ways of representing ranges of values:
>>  - *!range* metadata can be attached to integer call and load
>> instructions to indicate the allowed range of values of the result. LLVM's
>> ValueTracking provides a function for querying the range for any
>> llvm::Variable.
>>  - The *llvm.assume* intrinsic takes a boolean condition that can also
>> be used by ValueTracking to infer range of values.
>>  - The *inrange* attribute of GEP can be used to indicate C-like
>> semantics for the structure field marked with the inrange attribute. It can
>> only be used for GEP constantexprs (ie.e. GEPs defined inline), but not for
>> standalone GEPs defining instructions.  relevant discussion
>> <https://reviews.llvm.org/D22793?id=65626#inline-194653>.
>>
>>
>>
>> Alternatives:
>> *(1) *Annotate each array subscript index value with a range, e.g.:
>>
>> ```
>> %i = i64 …
>> %ri =  call i64 @llvm.annotation.i64(%index), !range !0
>> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0,
>> i32 %ri
>> ...
>> !0 = !{i64 0, i64 42}
>>
>> ```
>> *(2) *(variant of 1) relax the constraint that !range metadata can only
>> be set on call and load instructions, and set the !range metadata on the
>> index expression. We still need annotations for function parameters though:
>>
>> ```
>> %i = i64 … , !range !0
>> %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0,
>> i32 %i
>> ...
>> !0 = !{i64 0, i64 42}
>>
>> ```
>> This is slightly more compact.
>>
>> *(3)* Same as (1), with llvm.assume. This feels inferior to annotations.
>> *(4)* Extend inrange to non-constantexprs GEPs. It is unclear how this
>> will interfere with optimizations.
>>
>>
>>
>>
>>
>> *On the clang side*:
>>
>> The clang part is quite trivial as the infrastructure is already in place
>> to emit dynamic ubsan guards: D99248 <https://reviews.llvm.org/D99248>
>>
>>
>>
>> *On the LLVM Side:*
>>
>> Alternatives:
>> *(A)* - (annotation or assume options) Simply enable scev-aa which knows
>> how to handle value ranges in general. IIUC it's not enabled in clang
>> because it has issues with invalidation when code changes, and is therefore
>> not cacheable. This makes it too slow to be practical.
>> *(B) *- (annotation or assume options) Teach BasicAA to honor !range
>> metadata (D99247 <https://reviews.llvm.org/D99247>)
>> *(C)* - (inrange option) Teach BasicAA to honor inrange attributes of
>> GEP.
>>
>>
>>
>> I was leaning towards (1) and (B) because:
>>
>>  - BasicAA already has basic support for value range analysis
>> (zero/nonzero), this is a small and natural extension.
>>
>>  - The BasicAA improvement already benefits some existing code (as
>> evidenced by the test changes in D99247 <https://reviews.llvm.org/D99247>
>> )
>>
>>  - Using range metadata rather than the `inrange` attribute means that
>> BasicAA will automatically benefit from improvements in value tracking in
>> the future.
>>
>>
>>
>> Opinions ?
>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210324/1b52e39b/attachment.html>