[llvm-dev] [RFC] IR-level Region Annotations

Fri Jan 20 10:45:15 PST 2017

In this case, "val" is shared per the OpenMP language rules, so no privatization is needed. %val is on the master thread's stack; to share %val among all threads, &val is passed to the outlined function.

void main() {
  int val;
  #pragma omp parallel num_threads(4)
  {
    // Really bad naming, won't pass code review. :)
    compute_something_into_val(&val, omp_get_thread_num());
  }
}

The IR would be:

void main() {
  i32* val = alloca i32
  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.SHARED"(i32* %val), "QUAL.NUM_THREADS"(i32 4)]
    %1 = omp_get_thread_num();
    compute_something_into_val(%val, %1);
  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}
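
For illustration only, the outlined form of this shared case can be sketched in plain C, with pthreads standing in for the OpenMP runtime. Everything beyond the original snippet is an assumption made for the sketch: the names outlined_region, struct outlined_args and run_shared_region are hypothetical, val is initialized to 0 so the result can be checked, compute_something_into_val is given a trivial body, and four worker threads are spawned rather than reusing the master as thread 0. The only point it shows is that every thread receives the same &val from the master's stack.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4

/* Hypothetical per-thread argument record for the outlined function. */
struct outlined_args {
    int *val; /* the same &val for every thread (models QUAL.SHARED) */
    int tid;  /* stands in for omp_get_thread_num() */
};

static pthread_mutex_t val_lock = PTHREAD_MUTEX_INITIALIZER;

/* Assumed body: add tid + 1 into *val under a lock to avoid a data race. */
static void compute_something_into_val(int *val, int tid) {
    pthread_mutex_lock(&val_lock);
    *val += tid + 1;
    pthread_mutex_unlock(&val_lock);
}

/* Outlined body of the parallel region. */
static void *outlined_region(void *p) {
    struct outlined_args *a = p;
    compute_something_into_val(a->val, a->tid);
    return NULL;
}

/* Runs the region and returns the final value of the shared val. */
int run_shared_region(void) {
    int val = 0; /* lives on the master's stack, as in the original */
    pthread_t threads[NUM_THREADS];
    struct outlined_args args[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        args[i].val = &val; /* no privatization: pass the master's address */
        args[i].tid = i;
        int rc = pthread_create(&threads[i], NULL, outlined_region, &args[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return val;
}
```

With these assumptions, run_shared_region() returns 10 (1 + 2 + 3 + 4), because all four threads update the one val.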

Xinmin

-----Original Message-----
From: Sanjoy Das [mailto:sanjoy at playingwithpointers.com]
Sent: Thursday, January 19, 2017 11:40 PM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Adve, Vikram Sadanand <vadve at illinois.edu>; llvm-dev-request at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

On Thu, Jan 19, 2017 at 11:27 PM, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
> Hi Xinmin,
>
> On Thu, Jan 19, 2017 at 11:20 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
>> Sanjoy, the IR would be something like the below. It is OK to hoist the alloca instruction outside the region. There are some small changes in the optimizer so that it understands the region-annotation intrinsics.
>>
>> void main() {
>>   i32* val = alloca i32
>>   tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>>     "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
>>
>>   int i = omp_get_thread_num();
>>   compute_something_into_val(val, i);
>>   a[i] = val;
>>
>>   llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>>
>> With the above representation, we can do privatization and outlining as
>> below.
>>
>> void main() {
>>   i32* val = alloca i32
>>   i32* i = alloca i32
>>   tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>>     "QUAL.PRIVATE"(i32* %val, i32* %i), "QUAL.NUM_THREADS"(i32 4)]
>>
>>   %ii = omp_get_thread_num();
>>   compute_something_into_val(%val, %i);
>>   a[i] = %val;
>>
>>   llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>>
>> 1. Create i32* %priv_val = alloca i32 (and likewise %priv_i = ...) in the region, and replace all uses of %val with %priv_val in the region.
>> 2. Perform outlining.
>
> But then what if compute_something_into_val is
>
> void compute_something_into_val(i32* ptr, i32 idx) {
>   static i32* cookie = null;
>   lock_mutex();
>   if (cookie == null)
>     cookie = ptr;
>   else
>     assert(cookie == ptr);
>   unlock_mutex();
>   // don't write to ptr, so there is no race
> }
>
> In other words, how do you differentiate between the hoisted-alloca 
> situation arising due to a hoist vs. arising because that's what the 
> programmer intended (and you're required to pass in the same address 
> to each call into compute_something_into_val)?

Just to be 100% clear, the source program in the latter case would have to be:

void main() {
  int val;
  #pragma omp parallel num_threads(4)
  {
    // Really bad naming, won't pass code review. :)
    compute_something_into_val(&val, omp_get_thread_num());
  }
}
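
A minimal C sketch of this concern, with pthreads standing in for the OpenMP runtime. The names region_args, outlined_region and count_mismatches are hypothetical, the assert in the callee is replaced by a mismatch counter so the sketch does not abort, and the privatized copy is modeled as a field of the per-thread argument record so that the addresses are guaranteed to be distinct. With the shared &val every call sees the same pointer; once val is privatized, the calls see different pointers and the callee's check fails.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4

static pthread_mutex_t cookie_lock = PTHREAD_MUTEX_INITIALIZER;
static int *cookie = NULL;  /* first pointer ever passed in */
static int mismatches = 0;  /* counts calls that would have tripped the assert */

/* Same check as the callee above, but counting instead of asserting. */
static void compute_something_into_val(int *ptr, int idx) {
    (void)idx;
    pthread_mutex_lock(&cookie_lock);
    if (cookie == NULL)
        cookie = ptr;
    else if (cookie != ptr)
        mismatches++;
    pthread_mutex_unlock(&cookie_lock);
    /* does not write to ptr, so there is no race on *ptr */
}

/* Hypothetical per-thread record passed to the outlined region. */
struct region_args {
    int *shared_val; /* &val from the master's stack */
    int priv_val;    /* models a per-thread private copy of val */
    int tid;
    int privatize;   /* nonzero: behave as if val were QUAL.PRIVATE */
};

static void *outlined_region(void *p) {
    struct region_args *a = p;
    int *val = a->privatize ? &a->priv_val : a->shared_val;
    compute_something_into_val(val, a->tid);
    return NULL;
}

/* Runs the region once and returns how many calls saw a different pointer. */
int count_mismatches(int privatize) {
    int val = 0;
    pthread_t threads[NUM_THREADS];
    struct region_args args[NUM_THREADS];

    cookie = NULL;
    mismatches = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        args[i].shared_val = &val;
        args[i].priv_val = 0;
        args[i].tid = i;
        args[i].privatize = privatize;
        int rc = pthread_create(&threads[i], NULL, outlined_region, &args[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    return mismatches;
}
```

Under these assumptions, count_mismatches(0) returns 0, while count_mismatches(1) returns 3: the first thread to take the lock sets the cookie, and the other three each pass a different address.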

-- Sanjoy