[LLVMdev] Instructions that cannot be duplicated

Mon Ping Wang monping at apple.com
Thu Oct 8 17:35:21 PDT 2009


For these particular targets, the hardware may not support function  
calls, so at some point the compiler will need to inline the  
function. It would need to do something special, since it cannot  
inline the body at each call site; instead it would have to  
introduce a jump to the barrier and a jump back to whichever basic  
block it came from, which could be messy. It would be nicer to avoid  
the transformations that copy these basic blocks.

-- Mon Ping


On Oct 8, 2009, at 2:11 PM, Reid Kleckner wrote:

> On Thu, Oct 8, 2009 at 3:59 PM, Devang Patel  
> <devang.patel at gmail.com> wrote:
>> On Thu, Oct 8, 2009 at 11:28 AM, Villmow, Micah <Micah.Villmow at amd.com 
>> > wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Jeffrey Yasskin [mailto:jyasskin at google.com]
>>>> Sent: Thursday, October 08, 2009 11:09 AM
>>>> To: Villmow, Micah
>>>> Cc: LLVM Developers Mailing List
>>>> Subject: Re: [LLVMdev] Instructions that cannot be duplicated
>>>>
>>>> On Thu, Oct 8, 2009 at 10:49 AM, Villmow, Micah <Micah.Villmow at amd.com 
>>>> >
>>>> wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Eli Friedman [mailto:eli.friedman at gmail.com]
>>>>>> Sent: Wednesday, October 07, 2009 5:50 PM
>>>>>> To: Villmow, Micah
>>>>>> Cc: LLVM Developers Mailing List
>>>>>> Subject: Re: [LLVMdev] Instructions that cannot be duplicated
>>>>>>
>>>>>> On Wed, Oct 7, 2009 at 11:20 AM, Villmow, Micah
>>>>> <Micah.Villmow at amd.com>
>>>>>> wrote:
>>>>>>> Is there a current way to specify that an instruction or  
>>>>>>> function
>>>>>> call
>>>>>>> cannot be duplicated and thus any optimizations that might  
>>>>>>> want to
>>>>>> duplicate
>>>>>>> this instruction would fail?
>>>>>>
>>>>>> No.  Anything can be duplicated.  That could change, but you  
>>>>>> would
>>>>>> need to make a strong case for why other solutions won't work.
>>>>> [Villmow, Micah] The problem is that the function in question
>>>>> cannot be duplicated, because it has side effects and duplicating
>>>>> it causes undefined behavior on vector hardware. Likewise, moving
>>>>> the instruction inside control flow when it is originally outside
>>>>> control flow produces undefined behavior. There is currently no
>>>>> way to specify this in LLVM that I know of. We have tried lowering
>>>>> it to an intrinsic and setting MayWriteMem, and this does not
>>>>> solve the problem. After looking at the LLVM IR, there is no
>>>>> equivalent way of representing an instruction that is an execution
>>>>> barrier (not a memory barrier, which llvm.barrier.[ss|ll|ls|sl]
>>>>> is). If you have any ideas, we would be willing to give them a
>>>>> try.
>>>>
>>>> Is the effect similar to pthread_barrier_wait(barrier_for($pc))
>>>> [http://linux.die.net/man/3/pthread_barrier_wait]  where the
>>>> implementation automatically generates the barrier_for() function  
>>>> and
>>>> automatically calculates the number of threads to wait for?
>>>>
>>>> If the barrier lowers to any sort of function call, it sounds like
>>>> you're currently looking up the PC of the caller and finding the
>>>> barrier that way. Instead, could you specify the barrier as an explicit
>>>> argument to the function when your frontend generates the call
>>>> instruction, which would free you from worrying about whether the  
>>>> call
>>>> winds up in multiple places in the optimized IR.
>>>>
>>>> If the barrier lowers to a magic instruction on your chip, and that
>>>> instruction doesn't take an ID of any sort besides its address, you
>>>> could generate a one-instruction function for each barrier() in the
>>>> source language and allow calls to that function to be duplicated.
>>>> There may be optimizations that merge "identical" functions, but
>>>> they'll be easier to turn off than optimizations that assume they  
>>>> can
>>>> rearrange control flow.
>>>>
>>>> If your chip doesn't support function calls, that might  
>>>> constitute the
>>>> strong case Eli's asking for.
>>> [Villmow, Micah] Jeffrey, thanks for the information on pthreads.  
>>> The barrier on our hardware is a single instruction, not a  
>>> function call, with no arguments; everything is handled  
>>> implicitly, including keeping track of the number of hardware  
>>> threads that need to hit the barrier. If one of those hardware  
>>> threads does not hit the barrier, the result is undefined  
>>> behavior. Hence we need to guarantee that this instruction is  
>>> not optimized around, moved, or duplicated, since the algorithm  
>>> writer must place it following strict guidelines for it to work  
>>> correctly.
>>>
>>> So, my strong case for some sort of workaround is this:
>>>
>>> Valid original code is:
>>>
>>>   flag = false;
>>>   if (cond) {
>>>     flag = bar();
>>>   }
>>>   foo();
>>>   if (flag) {
>>>     bar();
>>>   }
>>>
>>> which transforms to:
>>>
>>>   if (cond) {
>>>     flag = bar();
>>>     foo();
>>>     if (flag) {
>>>       bar();
>>>     }
>>>   } else {
>>>     foo();
>>>   }
>>>
>>> Assumptions:
>>> - foo() is the barrier
>>> - each hardware thread is a 64-wide vector
>>> - two hardware threads are executing
>>> - each vector element gets a unique id between 0 and 127
>>> - the condition is true if id > 32
>>> - predication is used on control flow
>>>
>>> What happens in the original code:
>>> The first half of the first hardware thread predicates  
>>> computation on the first condition, the second half executes  
>>> bar(), and all lanes of the second hardware thread execute bar().  
>>> Both hardware threads hit the barrier, wait for the other  
>>> hardware thread to reach that point, and then continue execution.
>>>
>>> What happens in the transformed code:
>>> The first half of the first hardware thread predicates  
>>> computation on the first condition; the second half executes  
>>> bar() and waits at the barrier. The second hardware thread  
>>> executes bar() and hits the barrier, releasing both waiters to  
>>> continue. Both hardware threads then execute the rest of the if  
>>> block. Once the block ends, the predication masks are flipped and  
>>> the first half of the first hardware thread hits the barrier in  
>>> the else block, waiting for the second hardware thread. The  
>>> second hardware thread skips the else block entirely, so it never  
>>> hits that barrier and the first hardware thread never returns  
>>> from it.
>>>
>>> We have already found two optimization passes (LoopUnswitch and  
>>> SimplifyCFG) that perform these types of transforms, and we are  
>>> sure there are more. We want to keep these transforms enabled  
>>> for normal code, but have them respect the barrier instruction.
>>
>> ... and even if a new barrier intrinsic that does not allow cloning
>> (not sure how, but anyway...) is introduced, you will have to modify
>> all of these optimization passes to take note of this special
>> barrier. The new barrier won't be respected automatically.
>>
>> Have you thought about dividing the code into separate functions and
>> making sure the inliner does not inline them?
>>
>> fn1() /* do not inline */
>> your barrier()
>> fn2() /* do not inline */
>
> IMO Jeff's solution is the cleanest, simplest way to get code that
> works.  Just generate a separate function for every barrier in the
> program, and mark it noinline.  This way the instruction pointers will
> be unique to the barrier.
>
> Reid
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev



