[Openmp-commits] [PATCH] D14254: [OpenMP] Initial implementation of OpenMP offloading library - libomptarget device RTLs.

Wed Nov 29 08:19:06 PST 2017

Hahnfeld added inline comments.

================
Comment at: libomptarget/deviceRTLs/nvptx/src/data_sharing.cu:40
+  unsigned long long Mask = __BALLOT_SYNC(0xFFFFFFFF, true);
+  unsigned long long ShNum = 32 - (getThreadId() & DS_Max_Worker_Warp_Size_Log2_Mask);
+  unsigned long long Sh = Mask << ShNum;
----------------
sfantao wrote:
> Hahnfeld wrote:
> > sfantao wrote:
> > > Hahnfeld wrote:
> > > > `DS_Max_Worker_Warp_Size` instead of its hard-coded value? Or is 32 the length of an integer? (Another reason to avoid magic numbers!)
> > > Here 32 is both the GPU warp size and the size of the integer __popc() takes, I agree we should get an enumerator to indicate that. What is tricky here, is that the function assumes that the warp size is the same as the 32-bit integer. This assumption is made in CUDA so, I think is fine we also do it here. 
> > > 
> > > Actually, here I think the name of the function `getWarpMasterActiveThreadId ` is misleading. This function returns the number of active threads in the current warp whose ID in the warp is lower than the current one. If there are no active threads with lower ID,  then the current thread is the warp master. In other words, this function returns zero, if the current thread is the warp master.
> > > 
> > > This function seems to only be used in `IsWarpMasterActiveThread`. So I suggest remove the current implementation of `IsWarpMasterActiveThread` and rename this function to `IsWarpMasterActiveThread` and have it return:
> > > ```
> > > return __popc(Sh) == 0u;
> > > ```
> > I'm not sure I get your arguments: Let's start with the function `__popc()` which returns the number of bits set to 1 in the argument. This just happens to be a 32bit integer which has nothing to do with the warp size as far as I can see.
> > 
> > If this function really does what you described it deserves documentation or the code needs to be rewritten so that others can understand. This might be related to what `__BALLOT_SYNC(0xFFFFFFFF, true)` as I've already criticized below.
> > 
> > At the end, I don't understand what your proposal is: The function name you mention is still the same and the return statement basically inverts the value returned of `__popc()`
> > I'm not sure I get your arguments: Let's start with the function __popc() which returns the number of bits set to 1 in the argument. This just happens to be a 32bit integer which has nothing to do with the warp size as far as I can see.
> > 
> > If this function really does what you described it deserves documentation or the code needs to be rewritten so that others can understand. This might be related to what __BALLOT_SYNC(0xFFFFFFFF, true) as I've already criticized below.
> 
> The function is documented in http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html#group__CUDA__MATH__INTRINSIC__INT_1g43c9c7d2b9ebf202ff1ef5769989be46. Using 32-bit integers to do mask operations has all to do with the warp size in the sense that if we had larger warp sizes we would have to change the literals (including the ones used with `__ballot`) and use different versions or a combination of multiple `__popc()`. 
> 
> 
> > At the end, I don't understand what your proposal is: The function name you mention is still the same and the return statement basically inverts the value returned of __popc()
> 
> All I am suggesting is to remove the current implementation of  `IsWarpMasterActiveThread`, and rename `getWarpMasterActiveThreadId ` to  `IsWarpMasterActiveThread`.  If `__popc` returns zero means the current thread is the master thread, so comparing it with zero will return the correct result. This is what `IsWarpMasterActiveThread` does right now.
> 
> I agree we can improve the comments to document all this.
Btw: If we just compare `__popc(x) == 0` then we know `x == 0` because x has zero bits set to 1. So no point in doing `__popc()` at all?

================
Comment at: libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h:74
+  DS_Max_Worker_Warp_Size_Log2 = 5,
+  DS_Max_Worker_Warp_Size_Log2_Mask = (~0u >> (32-DS_Max_Worker_Warp_Size_Log2)),
+  // The slot size that should be reserved for a working warp.
----------------
sfantao wrote:
> Hahnfeld wrote:
> > sfantao wrote:
> > > grokos wrote:
> > > > Hahnfeld wrote:
> > > > > First, this value needs a comment what it is used for and maybe we can drop the `Log2` is its a mast related to the value, not the logarithm.
> > > > > 
> > > > > From what I understand, this value has has the 5 LSBs set to 1 which should be equivalent to `DS_Max_Worker_Warp_Size - 1`? (this also avoid relying on 32 bit integers)
> > > > I've changed its definition to the proposed `DS_Max_Worker_Warp_Size - 1` which is much clearer. This mask is used to get (threadID mod Warp_Size). I've added a comment.
> > > Using `DS_Max_Worker_Warp_Size - 1` its fine here. This is mostly used to calculate the ID of a thread in its warp, i.e. `threadID() % warpSize` which is equivalent to ` threadID() & DS_Max_Worker_Warp_Size`. The compiler should be smart enough to convert one to the other if `warpSize` is a constant. 
> > > 
> > > I'd rather prefer do make little optimisations explicit in the code, but if one thinks this extra macro is too much, one could revisit all uses and see if `warpSize` is constant and use `threadID() % warpSize` throughout the code. I don't see any reason for making `warpSize` non-constant anyway. 
> > Do you mean `threadID() % warpSize` is equivalent to `threadID() & (DS_Max_Worker_Warp_Size - 1)`? Because otherwise I disagree.
> > 
> > I agree that the compiler will convert the usage. IMO `threadID() % warpSize` is //a lot more// obvious than the bit operations the compiler might use.
> > Do you mean threadID() % warpSize is equivalent to threadID() & (DS_Max_Worker_Warp_Size - 1) ?
> 
> Yes, that is what I mean. Of course, this assumes warp size is a power of 2.
> 
> > I agree that the compiler will convert the usage. IMO threadID() % warpSize is a lot more obvious than the bit operations the compiler might use.
> 
> I don't have a strong opinion here. I agree that `threadID() % warpSize` is easier to read and I think its fine we go with that. It's just that if we intend the implementation to be `threadID() & (DS_Max_Worker_Warp_Size - 1) ` I don't see why we shouldn't write that in the code.  
1. IMO source code is meant to be understood by humans who are normally much more familiar with mathematical operations (modulo) than bit operations - even programmers.
2. Because it assumes a power of 2. If the compiler can prove that - great, do the optimization.
(If we happen to find performance implications, we can still change it and refactor it to a function and add the documentation why this is necessary, what it assumes and how it works)

Repository:
  rL LLVM

https://reviews.llvm.org/D14254