[Openmp-commits] [PATCH] D64217: [OpenMP][NFCI] Cleanup the target state queue implementation

Johannes Doerfert via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Wed Jul 10 19:07:37 PDT 2019


jdoerfert marked 2 inline comments as done.
jdoerfert added inline comments.


================
Comment at: openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h:29
-  ElementType elements[SIZE];
-  volatile ElementType *elementQueue[SIZE];
-  volatile uint32_t head;
----------------
ABataev wrote:
> ABataev wrote:
> > jdoerfert wrote:
> > > ABataev wrote:
> > > > jdoerfert wrote:
> > > > > ABataev wrote:
> > > > > > jdoerfert wrote:
> > > > > > > ABataev wrote:
> > > > > > > > jdoerfert wrote:
> > > > > > > > > ABataev wrote:
> > > > > > > > > > jdoerfert wrote:
> > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > jdoerfert wrote:
> > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > jdoerfert wrote:
> > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > jdoerfert wrote:
> > > > > > > > > > > > > > > > > ABataev wrote:
> > > > > > > > > > > > > > > > > > I would like to keep the volatile modifier at least for cuda 8. Without volatile it may produce incorrect code. I would keep it until we drop support for cuda 8.
> > > > > > > > > > > > > > > > > > Plus, I would suggest testing it very seriously for other versions of cuda. Does it really work correctly? Ptxas may use some incorrect optimizations without volatile. Though I like the idea of removing them.
> > > > > > > > > > > > > > > > > These members are accessed only through *atomic* accesses. Why would we require volatile in addition?
> > > > > > > > > > > > > > > > Without `volatile` they were not initialized properly on cuda 8. This, again, seems to me like some kind of a bug in ptxas for cuda 8. Not sure about this problem in cuda 9, it requires some additional testing.
> > > > > > > > > > > > > > > I actually did not notice that there is apparently no initialization of these values (or did I miss it?). If there is none, then that is the problem which needs fixing, e.g., `uint32_t Head = 0`.
> > > > > > > > > > > > > > They are initialized, but without volatile qualifier they are not initialized in the compiled code.
> > > > > > > > > > > > > I checked and I could not find the initialization of these variables anywhere. We should make sure the problem is not the UB of not initializing them before we require `volatile` here.
> > > > > > > > > > > > I don't think there is any UB; we checked it some time ago. The memory is zero-initialized, but without the volatile modifier it works incorrectly at least for cuda 8.
> > > > > > > > > > > > I'm trying to reduce the use of this structure with my patches, because, most probably, it relies on some undocumented features.
> > > > > > > > > > > > Plus, I don't think it is effective. It implements a queue to manage the resources, but I think we don't need a queue here, just a hash table.
> > > > > > > > > > > > I don't think there is any UB; we checked it some time ago. The memory is zero-initialized, but without the volatile modifier it works incorrectly at least for cuda 8.
> > > > > > > > > > > 
> > > > > > > > > > > I'm confused. I was expecting CUDA to have C++ semantics which requires class members to be initialized explicitly, or am I missing something here?
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > Plus, I don't think it is effective. It implements a queue to manage the resources, but I think we don't need a queue here, just a hash table.
> > > > > > > > > > > 
> > > > > > > > > > > That is a question for a follow up patch. I did not try to change the semantics in these ones, just the coding style and organization of the code.
> > > > > > > > > > > I'm confused. I was expecting CUDA to have C++ semantics which requires class members to be initialized explicitly, or am I missing something here?
> > > > > > > > > > 
> > > > > > > > > > Read this https://docs.nvidia.com/cuda/archive/10.1/cuda-c-programming-guide/index.html#device-memory-specifiers
> > > > > > > > > I fail to find the part that says they are initialized to zero. Could you specify where I can find that?
> > > > > > > > There is no such part. It is an implementation feature this class relies on, as I understand. I'm trying to reduce the number of such things in the library.
> > > > > > > I'm confused. I still stand by my earlier statement: This is currently UB and volatile somehow "makes it work". Agreed?
> > > > > > > 
> > > > > > > Couldn't we simply introduce the initialization and get rid of the volatile as all accesses (that are actually there) happen atomically anyway?
> > > > > > Not quite. The value is zero-initialized by default. This is not documented, this is implementation-specific. I agree that this is better to fix.
> > > > > > 
> > > > > > Atomic operations are not the best choice. They actually sequentialize the execution and it leads to significant performance degradation. Volatile modifier allows avoiding it.
> > > > > > 
> > > > > > I'm not sure that this will fix the problem with incorrect optimizations in cuda 8, at least. Seems to me, this version has some internal incorrect optimizations, which can be prevented with volatile modifier. I would suggest making it cuda 8 specific, at least. When we drop support of cuda 8, we can remove this completely. I'm not sure about this problem in cuda 9, it must be checked very seriously, but as far as I know cuda 9 is much more stable than cuda 8.
> > > > > > Not quite. The value is zero-initialized by default. This is not documented, this is implementation-specific.
> > > > > 
> > > > > How does one learn of this undocumented feature, how can one test for it, how is one safe against changes to "the implementation"?
> > > > > 
> > > > > > Atomic operations are not the best choice. They actually sequentialize the execution
> > > > > 
> > > > > 1) That is not true as a general statement, e.g., wrt. non-queue related operations, and
> > > > > 2) I'm pretty sure we want sequential consistency when it comes to modifications of the internal state of this queue. 
> > > > > 
> > > > > 
> > > > > > and it leads to significant performance degradation. Volatile modifier allows avoiding it.
> > > > > 
> > > > > What performance degradation? Do you have performance (and correctness) results for a state-queue without atomic accesses?
> > > > > 
> > > > > 
> > > > > > I'm not sure that this will fix the problem with incorrect optimizations in cuda 8, at least. Seems to me, this version has some internal incorrect optimizations, which can be prevented with volatile modifier. I would suggest making it cuda 8 specific, at least. When we drop support of cuda 8, we can remove this completely. I'm not sure about this problem in cuda 9, it must be checked very seriously, but as far as I know cuda 9 is much more stable than cuda 8.
> > > > >  
> > > > > Did you test the state-queue implementation without `volatile` but *with* explicit initialization of the members? I would like to verify a potential problem myself. If you have a test case and information about the configuration (Cuda version, clang version, hardware, ...) I will do my best to verify your above reasoning.
> > > > > How does one learn of this undocumented feature, how can one test for it, how is one safe against changes to "the implementation"?
> > > > 
> > > > Just like I said, I also don't like it. But the atomic alternative is much worse.
> > > > 
> > > > > 1. That is not true as a general statement, e.g., wrt. non-queue related operations, and
> > > > 
> > > > It is true if several threads are trying to access the same memory location. This is exactly our situation.
> > > > 
> > > > > 2. I'm pretty sure we want sequential consistency when it comes to modifications of the internal state of this queue.
> > > > 
> > > > And we do it atomically. And it leads to significant performance degradation. Why do you think the full runtime mode/general mode is much slower than the SPMD mode without runtime? Mostly because of the atomic operations and lockless structures.
> > > > 
> > > > > What performance degradation? Do you have performance (and correctness) results for a state-queue without atomic accesses?
> > > > 
> > > > Not exactly with the queue itself (though we have an atomic loop in the queue and it makes the performance significantly worse). We had the same situation with teams reductions. We had to change the scheme, because the performance was very poor (again, atomics). You can try to search bugzilla; there must be an error report about poor performance of teams reductions and a fix that changes the scheme to reduce the number of atomic operations.
> > > > 
> > > > Our problem here is that cuda does not have block/teams synchronization primitives. If we had effective block/teams sync primitives, we could solve this problem easily. Without them, we can use only atomic operations for synchronization and a loop. And we end up with poor performance immediately.
> > > > 
> > > > > Did you test the state-queue implementation without volatile but *with* explicit initialization of the members? I would like to verify a potential problem myself. If you have a test case and information about the configuration (Cuda version, clang version, hardware, ...) I will do my best to verify your above reasoning.
> > > > 
> > > > I tried it some time ago. After some testing I decided that it would be better to try to reduce the use of this queue instead if we want to get the performance.
> > > > 
> > > > You can try to implement direct initialization. You can try any test that requires full runtime mode (does not matter, SPMD or non-SPMD constructs) since you just want to change the initialization.
> > > Let us stop having these very long, very abstract conversations in code reviews that seem to lead nowhere due to the lack of actual facts. Instead, let me ask a couple of direct questions so that the comments made before become clearer and we can actually make some progress.
> > > 
> > > ---
> > > 
> > > Do you know whether the memory of class members in CUDA is zero-initialized or not? If so, how does one know and test this?
> > > 
> > > Did you (recently) run any tests, correctness or performance, related to
> > >  - this queue?
> > >  - the use of atomic here?
> > >  - the use of volatile here?
> > >  - the missing initialization here? 
> > > 
> > > How did you measure the performance gain/loss due to atomics in non-SPMD-mode given that non-SPMD-mode is vastly different from SPMD-mode? Or, put differently, how does one know that the performance difference between the modes stems from the use of atomics and not from anything else that is different, e.g., a software-managed stack, the indirection through a state machine, increased register usage, ...?
> > > 
> > > Let us stop having these very long, very abstract conversations in code reviews that seem to lead nowhere due to the lack of actual facts. Instead, let me ask a couple of direct questions so that the comments made before become clearer and we can actually make some progress.
> > 
> > Agree.
> > 
> > > Do you know whether the memory of class members in CUDA is zero-initialized or not? If so, how does one know and test this?
> > 
> > It initializes the global data with zero. Cuda is LLVM-based and relies on the LLVM IR restrictions, I assume. These restrictions make it zero-initialize the whole global data.
> > It was not me who decided to rely on this implementation-defined behavior; I'm personally not a big fan of it.
> > 
> > > Did you (recently) run any tests, correctness or performance, related to
> > 
> > > this queue?
> > 
> > Almost every time before sending a patch for review or committing it.
> > 
> > > the use of atomic here?
> > 
> > Not exactly with this structure but with similar ones, like the teams reductions buffers, the memory management hash table, etc.
> > 
> > > the use of volatile here?
> > 
> > Yes, at least for cuda 8 it is required to have volatile here. Not sure about cuda 9+.
> > 
> > > the missing initialization here?
> > 
> > Again, almost every time I submit a patch.
> > 
> > > How did you measure the performance gain/loss due to atomics in non-SPMD-mode given that non-SPMD-mode is vastly different from SPMD-mode? Or, put differently, how does one know that the performance difference between the modes stems from the use of atomics and not from anything else that is different, e.g., a software-managed stack, the indirection through a state machine, increased register usage, ...?
> > 
> > I have statistics for the test runs with the atomics and without. These statistics show everything. We have different tests, ones with a big number of registers used and ones with a small number. I'm personally using nvprof and time, and many runs, to make the statistics more stable.
> Just in case, take a look at this http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/ and this http://supercomputingblog.com/cuda/cuda-tutorial-5-performance-of-atomics/
> It initializes the global data with zero. 

I fail to reproduce this in a simple test locally. I have a `__device__` global variable of struct type and the LLVM-IR declaration I get for it has an `undef` initializer, not a `zeroinitializer`. How do/did you verify that it is initialized?
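For reference, one way to check this (the struct and the exact invocation below are illustrative, not the actual state-queue code or build) is to compile only the device side to LLVM IR and look at the initializer of the global:

```
// check_init.cu -- hypothetical reproducer, not the real queue type.
struct QueueLike {
  unsigned Elements[8];
  unsigned Head;        // no explicit initializer, as in state-queue.h
  // unsigned Head = 0; // the explicit in-class initialization suggested above
};
__device__ QueueLike TestQueue;

// Emit device-side LLVM IR and inspect the initializer of @TestQueue:
//   clang++ --cuda-device-only --cuda-gpu-arch=sm_60 -S -emit-llvm check_init.cu -o -
// i.e., check whether it is `undef` or `zeroinitializer`.
```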


> > the missing initialization here?

> Again, almost every time I submit a patch.

I'm talking about the queue implementation and all the things mentioned about it.



> I have statistics for the test runs with the atomics and without.
So the same code, run once with atomics and once without? Is it still correct? If so, why do we have the atomics, e.g., in this queue?

> These statistics show everything.
Can you share them?
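To make the point of contention concrete: a minimal sketch, with hypothetical names that do not match state-queue.h, contrasting a volatile read-modify-write of a shared index with an atomic one. The queue's indices are updated through atomic accesses, which is what the question above refers to.

```
// Illustrative only -- not the state-queue implementation.
__device__ unsigned SharedIndex = 0;

// volatile forces the loads and stores to be emitted, but the read-modify-write
// is still a race: two blocks can read the same value and claim the same slot.
__device__ unsigned reserveSlotVolatile() {
  volatile unsigned *Idx = &SharedIndex;
  unsigned Slot = *Idx;
  *Idx = Slot + 1;
  return Slot;
}

// atomicAdd makes the update indivisible, so every caller gets a distinct slot,
// at the cost of serializing the update itself.
__device__ unsigned reserveSlotAtomic() {
  return atomicAdd(&SharedIndex, 1u);
}
```

Which of those two effects actually dominates in the runtime is what the questions above are trying to pin down.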



================
Comment at: openmp/libomptarget/deviceRTLs/nvptx/src/state-queue.h:29
-  ElementType elements[SIZE];
-  volatile ElementType *elementQueue[SIZE];
-  volatile uint32_t head;
----------------
> Just in case, take a look at this http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/ and this http://supercomputingblog.com/cuda/cuda-tutorial-5-performance-of-atomics/

This seems to be a micro-benchmark evaluation that stress-tests atomic operations. It was published ~10 years ago. I really doubt such comments help this discussion.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D64217/new/

https://reviews.llvm.org/D64217




