[llvm-bugs] [Bug 46107] New: Poor present table performance
via llvm-bugs
llvm-bugs at lists.llvm.org
Wed May 27 13:00:50 PDT 2020
https://bugs.llvm.org/show_bug.cgi?id=46107
Bug ID: 46107
Summary: Poor present table performance
Product: OpenMP
Version: unspecified
Hardware: PC
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: Runtime Library
Assignee: unassignedbugs at nondot.org
Reporter: csdaley at lbl.gov
CC: llvm-bugs at lists.llvm.org
Created attachment 23545
--> https://bugs.llvm.org/attachment.cgi?id=23545&action=edit
The benchmark that reveals slow present table performance
Adding new entries to the OpenMP present table and accessing pre-existing
entries both take a long time. I have attached a benchmark that captures the
data management requirements of the HPGMG mini-app. The benchmark shows that
adding new entries with [:0] is slow (see the "Device init" code section) and
that retrieving data is also slow (see the "target update from" code section).
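For reference, here is a minimal sketch of the mapping pattern described
above. This is my own illustration, not the attached benchmark; the names
block_t, levels, num_levels, num_blocks and block_size are made up.

/* Illustrative sketch only; names are not taken from the attached benchmark. */
#include <omp.h>

typedef struct { double *data; } block_t;

void device_init(block_t **levels, int num_levels, int num_blocks,
                 int block_size) {
  for (int l = 0; l < num_levels; ++l) {
    /* Map the array of blocks for this level. */
    #pragma omp target enter data map(to: levels[l][0:num_blocks])
    for (int b = 0; b < num_blocks; ++b) {
      double *p = levels[l][b].data;
      /* Map the payload, then use a zero-length section ([:0]) to attach
       * the device copy of the struct member to it. Each attachment adds
       * entries to the present table / shadow pointer map. */
      #pragma omp target enter data map(to: p[0:block_size])
      #pragma omp target enter data map(to: levels[l][b].data[:0])
    }
  }
}

void copy_back(block_t **levels, int num_levels, int num_blocks,
               int block_size) {
  for (int l = 0; l < num_levels; ++l)
    for (int b = 0; b < num_blocks; ++b) {
      double *p = levels[l][b].data;
      /* Each update requires another present-table lookup; this is the
       * part that becomes slow when the table is large. */
      #pragma omp target update from(p[0:block_size])
    }
}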
I have two code versions: code version #1 uses [:0] to attach device pointers,
and code version #2 manually attaches device pointers inside OpenMP target
regions. I have tested 4 configurations using LLVM/Clang-11 from Apr 9 2020.
Configurations 1 and 2 test the case where the present table is small for both
code versions: the effective bandwidth of the "target update from" directive is
3.6 and 3.8 GB/s, respectively. Configurations 3 and 4 test the case where the
present table can be large for both code versions: the effective bandwidth in
configuration 3 is only 0.2 GB/s! The workaround code in configuration 4
achieves 3.8 GB/s; however, the time to initialize the data structure on the
device is more than 20x slower in both configurations 3 and 4, even though the
total problem size is identical for all 4 configurations. The full data is:
+ clang -Wall -Werror -Ofast -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda slow-present-table.c -o slow-present-table
Small present table configurations. Code versions #1 and #2 achieve 3.6 and 3.8 GB/s, respectively.
+ srun -n 1 ./slow-present-table 100 100 10000 1
num_levels=100, num_blocks=100, block_size=10000, mem=0.745058 GB, code_version=1
Host init time=0.244111 seconds
Device init time=0.747118 seconds
Device kernel time=0.012566 seconds
Transfers=100, Time=0.208275 seconds, Data=0.745058 GB, Bandwidth=3.577283 GB/s
SUCCESS
+ srun -n 1 ./slow-present-table 100 100 10000 2
num_levels=100, num_blocks=100, block_size=10000, mem=0.745058 GB, code_version=2
Host init time=0.228604 seconds
Device init time=0.568662 seconds
Device kernel time=0.013114 seconds
Transfers=100, Time=0.198643 seconds, Data=0.745058 GB, Bandwidth=3.750740 GB/s
SUCCESS
Large present table configurations. Code version #1 is an order of magnitude
slower than code version #2 for the data transfer!
+ srun -n 1 ./slow-present-table 100 10000 100 1
num_levels=100, num_blocks=10000, block_size=100, mem=0.745058 GB, code_version=1
Host init time=0.234948 seconds
Device init time=38.181732 seconds
Device kernel time=0.058576 seconds
Transfers=100, Time=3.147222 seconds, Data=0.745058 GB, Bandwidth=0.236735 GB/s
SUCCESS
+ srun -n 1 ./slow-present-table 100 10000 100 2
num_levels=100, num_blocks=10000, block_size=100, mem=0.745058 GB, code_version=2
Host init time=0.236912 seconds
Device init time=20.666237 seconds
Device kernel time=0.056635 seconds
Transfers=100, Time=0.197634 seconds, Data=0.745058 GB, Bandwidth=3.769888 GB/s
SUCCESS
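For comparison, here is a sketch of how I read the code version #2 workaround
(again illustrative only: the use_device_ptr/is_device_ptr pattern and the
names block_t, blocks, num_blocks, block_size are mine, and the attached
benchmark may differ):

/* Sketch of code version #2: map the payloads without [:0] and patch the
 * device-side pointers by hand inside a target region. */
#include <omp.h>

typedef struct { double *data; } block_t;

void device_init_manual(block_t *blocks, int num_blocks, int block_size) {
  #pragma omp target enter data map(to: blocks[0:num_blocks])
  for (int b = 0; b < num_blocks; ++b) {
    double *hostp = blocks[b].data;
    #pragma omp target enter data map(to: hostp[0:block_size])
    /* Obtain the device address of the payload ... */
    #pragma omp target data use_device_ptr(hostp)
    {
      double *devp = hostp;
      /* ... and store it into the device copy of the struct ourselves,
       * instead of asking the runtime to attach it via [:0]. */
      #pragma omp target is_device_ptr(devp)
      { blocks[b].data = devp; }
    }
  }
}

This keeps the attachment bookkeeping out of the runtime's shadow pointer map,
which appears to be why the "target update from" bandwidth stays at 3.8 GB/s
in configuration 4.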
Personal discussion with Johannes Doerfert: "I looked at the code we run in the
slow version (without profiling). I suspect the problem is that we have 3
entries for each mapped "ptr[:0]" in the Device.ShadowPtrMap. In the other
version we have one entry *temporarily* in there. At some point, I suspect,
this std::map becomes large and dealing with it slows down. It is unclear if we
really need these mappings or not. If so, we could potentially investigate a
more scalable data structure. I can imagine the init is slow because the map is
built, and the update is slow because we iterate the map for each update. Maybe
there is also some overhead we introduce by going through a few dynamic
libraries trying to allocate and copy 0 bytes of data for the map with the
empty array section."