[llvm-dev] lld and thread over-subscription

Fri Oct 2 19:41:53 PDT 2015

----- Original Message -----
> From: "Sean Silva" <chisophugis at gmail.com>
> To: "Rui Ueyama" <ruiu at google.com>
> Cc: "Hal Finkel" <hfinkel at anl.gov>, "LLVM Developers" <llvm-dev at lists.llvm.org>
> Sent: Friday, October 2, 2015 9:37:17 PM
> Subject: Re: [llvm-dev] lld and thread over-subscription
> 
> 
> On Thu, Oct 1, 2015 at 10:55 AM, Rui Ueyama via llvm-dev <
> llvm-dev at lists.llvm.org > wrote:
> 
> 
> 
> I honestly think that the ulimit of 1024 max threads is too strict
> for a 48-core machine. Processes are independent of each other, so it
> is not strange for them to spawn as many threads as the number of
> cores. What's the reason you cannot increase the limit?
> 
> 
> Yeah, this is it. We've run into this internally on our linux bots.
> Basically, the threading abstractions inside LLD spawn #cores
> threads for their thread pool as one of the very first things. So if
> your build is #cores wide, you end up with #cores ^ 2 threads total.
> 
> 
> The simplest solution is just upping the ulimit. This may be
> something we can even do inside lit so users automatically see it.

r249161 should do exactly this.
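
For readers unfamiliar with the mechanism, here is a minimal sketch of
raising the limit from a Python driver. It is not the contents of
r249161; it only assumes Python's standard `resource` module on a POSIX
system, and the function name `raise_nproc_soft_limit` is made up for
illustration.

```python
import resource


def raise_nproc_soft_limit():
    """Raise the soft RLIMIT_NPROC to the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft != hard:
        try:
            # An unprivileged process may raise its soft limit up to,
            # but not beyond, the hard limit.
            resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
        except (ValueError, OSError):
            # Leave the limit unchanged if the platform refuses.
            pass
    return resource.getrlimit(resource.RLIMIT_NPROC)


if __name__ == "__main__":
    print(raise_nproc_soft_limit())
```

This only helps when the hard limit is higher than the soft limit; if
both are 1024, an administrator still has to raise the hard limit.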

Thanks again,
Hal

> Beyond that, changes to LLD could ameliorate this; fundamentally
> though it has to do with thread pools knowing how many threads they
> need to spin up. A nasty solution could be an environment variable
> like LLD_NUM_THREADS. We could also add a command line flag and use
> a `%lld` substitution in the tests, as we do for clang with
> `%clang_cc1`, where the expansion inserts some extra arguments
> telling lld to use a smaller thread count (for the tests,
> --num-threads=1 would be fine I think).
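
To make the environment-variable idea concrete, here is a small sketch
of the selection logic, written in Python for brevity. It is not lld
code: LLD_NUM_THREADS is only the name proposed above, and the function
`pick_thread_count` and its fallback rules are assumptions made for
illustration.

```python
import os


def pick_thread_count(environ=None, default=None):
    """Choose a thread-pool size, honoring a hypothetical LLD_NUM_THREADS."""
    environ = os.environ if environ is None else environ
    if default is None:
        # Fall back to one thread per core, as the pool does today.
        default = os.cpu_count() or 1
    raw = environ.get("LLD_NUM_THREADS")
    if raw is None:
        return default
    try:
        requested = int(raw)
    except ValueError:
        # Ignore values that are not integers.
        return default
    # Ignore zero or negative requests.
    return requested if requested >= 1 else default
```

With `LLD_NUM_THREADS=1` in the test environment, each linker process
would use a single thread, so a 48-wide lit run would need roughly 48
threads rather than 48 * 48.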
> 
> 
> -- Sean Silva
> 
> On Thu, Oct 1, 2015 at 10:26 AM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> 
> 
> ----- Original Message -----
> > From: "Rui Ueyama" < ruiu at google.com >
> > To: "Hal Finkel" < hfinkel at anl.gov >
> > Cc: "LLVM Developers" < llvm-dev at lists.llvm.org >, "Rafael
> > Espindola" < rafael.espindola at gmail.com >
> > Sent: Thursday, October 1, 2015 11:46:05 AM
> > Subject: Re: lld and thread over-subscription
> > 
> > On Thu, Oct 1, 2015 at 9:35 AM, Hal Finkel < hfinkel at anl.gov >
> > wrote:
> > 
> > Hi Rui, et al.,
> > 
> > I was experimenting yesterday with building lld on my POWER7
> > PPC64/Linux machine, and ran into an unfortunate problem. When
> > running the regression tests under lit, almost all of the tests
> > fail like this:
> > 
> > terminate called after throwing an instance of 'std::system_error'
> > what(): Resource temporarily unavailable
> > ...
> > 5 libc.so.6 0x00000080b7847238 abort + 4293480680
> > 6 libstdc++.so.6 0x00000fff94f0f004 __gnu_cxx::__verbose_terminate_handler() + 4294099316
> > 7 libstdc++.so.6 0x00000fff94f0bc84
> > 8 libstdc++.so.6 0x00000fff94f0bccc std::terminate() + 4294087956
> > 9 libstdc++.so.6 0x00000fff94f0c0c4 __cxa_throw + 4294088780
> > 10 libstdc++.so.6 0x00000fff94f816e0 std::__throw_system_error(int) + 4294526808
> > 11 libstdc++.so.6 0x00000fff94f83d30 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) + 4294534936
> > 12 lld 0x000000001002a278
> > ...
> > 
> > which seems to indicate a core problem in dealing with
> > thread-resource exhaustion. For almost all tests, running them
> > individually (or using lit -j 1) works without a problem. We could
> > deal with this by limiting the number of threads lld uses when
> > running regression tests, or by limiting the number of threads that
> > lit uses when running lld tests (as we currently do with the OpenMP
> > runtime tests), but I'm somewhat concerned that users will run into
> > this problem regardless with heavily-parallelized builds.
> > 
> > We could try to catch exceptions that otherwise come from
> > ThreadPoolExecutor's constructor, but do we compile with exceptions
> > enabled?
> > 
> > I guess we do not want to enable exceptions to deal with the issue.
> > Are COFF tests failing, or just ELF tests? If ELF tests for the old
> > LLD are failing, the best way would be to not use threads in the
> > old LLD. It has lingering threading issues.
> > 
> 
> To provide a data point; my default environment has this:
> 
> $ ulimit -a | grep proc
> max user processes (-u) 1024
> 
> This machine has 48 cores, so lit running 48 tests in parallel leaves
> each test with only about 20 available threads, far fewer than the 48
> each test believes it can use.
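
The arithmetic behind that estimate can be sketched as follows. The
function name and parameters are hypothetical; the numbers are the ones
quoted above (1024 max user processes, 48 parallel tests, 48 threads
per test).

```python
def thread_budget(max_user_processes, parallel_tests, threads_per_test):
    """Compare the per-user process/thread limit with what a test run asks for."""
    # Even share of the limit available to each concurrently running test.
    per_test = max_user_processes // parallel_tests
    # Threads requested if every test spawns a full-size pool.
    demand = parallel_tests * threads_per_test
    return per_test, demand, demand > max_user_processes


print(thread_budget(1024, 48, 48))  # prints (21, 2304, True)
```

The real headroom is lower still, because every other process and
thread owned by the same user counts against the same limit.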
> 
> This is somewhat non-deterministic, but I just reran things both
> ways, and here's what I see:
> 
> During my last run, these tests fail when running under lit with many
> parallel tests, but do not fail when run otherwise:
> 
> lld :: elf2/basic.s
> lld :: elf/AArch64/general-dyn-tls-0.test
> lld :: elf/AArch64/initial-exec-tls-0.test
> lld :: elf/AArch64/rel-prel32-overflow.test
> lld :: elf/AArch64/rel-prel64.test
> lld :: elf/AMDGPU/hsa.test
> lld :: elf/ARM/arm-symbols.test
> lld :: elf/ARM/dynamic-symbols.test
> lld :: elf/ARM/entry-point.test
> lld :: elf/ARM/exidx.test
> lld :: elf/ARM/header-flags.test
> lld :: elf/ARM/mapping-code-model.test
> lld :: elf/ARM/mapping-symbols.test
> lld :: elf/ARM/missing-symbol.test
> lld :: elf/ARM/plt-dynamic.test
> lld :: elf/ARM/plt-ifunc-interwork.test
> lld :: elf/ARM/plt-ifunc-mapping.test
> lld :: elf/ARM/rel-arm-call.test
> lld :: elf/ARM/rel-arm-jump24-veneer-b.test
> lld :: elf/ARM/rel-arm-mov.test
> lld :: elf/ARM/rel-arm-prel31.test
> lld :: elf/ARM/rel-arm-target1.test
> lld :: elf/ARM/rel-arm-thm-interwork.test
> lld :: elf/ARM/undef-lazy-symbol.test
> lld :: elf/Hexagon/dynlib-data.test
> lld :: elf/Mips/exe-dynamic.test
> lld :: elf/Mips/exe-dynsym.test
> lld :: elf/Mips/exe-fileheader-64.test
> lld :: elf/Mips/exe-fileheader-micro-64.test
> lld :: elf/Mips/exe-fileheader-n32.test
> lld :: elf/Mips/exe-got-micro.test
> lld :: elf/Mips/exe-got.test
> lld :: elf/Mips/got16-2.test
> lld :: elf/Mips/got16-micro.test
> lld :: elf/Mips/got-page-32-micro.test
> lld :: elf/Mips/got-page-64-micro.test
> lld :: elf/Mips/got-page-64.test
> lld :: elf/X86_64/sectionchoice.test
> lld :: elf/X86_64/sectionmap.test
> lld :: mach-o/arm-interworking.yaml
> lld :: mach-o/arm-shims.yaml
> lld :: mach-o/data-only-dylib.yaml
> lld :: mach-o/executable-exports.yaml
> lld :: mach-o/exe-offsets.yaml
> lld :: mach-o/exported_symbols_list-undef.yaml
> lld :: mach-o/fat-archive.yaml
> lld :: mach-o/flat_namespace_undef_error.yaml
> lld :: mach-o/flat_namespace_undef_suppress.yaml
> lld :: mach-o/force_load-x86_64.yaml
> lld :: mach-o/got-order.yaml
> lld :: mach-o/hello-world-arm64.yaml
> lld :: mach-o/hello-world-armv6.yaml
> lld :: mach-o/hello-world-x86_64.yaml
> lld :: mach-o/hello-world-x86.yaml
> lld :: mach-o/keep_private_externs.yaml
> lld :: mach-o/lazy-bind-x86_64.yaml
> lld :: mach-o/library-rescan.yaml
> lld :: mach-o/mh_bundle_header.yaml
> lld :: mach-o/mh_dylib_header.yaml
> lld :: mach-o/objc_export_list.yaml
> lld :: mach-o/order_file-basic.yaml
> lld :: mach-o/parse-aliases.yaml
> lld :: mach-o/parse-cfstring32.yaml
> lld :: mach-o/parse-cfstring64.yaml
> lld :: mach-o/parse-compact-unwind32.yaml
> lld :: mach-o/parse-compact-unwind64.yaml
> lld :: mach-o/parse-data-in-code-armv7.yaml
> lld :: mach-o/parse-data-in-code-x86.yaml
> lld :: mach-o/parse-data-relocs-arm64.yaml
> lld :: mach-o/parse-data-relocs-x86_64.yaml
> lld :: mach-o/parse-data.yaml
> lld :: mach-o/parse-eh-frame-relocs-x86_64.yaml
> lld :: mach-o/parse-eh-frame-x86-anon.yaml
> lld :: mach-o/parse-eh-frame-x86-labeled.yaml
> lld :: mach-o/parse-eh-frame.yaml
> lld :: mach-o/parse-function.yaml
> lld :: mach-o/parse-initializers32.yaml
> lld :: mach-o/parse-initializers64.yaml
> lld :: mach-o/parse-literals-error.yaml
> lld :: mach-o/parse-literals.yaml
> lld :: mach-o/parse-non-lazy-pointers.yaml
> lld :: mach-o/parse-relocs-x86.yaml
> lld :: mach-o/parse-section-no-symbol.yaml
> lld :: mach-o/parse-tentative-defs.yaml
> lld :: mach-o/parse-text-relocs-x86_64.yaml
> lld :: mach-o/parse-tlv-relocs-x86-64.yaml
> lld :: mach-o/re-exported-dylib-ordinal.yaml
> lld :: mach-o/rpath.yaml
> lld :: mach-o/run-tlv-pass-x86-64.yaml
> lld :: mach-o/sectalign.yaml
> lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml
> lld :: mach-o/usage.yaml
> lld :: mach-o/use-simple-dylib.yaml
> lld :: mach-o/write-final-sections.yaml
> lld :: mach-o/wrong-arch-error.yaml
> lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range
> lld-Unit :: CoreTests/CoreTests/Range.slice
> lld-Unit :: CoreTests/CoreTests/Range.user1
> lld-Unit :: CoreTests/CoreTests/Range.user2
> 
> Of these, the following tests don't fail, but are reported as
> 'Unresolved' (which does not happen if I run lit -j 1):
> 
> lld :: elf/ARM/mapping-code-model.test
> lld :: elf/ARM/mapping-symbols.test
> lld :: elf/ARM/missing-symbol.test
> lld :: elf/ARM/plt-ifunc-interwork.test
> lld :: elf/ARM/rel-arm-jump24-veneer-b.test
> lld :: elf/Mips/exe-got-micro.test
> lld :: elf/Mips/exe-got.test
> lld :: elf/Mips/got16-micro.test
> lld :: mach-o/parse-cfstring64.yaml
> lld :: mach-o/parse-compact-unwind32.yaml
> lld :: mach-o/parse-compact-unwind64.yaml
> lld :: mach-o/parse-data-in-code-armv7.yaml
> lld :: mach-o/parse-data-in-code-x86.yaml
> lld :: mach-o/parse-data-relocs-arm64.yaml
> lld :: mach-o/parse-data-relocs-x86_64.yaml
> lld :: mach-o/parse-data.yaml
> lld :: mach-o/parse-eh-frame-relocs-x86_64.yaml
> lld :: mach-o/parse-eh-frame-x86-anon.yaml
> lld :: mach-o/parse-eh-frame-x86-labeled.yaml
> lld :: mach-o/parse-eh-frame.yaml
> lld :: mach-o/parse-function.yaml
> lld :: mach-o/parse-initializers32.yaml
> lld :: mach-o/parse-initializers64.yaml
> lld :: mach-o/parse-literals-error.yaml
> lld :: mach-o/parse-literals.yaml
> lld :: mach-o/parse-non-lazy-pointers.yaml
> lld :: mach-o/parse-relocs-x86.yaml
> lld :: mach-o/parse-section-no-symbol.yaml
> lld :: mach-o/parse-tentative-defs.yaml
> lld :: mach-o/parse-text-relocs-arm64.yaml
> lld :: mach-o/parse-text-relocs-x86_64.yaml
> lld :: mach-o/parse-tlv-relocs-x86-64.yaml
> lld :: mach-o/rpath.yaml
> lld :: mach-o/run-tlv-pass-x86-64.yaml
> lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml
> lld :: mach-o/usage.yaml
> lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range
> lld-Unit :: CoreTests/CoreTests/Range.slice
> lld-Unit :: CoreTests/CoreTests/Range.user1
> lld-Unit :: CoreTests/CoreTests/Range.user2
> 
> These are listed as unresolved for the same underlying reason, for
> example:
> 
> ********************
> UNRESOLVED: lld-Unit :: CoreTests/CoreTests/Range.user1 (25040 of 25181)
> ******************** TEST 'lld-Unit :: CoreTests/CoreTests/Range.user1' FAILED ********************
> Exception during script execution:
> Traceback (most recent call last):
>   File "/src/llvm/utils/lit/lit/run.py", line 166, in execute_test
>     result = test.config.test_format.execute(test, self.lit_config)
>   File "/src/llvm/utils/lit/lit/formats/googletest.py", line 113, in execute
>     cmd, env=test.config.environment)
>   File "/src/llvm/utils/lit/lit/util.py", line 166, in executeCommand
>     env=env, close_fds=kUseCloseFDs)
>   File "/install/ppc64/Python-2.7/lib/python2.7/subprocess.py", line 710, in __init__
>     errread, errwrite)
>   File "/install/ppc64/Python-2.7/lib/python2.7/subprocess.py", line 1231, in _execute_child
>     self.pid = os.fork()
> OSError: [Errno 11] Resource temporarily unavailable
> 
> Because the failures are nondeterministic, running again with the
> default number of parallel lit tests changes which tests fail (for
> example, running a second time adds tests under COFF).
> 
> And, FWIW, these tests generally fail on my system (for reasons
> seemingly unrelated to the thread/process resource issue):
> 
> lld :: Driver/lib-search.test
> lld :: Driver/undef-basic.objtxt
> lld :: elf2/dynamic-reloc.s
> lld :: elf2/shared.s
> lld :: elf2/soname.s
> lld :: elf/librarynotfound.test
> lld :: elf/responsefile.test
> lld :: mach-o/dylib-install-names.yaml
> lld :: mach-o/force_load-dylib.yaml
> lld :: mach-o/lib-search-paths.yaml
> lld :: mach-o/parse-text-relocs-arm64.yaml
> lld :: mach-o/upward-dylib-load-command.yaml
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.AsNeeded
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymAlias
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymDecimal
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymHexadecimal
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymOctal
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Empty
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Entry
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.EntryJoined
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.EntryShort
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.ExportDynamic
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Init
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.InitJoined
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.NoExportDynamic
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.NoinhibitExec
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Output
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.OutputDefault
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Rpath
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.RpathEq
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SOName
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SONameH
> lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SONameSingleDash
> lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Entry
> lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.ExprEval
> lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Group
> lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.IgnoreSearchDirNoStdLib
> lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Input
> lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Output
> lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.SearchDir
> lld-Unit :: DriverTests/DriverTests/UniversalDriver.flavor
> 
> (it could be big-Endian issues, LLVM bugs, etc. -- I've yet to
> investigate).
> 
> The easiest thing to do is to make lld tests run using lit -j 1, but
> we may also want to think about how to more-gracefully handle this
> situation in general, because it seems like something a user is not
> unlikely to hit.
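
One shape "more graceful" handling could take is a pool that accepts
fewer workers than it asked for instead of aborting. The sketch below
shows the idea in Python rather than in lld's C++; `start_workers` and
its `thread_factory` parameter are hypothetical names, and the only
library behavior relied on is that CPython's `Thread.start()` raises
`RuntimeError` when a thread cannot be created.

```python
import threading


def start_workers(count, target, thread_factory=threading.Thread):
    """Start up to `count` threads, stopping early if creation fails."""
    started = []
    for _ in range(count):
        worker = thread_factory(target=target)
        try:
            worker.start()
        except RuntimeError:
            # Thread creation failed (for example, the per-user limit
            # was hit); carry on with the workers already running.
            break
        started.append(worker)
    return started
```

If the returned list is empty, the caller can run the work on the
calling thread, which matches what a single-threaded link would do.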
> 
> Thanks again,
> Hal
> 
> 
> 
> > 
> > Thanks again,
> > Hal
> > 
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> > 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory