[llvm-dev] lld and thread over-subscription

Fri Oct 2 10:27:49 PDT 2015

----- Original Message -----
> From: "Rui Ueyama" <ruiu at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Developers" <llvm-dev at lists.llvm.org>, "Rafael Espindola" <rafael.espindola at gmail.com>
> Sent: Thursday, October 1, 2015 1:48:34 PM
> Subject: Re: lld and thread over-subscription
> 
> 
> 
> 
> On Thu, Oct 1, 2015 at 11:22 AM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> ----- Original Message -----
> > From: "Rui Ueyama" < ruiu at google.com >
> > To: "Hal Finkel" < hfinkel at anl.gov >
> > Cc: "LLVM Developers" < llvm-dev at lists.llvm.org >, "Rafael
> > Espindola" < rafael.espindola at gmail.com >
> > Sent: Thursday, October 1, 2015 12:55:20 PM
> > Subject: Re: lld and thread over-subscription
> > 
> > 
> > I honestly think that the ulimit of 1024 max threads is too strict
> > for a 48-core machine. Processes are independent of each other, so it
> > is not strange for them to spawn as many threads as the number of
> > cores.
> 
> It is an understandable misconfiguration, but not something desirable
> in production.
> 
> > What's the reason you cannot increase the limit?
> > 
> 
> It is a soft limit, and I can. Running 'ulimit -u 3072' and then
> re-running lit causes these failures to go away. My concern is that
> a soft process limit of 1024 is a common default (at least on any
> RedHat-derived Linux distribution) regardless of the number of cores
> on the machine. And, obviously, parallel makes are still very
> common.
> 
> Regardless, do you think it would be reasonable for lit to adjust the
> soft process limit by default to avoid these kinds of issues, at
> least when running our regression tests?
> 
> 
> 
> Yes, I do. If we can avoid the issue by adjusting the soft limit in
> lit, I don't see any reason to not do that.
> 

http://reviews.llvm.org/D13389
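
For readers following along, here is a minimal sketch of how a Python test runner could raise its soft process limit at startup. It is not the content of the review above. It assumes a platform where Python's resource module exposes RLIMIT_NPROC; the function names are hypothetical, and the target of 3072 is just the value that worked with 'ulimit -u 3072' on my machine.

```python
import resource


def desired_soft_limit(soft, hard, wanted):
    """Pick the soft limit to request, never exceeding the hard limit."""
    inf = resource.RLIM_INFINITY
    if soft == inf or soft >= wanted:
        return soft  # already high enough, leave it alone
    if hard == inf or hard >= wanted:
        return wanted  # the hard limit allows the full request
    return hard  # otherwise go as high as the hard limit permits


def raise_process_limit(wanted=3072):
    """Try to raise the soft RLIMIT_NPROC; return the limit now in effect.

    Returns None where RLIMIT_NPROC is not available.
    """
    limit = getattr(resource, "RLIMIT_NPROC", None)
    if limit is None:
        return None
    soft, hard = resource.getrlimit(limit)
    new_soft = desired_soft_limit(soft, hard, wanted)
    if new_soft != soft:
        try:
            resource.setrlimit(limit, (new_soft, hard))
        except (ValueError, OSError):
            return soft  # could not change it; keep the old limit
    return new_soft
```

Child processes inherit the limit, so calling something like raise_process_limit() once before the tests are spawned would cover them as well. An unprivileged process can only raise the soft limit as far as the hard limit.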

Thanks again,
Hal

> 
> Thanks again,
> Hal
> 
> 
> 
> > 
> > On Thu, Oct 1, 2015 at 10:26 AM, Hal Finkel < hfinkel at anl.gov >
> > wrote:
> > 
> > 
> > 
> > 
> > ----- Original Message -----
> > > From: "Rui Ueyama" < ruiu at google.com >
> > > To: "Hal Finkel" < hfinkel at anl.gov >
> > > Cc: "LLVM Developers" < llvm-dev at lists.llvm.org >, "Rafael
> > > Espindola" < rafael.espindola at gmail.com >
> > > Sent: Thursday, October 1, 2015 11:46:05 AM
> > > Subject: Re: lld and thread over-subscription
> > > 
> > > On Thu, Oct 1, 2015 at 9:35 AM, Hal Finkel < hfinkel at anl.gov >
> > > wrote:
> > > 
> > > Hi Rui, et al.,
> > > 
> > > I was experimenting yesterday with building lld on my POWER7
> > > PPC64/Linux machine, and ran into an unfortunate problem. When
> > > running the regression tests under lit, almost all of the tests
> > > fail like this:
> > > 
> > > terminate called after throwing an instance of 'std::system_error'
> > > what(): Resource temporarily unavailable
> > > ...
> > > 5 libc.so.6 0x00000080b7847238 abort + 4293480680
> > > 6 libstdc++.so.6 0x00000fff94f0f004 __gnu_cxx::__verbose_terminate_handler() + 4294099316
> > > 7 libstdc++.so.6 0x00000fff94f0bc84
> > > 8 libstdc++.so.6 0x00000fff94f0bccc std::terminate() + 4294087956
> > > 9 libstdc++.so.6 0x00000fff94f0c0c4 __cxa_throw + 4294088780
> > > 10 libstdc++.so.6 0x00000fff94f816e0 std::__throw_system_error(int) + 4294526808
> > > 11 libstdc++.so.6 0x00000fff94f83d30 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) + 4294534936
> > > 12 lld 0x000000001002a278
> > > ...
> > > 
> > > which seems to indicate a core problem here in dealing with
> > > thread-resource exhaustion. For almost all tests, running them
> > > individually (or using lit -j 1) works without a problem. We could
> > > deal with this by limiting the number of threads lld uses when
> > > running regression tests, or by limiting the number of threads that
> > > lit uses when running lld tests (as we currently do with the OpenMP
> > > runtime tests), but I'm somewhat concerned that users will run into
> > > this problem regardless with heavily parallelized builds.
> > > 
> > > We could try to catch exceptions that otherwise come from
> > > ThreadPoolExecutor's constructor, but do we compile with exceptions
> > > enabled?
> > > 
> > > I guess we do not want to enable exceptions to deal with the issue.
> > > Are COFF tests failing, or just ELF tests? If ELF tests for the old
> > > LLD are failing, the best way would be to not use threads in the old
> > > LLD. It has lingering threading issues.
> > > 
> > 
> > To provide a data point, my default environment has this:
> > 
> > $ ulimit -a | grep proc
> > max user processes (-u) 1024
> > 
> > This machine has 48 cores, so lit running 48 tests in parallel leaves
> > each test with only about 20 available threads (1024 / 48 is roughly
> > 21), much less than the 48 each test believes it can use.
> > 
> > This is somewhat non-deterministic, but I just reran things both
> > ways, and here's what I see:
> > 
> > During my last run, these tests failed when running under lit with
> > many parallel tests, but do not fail when run otherwise:
> > 
> > lld :: elf2/basic.s
> > lld :: elf/AArch64/general-dyn-tls-0.test
> > lld :: elf/AArch64/initial-exec-tls-0.test
> > lld :: elf/AArch64/rel-prel32-overflow.test
> > lld :: elf/AArch64/rel-prel64.test
> > lld :: elf/AMDGPU/hsa.test
> > lld :: elf/ARM/arm-symbols.test
> > lld :: elf/ARM/dynamic-symbols.test
> > lld :: elf/ARM/entry-point.test
> > lld :: elf/ARM/exidx.test
> > lld :: elf/ARM/header-flags.test
> > lld :: elf/ARM/mapping-code-model.test
> > lld :: elf/ARM/mapping-symbols.test
> > lld :: elf/ARM/missing-symbol.test
> > lld :: elf/ARM/plt-dynamic.test
> > lld :: elf/ARM/plt-ifunc-interwork.test
> > lld :: elf/ARM/plt-ifunc-mapping.test
> > lld :: elf/ARM/rel-arm-call.test
> > lld :: elf/ARM/rel-arm-jump24-veneer-b.test
> > lld :: elf/ARM/rel-arm-mov.test
> > lld :: elf/ARM/rel-arm-prel31.test
> > lld :: elf/ARM/rel-arm-target1.test
> > lld :: elf/ARM/rel-arm-thm-interwork.test
> > lld :: elf/ARM/undef-lazy-symbol.test
> > lld :: elf/Hexagon/dynlib-data.test
> > lld :: elf/Mips/exe-dynamic.test
> > lld :: elf/Mips/exe-dynsym.test
> > lld :: elf/Mips/exe-fileheader-64.test
> > lld :: elf/Mips/exe-fileheader-micro-64.test
> > lld :: elf/Mips/exe-fileheader-n32.test
> > lld :: elf/Mips/exe-got-micro.test
> > lld :: elf/Mips/exe-got.test
> > lld :: elf/Mips/got16-2.test
> > lld :: elf/Mips/got16-micro.test
> > lld :: elf/Mips/got-page-32-micro.test
> > lld :: elf/Mips/got-page-64-micro.test
> > lld :: elf/Mips/got-page-64.test
> > lld :: elf/X86_64/sectionchoice.test
> > lld :: elf/X86_64/sectionmap.test
> > lld :: mach-o/arm-interworking.yaml
> > lld :: mach-o/arm-shims.yaml
> > lld :: mach-o/data-only-dylib.yaml
> > lld :: mach-o/executable-exports.yaml
> > lld :: mach-o/exe-offsets.yaml
> > lld :: mach-o/exported_symbols_list-undef.yaml
> > lld :: mach-o/fat-archive.yaml
> > lld :: mach-o/flat_namespace_undef_error.yaml
> > lld :: mach-o/flat_namespace_undef_suppress.yaml
> > lld :: mach-o/force_load-x86_64.yaml
> > lld :: mach-o/got-order.yaml
> > lld :: mach-o/hello-world-arm64.yaml
> > lld :: mach-o/hello-world-armv6.yaml
> > lld :: mach-o/hello-world-x86_64.yaml
> > lld :: mach-o/hello-world-x86.yaml
> > lld :: mach-o/keep_private_externs.yaml
> > lld :: mach-o/lazy-bind-x86_64.yaml
> > lld :: mach-o/library-rescan.yaml
> > lld :: mach-o/mh_bundle_header.yaml
> > lld :: mach-o/mh_dylib_header.yaml
> > lld :: mach-o/objc_export_list.yaml
> > lld :: mach-o/order_file-basic.yaml
> > lld :: mach-o/parse-aliases.yaml
> > lld :: mach-o/parse-cfstring32.yaml
> > lld :: mach-o/parse-cfstring64.yaml
> > lld :: mach-o/parse-compact-unwind32.yaml
> > lld :: mach-o/parse-compact-unwind64.yaml
> > lld :: mach-o/parse-data-in-code-armv7.yaml
> > lld :: mach-o/parse-data-in-code-x86.yaml
> > lld :: mach-o/parse-data-relocs-arm64.yaml
> > lld :: mach-o/parse-data-relocs-x86_64.yaml
> > lld :: mach-o/parse-data.yaml
> > lld :: mach-o/parse-eh-frame-relocs-x86_64.yaml
> > lld :: mach-o/parse-eh-frame-x86-anon.yaml
> > lld :: mach-o/parse-eh-frame-x86-labeled.yaml
> > lld :: mach-o/parse-eh-frame.yaml
> > lld :: mach-o/parse-function.yaml
> > lld :: mach-o/parse-initializers32.yaml
> > lld :: mach-o/parse-initializers64.yaml
> > lld :: mach-o/parse-literals-error.yaml
> > lld :: mach-o/parse-literals.yaml
> > lld :: mach-o/parse-non-lazy-pointers.yaml
> > lld :: mach-o/parse-relocs-x86.yaml
> > lld :: mach-o/parse-section-no-symbol.yaml
> > lld :: mach-o/parse-tentative-defs.yaml
> > lld :: mach-o/parse-text-relocs-x86_64.yaml
> > lld :: mach-o/parse-tlv-relocs-x86-64.yaml
> > lld :: mach-o/re-exported-dylib-ordinal.yaml
> > lld :: mach-o/rpath.yaml
> > lld :: mach-o/run-tlv-pass-x86-64.yaml
> > lld :: mach-o/sectalign.yaml
> > lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml
> > lld :: mach-o/usage.yaml
> > lld :: mach-o/use-simple-dylib.yaml
> > lld :: mach-o/write-final-sections.yaml
> > lld :: mach-o/wrong-arch-error.yaml
> > lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range
> > lld-Unit :: CoreTests/CoreTests/Range.slice
> > lld-Unit :: CoreTests/CoreTests/Range.user1
> > lld-Unit :: CoreTests/CoreTests/Range.user2
> > 
> > Of these, the following tests don't fail, but are reported as
> > 'Unresolved' (which does not happen if I run lit -j 1):
> > 
> > lld :: elf/ARM/mapping-code-model.test
> > lld :: elf/ARM/mapping-symbols.test
> > lld :: elf/ARM/missing-symbol.test
> > lld :: elf/ARM/plt-ifunc-interwork.test
> > lld :: elf/ARM/rel-arm-jump24-veneer-b.test
> > lld :: elf/Mips/exe-got-micro.test
> > lld :: elf/Mips/exe-got.test
> > lld :: elf/Mips/got16-micro.test
> > lld :: mach-o/parse-cfstring64.yaml
> > lld :: mach-o/parse-compact-unwind32.yaml
> > lld :: mach-o/parse-compact-unwind64.yaml
> > lld :: mach-o/parse-data-in-code-armv7.yaml
> > lld :: mach-o/parse-data-in-code-x86.yaml
> > lld :: mach-o/parse-data-relocs-arm64.yaml
> > lld :: mach-o/parse-data-relocs-x86_64.yaml
> > lld :: mach-o/parse-data.yaml
> > lld :: mach-o/parse-eh-frame-relocs-x86_64.yaml
> > lld :: mach-o/parse-eh-frame-x86-anon.yaml
> > lld :: mach-o/parse-eh-frame-x86-labeled.yaml
> > lld :: mach-o/parse-eh-frame.yaml
> > lld :: mach-o/parse-function.yaml
> > lld :: mach-o/parse-initializers32.yaml
> > lld :: mach-o/parse-initializers64.yaml
> > lld :: mach-o/parse-literals-error.yaml
> > lld :: mach-o/parse-literals.yaml
> > lld :: mach-o/parse-non-lazy-pointers.yaml
> > lld :: mach-o/parse-relocs-x86.yaml
> > lld :: mach-o/parse-section-no-symbol.yaml
> > lld :: mach-o/parse-tentative-defs.yaml
> > lld :: mach-o/parse-text-relocs-arm64.yaml
> > lld :: mach-o/parse-text-relocs-x86_64.yaml
> > lld :: mach-o/parse-tlv-relocs-x86-64.yaml
> > lld :: mach-o/rpath.yaml
> > lld :: mach-o/run-tlv-pass-x86-64.yaml
> > lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml
> > lld :: mach-o/usage.yaml
> > lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range
> > lld-Unit :: CoreTests/CoreTests/Range.slice
> > lld-Unit :: CoreTests/CoreTests/Range.user1
> > lld-Unit :: CoreTests/CoreTests/Range.user2
> > 
> > These are listed as unresolved for the same underlying reason, for
> > example:
> > 
> > ********************
> > UNRESOLVED: lld-Unit :: CoreTests/CoreTests/Range.user1 (25040 of 25181)
> > ******************** TEST 'lld-Unit :: CoreTests/CoreTests/Range.user1' FAILED ********************
> > Exception during script execution:
> > Traceback (most recent call last):
> >   File "/src/llvm/utils/lit/lit/run.py", line 166, in execute_test
> >     result = test.config.test_format.execute(test, self.lit_config)
> >   File "/src/llvm/utils/lit/lit/formats/googletest.py", line 113, in execute
> >     cmd, env=test.config.environment)
> >   File "/src/llvm/utils/lit/lit/util.py", line 166, in executeCommand
> >     env=env, close_fds=kUseCloseFDs)
> >   File "/install/ppc64/Python-2.7/lib/python2.7/subprocess.py", line 710, in __init__
> >     errread, errwrite)
> >   File "/install/ppc64/Python-2.7/lib/python2.7/subprocess.py", line 1231, in _execute_child
> >     self.pid = os.fork()
> > OSError: [Errno 11] Resource temporarily unavailable
> > 
> > Because the failures are naturally nondeterministic, rerunning with
> > the default number of parallel lit tests changes which tests fail
> > (for example, a second run adds tests under COFF).
> > 
> > And, FWIW, these tests generally fail on my system (for reasons
> > seemingly unrelated to the thread/process resource issue):
> > 
> > lld :: Driver/lib-search.test
> > lld :: Driver/undef-basic.objtxt
> > lld :: elf2/dynamic-reloc.s
> > lld :: elf2/shared.s
> > lld :: elf2/soname.s
> > lld :: elf/librarynotfound.test
> > lld :: elf/responsefile.test
> > lld :: mach-o/dylib-install-names.yaml
> > lld :: mach-o/force_load-dylib.yaml
> > lld :: mach-o/lib-search-paths.yaml
> > lld :: mach-o/parse-text-relocs-arm64.yaml
> > lld :: mach-o/upward-dylib-load-command.yaml
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.AsNeeded
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymAlias
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymDecimal
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymHexadecimal
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymOctal
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Empty
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Entry
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.EntryJoined
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.EntryShort
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.ExportDynamic
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Init
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.InitJoined
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.NoExportDynamic
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.NoinhibitExec
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Output
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.OutputDefault
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Rpath
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.RpathEq
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SOName
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SONameH
> > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SONameSingleDash
> > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Entry
> > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.ExprEval
> > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Group
> > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.IgnoreSearchDirNoStdLib
> > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Input
> > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Output
> > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.SearchDir
> > lld-Unit :: DriverTests/DriverTests/UniversalDriver.flavor
> > 
> > (it could be big-Endian issues, LLVM bugs, etc. -- I've yet to
> > investigate).
> > 
> > The easiest thing to do is to make lld tests run using lit -j 1, but
> > we may also want to think about how to handle this situation more
> > gracefully in general, because it seems like something a user is not
> > unlikely to hit.
> > 
> > Thanks again,
> > Hal
> > 
> > 
> > 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory