[llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second attempt to perform debug registration" assert

Lang Hames via llvm-dev llvm-dev at lists.llvm.org
Tue Jun 2 18:01:51 PDT 2020


Hi Adam,

Sorry for the late response! I'm very glad you were able to identify and
solve the problem, and thank you for sharing your analysis: It will help
anyone else who gets tripped up by this.

Regards,
Lang.

On Fri, May 29, 2020 at 1:22 PM Straw, Adam D <adam.d.straw at intel.com>
wrote:

> Closing the gap on this.  Moving nGraph MLIR pointer to the 3c5ddd58 git
> hash did indeed fix the problem.  Thanks so much for all the help.
>
>
>
> *From:* Straw, Adam D
> *Sent:* Wednesday, May 27, 2020 11:54 AM
> *To:* Lang Hames <lhames at gmail.com>; David Blaikie <dblaikie at gmail.com>
> *Cc:* llvm-dev at lists.llvm.org; Cyphers, Scott <Scott.Cyphers at intel.com>;
> Bobba, Jayaram <jayaram.bobba at intel.com>; Mostafa, Nagy H <
> Nagy.H.Mostafa at intel.com>; Iyer, Swapna <swapna.iyer at intel.com>;
> Caballero, Diego <diego.caballero at intel.com>
> *Subject:* RE: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
> attempt to perform debug registration" assert
>
>
>
> Lang,
>
>
>
> It looks like the change I need is here, correct?
>
>
> https://github.com/llvm/llvm-project/commit/3c5dd5863c34ecd51e9d2a49929877d8151dea39
>
>
>
> I would like to move the nGraph MLIR pointer all the way to the tip of
> master but it is expensive (lots of compile and link issues) so hoping to
> take a baby step.
>
>
>
> Adam
>
>
>
> *From:* Straw, Adam D
> *Sent:* Tuesday, May 26, 2020 1:17 PM
> *To:* Lang Hames <lhames at gmail.com>; David Blaikie <dblaikie at gmail.com>
> *Cc:* llvm-dev at lists.llvm.org; Cyphers, Scott <Scott.Cyphers at intel.com>;
> Bobba, Jayaram <jayaram.bobba at intel.com>; Mostafa, Nagy H <
> Nagy.H.Mostafa at intel.com>; Iyer, Swapna <swapna.iyer at intel.com>
> *Subject:* RE: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
> attempt to perform debug registration" assert
>
>
>
> Referring to the log messages from my previous mail…  I confused myself
> (and probably others) by reading the “Adding MemMgr 0x55555959f440“
> message as “*Registering* MemMgr 0x55555959f440”.  Thus the address
> mismatch made no sense.  How could we be registering a `MemMgr` address/key
> that does not match once we arrive in `notifyObjectLoaded` method?
>
>
>
> Answer:  Because the registrations is NOT coming from
> RTDyldObjectLinkingLayer.cpp.  I found this first by adding an additional
> debug statement (highlighted) thinking the `reinterpret_cast` may be at
> fault for the key mismatch.  I never saw this print and I then realized my
> “Adding” vs. “Registering” mistake.
>
>
>
>   // Run EventListener notifyLoaded callbacks.
>
>   {
>
>     std::lock_guard<std::mutex> Lock(RTDyldLayerMutex);
>
>     dbgs() << "Registering MemMgr " << MemMgr << "\n";
>
>     auto LOIItr = LoadedObjInfos.find(MemMgr);
>
>     assert(LOIItr != LoadedObjInfos.end() && "LoadedObjInfo missing");
>
>     for (auto *L : EventListeners) {
>
>       dbgs() << "  key = " << reinterpret_cast<uintptr_t>(MemMgr) << "\n";
>
>       L->notifyObjectLoaded(
>
>           static_cast<uint64_t>(reinterpret_cast<uintptr_t>(MemMgr)), *Obj,
>
>           *LOIItr->second);
>
>     }
>
>     LoadedObjInfos.erase(MemMgr);
>
>   }
>
>
>
> So, where is the registration coming from?  Remember that I am on a
> slightly older version of MLIR.  My
> mlir/lib/ExecutionEngine/ExecutionEngine.cpp has the following code with
> highlighted debug statement added by me:
>
>
>
>     objectLayer->setNotifyLoaded(
>
>         [engine = engine.get()](
>
>             llvm::orc::VModuleKey, const llvm::object::ObjectFile &object,
>
>             const llvm::RuntimeDyld::LoadedObjectInfo &objectInfo) {
>
>           if (engine->gdbListener) {
>
>             uint64_t key = static_cast<uint64_t>(
>
>                 reinterpret_cast<uintptr_t>(object.getData().data()));
>
>             dbgs() << "ExecutionEngine key = " << key << "\n";
>
>             engine->gdbListener->notifyObjectLoaded(key, object,
> objectInfo);
>
>           }
>
>         });
>
>
>
> The key printed here matches the key printed in `notifyObjectLoaded`.
> This code is NOT present at the tip of MLIR master.  Hence the solution
> here may be to move to latest MLIR master which I probably should have
> tried before sending this mail.  Will do that now.
>
>
>
> Adam
>
>
>
> *From:* Straw, Adam D
> *Sent:* Tuesday, May 26, 2020 9:41 AM
> *To:* Lang Hames <lhames at gmail.com>; David Blaikie <dblaikie at gmail.com>
> *Cc:* llvm-dev at lists.llvm.org; Cyphers, Scott <Scott.Cyphers at intel.com>;
> Bobba, Jayaram <jayaram.bobba at intel.com>; Mostafa, Nagy H <
> Nagy.H.Mostafa at intel.com>; Iyer, Swapna <swapna.iyer at intel.com>
> *Subject:* RE: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
> attempt to perform debug registration" assert
>
>
>
> Sorry for the delay.  Hope everyone had a good long weekend.  See attached
> log captured from GDB session.  Here is the punchline:
>
>
>
> Adding MemMgr 0x55555959f440
>
> unit-test:
> /localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp:164:
> virtual void
> {anonymous}::GDBJITRegistrationListener::notifyObjectLoaded(llvm::JITEventListener::ObjectKey,
> const llvm::object::ObjectFile&, const
> llvm::RuntimeDyld::LoadedObjectInfo&): Assertion `ObjectBufferMap.find(K)
> == ObjectBufferMap.end() && "Second attempt to perform debug
> registration."' failed.
>
>
>
> (gdb) print/x K
>
> $2 = 0x555559605c50
>
>
>
> The key (K) causing the assert a) does not match the addresses of the most
> recently added MemMgr and b) does not match the address of ANY MemMgr in
> the entire log file.  The plot thickens.
>
>
>
> Wondering if there is some other object (besides a MemMgr) that registers
> its address-as-key into `ObjectBufferMap`.
>
>
>
> Adam
>
>
>
> *From:* Lang Hames <lhames at gmail.com>
> *Sent:* Thursday, May 21, 2020 4:53 PM
> *To:* David Blaikie <dblaikie at gmail.com>
> *Cc:* Straw, Adam D <adam.d.straw at intel.com>; llvm-dev at lists.llvm.org;
> Cyphers, Scott <scott.cyphers at intel.com>; Bobba, Jayaram <
> jayaram.bobba at intel.com>; Mostafa, Nagy H <nagy.h.mostafa at intel.com>
> *Subject:* Re: [llvm-dev] [ORC JIT][MLIR] GDBRegistrationListener "second
> attempt to perform debug registration" assert
>
>
>
> Hi Adam,
>
>
>
> Calls to the listeners should be protected by the RTDyldLayerMutex. Could
> you apply the attached patch and share the debugging output from one of the
> failing runs?
>
>
>
> Regards,
>
> Lang.
>
>
>
> On Wed, May 20, 2020 at 8:00 PM David Blaikie <dblaikie at gmail.com> wrote:
>
> +Lang
>
>
>
> On Wed, May 20, 2020 at 4:44 PM Straw, Adam D via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Hi all,
>
> Attention:  Lang Hames
>
>
>
> I am developing the nGraph MLIR
> <https://github.com/NervanaSystems/ngraph/tree/master/src/contrib/mlir>
> implementation and hitting the following assert while running nGraph unit
> tests:
>
>
>
>   assert(ObjectBufferMap.find(K) == ObjectBufferMap.end() &&
>
>          "Second attempt to perform debug registration.");
>
>
>
> Here is a permalink
> <https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L163>
> for that line of code on GitHub.  The failure only occurs when running
> multiple unit tests back-to-back.  Inevitably, if I rerun the failing unit
> test, it passes.  The failure also tends to move around with a different
> unit tests failing on successive runs.
>
>
>
> I am able to hit the failure in GDB.  Here is a partial backtrace:
>
>
>
> #3  0x00007fffec4cf412 in __GI___assert_fail (assertion=assertion at entry=0x7ffff1d145e0
> "ObjectBufferMap.find(K) == ObjectBufferMap.end() && \"Second attempt to
> perform debug registration.\"",
>
>     file=file at entry=0x7ffff1d144d8
> "/localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp",
> line=line at entry=164,
>
>     function=function at entry=0x7ffff1d15520 <(anonymous
> namespace)::GDBJITRegistrationListener::notifyObjectLoaded(unsigned long,
> llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo
> const&)::__PRETTY_FUNCTION__> "virtual void
> {anonymous}::GDBJITRegistrationListener::notifyObjectLoaded(llvm::JITEventListener::ObjectKey,
> const llvm::object::ObjectFile&, const
> llvm::RuntimeDyld::LoadedObjectInfo&)")
>
>     at assert.c:101
>
> #4  0x00007ffff01ed4d3 in (anonymous
> namespace)::GDBJITRegistrationListener::notifyObjectLoaded
> (this=0x5555591a9270, K=93825060893584, Obj=..., L=...)
>
>     at
> /localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp:163
>
> #5  0x00007ffff01b0eca in std::function<void (unsigned long,
> llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo
> const&)>::operator()(unsigned long, llvm::object::ObjectFile const&,
> llvm::RuntimeDyld::LoadedObjectInfo const&) const (__args#2=...,
> __args#1=..., __args#0=<optimized out>, this=0x5555594e09b8) at
> /usr/include/c++/7/bits/std_function.h:706
>
> #6  llvm::orc::RTDyldObjectLinkingLayer::onObjLoad (this=<optimized out>,
> K=3, R=..., Obj=..., MemMgr=0x5555591c4160,
> LoadedObjInfo=std::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo> = {...},
>
>     Resolved=std::map with 8 elements = {...}, InternalSymbols=std::set
> with 0 elements)
>
>     at
> /localdisk/adstraw/ngraph/build/mlir_project/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:267
>
>
>
> A little debug context:
>
>
>
>    - *Frame #6* with K=3 we are about to execute the `NotifyLoaded`
>    callback
>    - *Frame #5* we execute the `NotifyLoaded` callback which apparently
>    transforms K from 3 to 93825060893584 (0x5555596CF390) which looks like a
>    pointer.  I don’t know the code well enough to grok what’s going on here.
>    I’m not even sure where this callback resides, but it could be here
>    <https://github.com/llvm/llvm-project/blob/c66f89005f6d23b6885d8f93f33ff27dc60ce7dd/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp#L312>
>    in RTDyldObjectLinkingLater.cpp where we are using the address of the
>    `MemMgr` as K.
>    - *Frame #4 *now in `notifyObjectLoaded` with K=93825060893584 which
>    will go on to assert in *frame #3* due to the fact that that key (K)
>    already exists in the map
>
>
>
> All of this leads me to believe there is some sort of race.  Theory:  with
> two unit tests A and B running back to back…
>
>
>
>    1. Unit test A allocates and deallocates a `MemMgr` (or whatever other
>    object we are using for a key) at address X
>    2. Unit test A calls ` notifyFreeingObject` to free the object at K=X
>    but gets stuck (does not lock) the ` JITDebugLock` mutex here
>    <https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L181>
>    for whatever reason
>    3. Meanwhile unit test B allocates a `MemMgr` (or whatever other
>    object we are using for a key) at newly freed address X
>    4. Unit test B calls `notifyObjectLoaded` to register the object at
>    K=X and locks the ` JITDebugLock` here
>    <https://github.com/llvm/llvm-project/blob/3d5360a4398bfa6878f94ca9ac55bc568692c765/llvm/lib/ExecutionEngine/GDBRegistrationListener.cpp#L162>
>    with unit test A still waiting in step #2 above
>    5. This goes on to assert as the key was not erased (step #2) before
>    it was added (step #4)
>
>
>
> I could use a little help to debug the error further.  Curiously, we do
> hit this issue in continuous integration (CI) --- only when running on
> development systems “by hand”.  I am investigating the differences (perhaps
> single vs. multi threaded?) as to why this might be the case.
>
>
>
> Note that I am using a slightly old version of MLIR master, I’m at this
> commit:
>
>
>
> commit 2f8b4545f4960778e37114c024073d208751ca89
>
> Author: Adam Straw <adam.d.straw at intel.com>
>
> Date:   Tue Apr 14 22:49:18 2020 +0300
>
>
>
>     [mlir] Fix assert on signed integer type in EDSC
>
>
>
>     Integer type in Std dialect is signless so we should be checking
>
>     for signless integer type instead of signed integer type in EDSC.
>
>
>
>     Differential Revision: https://reviews.llvm.org/D78144
>
>
>
> Thanks in advance for the help.
>
>
>
> Adam
>
>
>
> Note:  I am out of office this Fri and the following Mon but otherwise
> should be prompt with email replies.
>
>
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200602/44f0c341/attachment-0001.html>


More information about the llvm-dev mailing list