# general
a
hello! We recently had a build failing on circleci with:
22:43:57.74 [INFO] Starting: installing node project dependencies
The pantsd process was killed during the run.

If this was not intentionally done by you, Pants may have been killed by the operating system due to memory overconsumption (i.e. OOM-killed). You can set the global option `--pantsd-max-memory-usage` to reduce Pantsd's memory consumption by retaining less in its in-memory cache (run `./pants help-advanced global`). You can also disable pantsd with the global option `--no-pantsd` to avoid persisting memory across Pants runs, although you will miss out on additional caching.

If neither of those help, please consider filing a GitHub issue or reaching out on Slack so that we can investigate the possible memory overconsumption (<https://www.pantsbuild.org/docs/getting-help>).
Traceback (most recent call last):
  File "/home/circleci/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 95, in run_default_entrypoint
    exit_code = runner.run(start_time)
  File "/home/circleci/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/lib/python3.8/site-packages/pants/bin/pants_runner.py", line 86, in run
    return remote_runner.run()
  File "/home/circleci/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/lib/python3.8/site-packages/pants/bin/remote_pants_runner.py", line 99, in run
    return self._connect_and_execute(pantsd_handle)
  File "/home/circleci/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/lib/python3.8/site-packages/pants/bin/remote_pants_runner.py", line 131, in _connect_and_execute
    return PyNailgunClient(port, executor).execute(command, args, modified_env)
native_engine_pyo3.PantsdClientException: The pantsd process was killed during the run.
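For anyone hitting the same message: the two global options it names can be applied straight from the command line. A minimal sketch (the goal and the byte value are placeholders, not recommendations; `./pants help-advanced global` shows the exact option formats for your version):
# Reduce how much pantsd keeps in its in-memory cache (value shown in bytes).
./pants --pantsd-max-memory-usage=1073741824 test ::
# Or disable the daemon entirely, at the cost of cross-run caching.
./pants --no-pantsd test ::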
Memory usage stayed reasonable throughout the build, so it wasn't being killed by our OOM killer. The core dump shows that it was instead killed with SIGABRT, so… I assume the call was coming from inside the house somewhere
circleci@3b6b9430e214:~/project$ eu-readelf --notes  core.1609.\!home\!circleci\!.cache\!pants\!setup\!bootstrap-Linux-x86_64\!pants.y9Lzwt\!install\!bin\!python3.8  | head

Note segment of 151880 bytes at offset 0x97d8:
  Owner          Data size  Type
  CORE                 336  PRSTATUS
    info.si_signo: 6, info.si_code: 0, info.si_errno: 0, cursig: 6
    sigpend: <>
    sighold: <>
    pid: 1779, ppid: 1, pgrp: 1608, sid: 1608
    utime: 0.004000, stime: 0.004000, cutime: 1.475434, cstime: 0.262632
    orig_rax: 14, fpvalid: 1
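For context on reading that note: signal 6 in the PRSTATUS entry is SIGABRT, and the same core can be opened directly in gdb. A rough sketch, with paths taken from the core file name above (point gdb at the exact python3.8 binary that produced the dump):
kill -l 6    # prints ABRT, i.e. cursig 6 above is SIGABRT
gdb /home/circleci/.cache/pants/setup/bootstrap-Linux-x86_64/pants.y9Lzwt/install/bin/python3.8 core.1609.*
# then `bt` at the (gdb) prompt for a backtrace like the one further down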
Removing the pants cache & re-running allowed the build to go through, but I'm curious as to what would be causing the process to abort?
Wondering if anybody has any ideas about this? Should I just wait for it to happen again? (I think at least the error message could be better)
c
I think they've had a holiday; that could explain the lack of response thus far…
a
ah thanks
e
That OOM paragraph is unfortunate for anything other than a SIGKILL. I'd guess you may be hitting https://github.com/pantsbuild/pants/issues/12831, but the only common bit is the SIGABRT, so it's unclear. Now that we've had another report I need to dig and get to the bottom of it, since I think I'm still the only maintainer who can repro.
a
oooo thank you! let me know if I can provide any more info that would help
@enough-analyst-54434 FYI this just happened again! super weird. Let me know if I can help at all
e
I apologize, but I dropped my investigation after sinking several days into it with no results. What's needed is for me or someone else who can reproduce the error in some way (I can, with normal runs on my Linux laptop) to pick this back up and be ready not to drop it; i.e., potentially tough out multiple days in a row of doing nothing but debugging this.
I was basically a bottleneck on too many things.
a
yeah, totally understandable. And the weird part is... we can't really reproduce it either, since it only appears on our CI builds & clears up if the cache changes. I could possibly get you a core dump if... that would be useful?
e
It could be if it's different from the one in https://github.com/pantsbuild/pants/issues/12831. I was not able to wrest the cause from that one even after using custom versions of libraries on that backtrace to drill into causes.
Better safe than sorry though. If you've got 'em, please attach them.
a
I am going through a process with our CTO to check whether core dumps are safe to release! I'll attach here if allowed
(it is 800 megs though; idk if you have a different place I should upload it)
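If it helps with the size, cores like that often compress quite a bit. A quick sketch (filename illustrative):
xz -T0 -9 core.1609.*    # multi-threaded compression; writes a .xz next to the core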
e
Excellent. Thanks for your efforts here. Let's see if the upload works here (DM upload if sensitive). I'm open to any alternative that works for you and your company though.
Ok, @average-australia-85137 provided a core dump and the Docker image it was generated in. His case looks like so:
(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  <signal handler called>
#2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#3  0x00007f2702c73859 in __GI_abort () at abort.c:79
#4  0x00007f270113008c in mdb_assert_fail (env=0x7ee3ae47a930, expr_txt=expr_txt@entry=0x7f2701199393 "rc == 0", func=func@entry=0x7f2701199728 <__func__.16> "mdb_page_dirty", line=line@entry=2121, 
    file=0x7f2701198e20 "/github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c") at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:1536
#5  0x00007f2700eaae65 in mdb_page_dirty (txn=<optimized out>, txn=<optimized out>, mp=<optimized out>) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:2108
#6  mdb_page_dirty (txn=0x7ee3ae77bbb0, mp=<optimized out>) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:2108
#7  0x00007f2700ead365 in mdb_page_alloc (num=num@entry=1, mp=mp@entry=0x7ee3b518cdf8, mc=<optimized out>, mc=<optimized out>) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:2302
#8  0x00007f2700ead582 in mdb_page_touch (mc=mc@entry=0x7ee3b518d320) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:2489
#9  0x00007f2700eaf14c in mdb_cursor_touch (mc=mc@entry=0x7ee3b518d320) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:6481
#10 0x00007f2700eb267f in mdb_cursor_put (mc=mc@entry=0x7ee3b518d320, key=key@entry=0x7ee3b518d790, data=data@entry=0x7ee3b518dc70, flags=<optimized out>, flags@entry=65552)
    at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:6615
#11 0x00007f2700eb52a2 in mdb_put (txn=0x7ee3ae77bbb0, dbi=2, key=0x7ee3b518d790, data=0x7ee3b518dc70, flags=65552) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:8985
#12 0x00007f27009b8a3a in lmdb::transaction::RwTransaction::reserve (self=0x7ee3b518d788, database=..., key=..., len=12232, flags=...) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/src/transaction.rs:303
#13 sharded_lmdb::ShardedLmdb::store::{{closure}}::{{closure}}::{{closure}} (txn=...) at /__w/pants/pants/src/rust/engine/sharded_lmdb/src/lib.rs:424
#14 core::result::Result<T,E>::and_then (self=..., op=...) at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/result.rs:704
#15 sharded_lmdb::ShardedLmdb::store::{{closure}}::{{closure}} () at /__w/pants/pants/src/rust/engine/sharded_lmdb/src/lib.rs:419
#16 task_executor::Executor::spawn_blocking::{{closure}} () at /__w/pants/pants/src/rust/engine/task_executor/src/lib.rs:166
...
And that looks like this: https://www.openldap.org/lists/openldap-devel/201710/msg00019.html
The fix for that came in LMDB 0.9.23: https://git.openldap.org/openldap/openldap/-/blob/LMDB_0.9.23/libraries/liblmdb/CHANGES#L4
We're using a version of Rust lmdb-sys that vendors LMDB 0.9.21: https://github.com/pantsbuild/lmdb-rs/tree/06bdfbfc6348f6804127176e561843f214fc17f8/lmdb-sys
We already have an issue to move off of lmdb and onto lmdb-rkv, which would get us to LMDB 0.9.24 and past this particular LMDB bug: https://github.com/pantsbuild/pants/issues/14115 cc @witty-crayon-22786
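For anyone who wants to double-check the vendored version, a rough sketch against the pinned checkout linked above (the header path is assumed to mirror the mdb.c path in the backtrace):
git clone https://github.com/pantsbuild/lmdb-rs && cd lmdb-rs
git checkout 06bdfbfc6348f6804127176e561843f214fc17f8
grep -E 'MDB_VERSION_(MAJOR|MINOR|PATCH)' lmdb-sys/lmdb/libraries/liblmdb/lmdb.h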
So, thanks Nate. It's not at all clear that this one core dump is representative of all the SIGABRTs you see. We've seen other backtraces, as outlined in https://github.com/pantsbuild/pants/issues/12831, and you could be hitting a mix over time.
w
mm. yea, even more reason to fix that. sorry for the trouble Nate.
i’ll scope that one quickly.
a
Rad! Thanks so much for the really quick response - I'm pretty sure (at least on our end) this is the main SIGABRT we've been seeing (not under load & clearing the cache reliably fixes it)