# general
a
hello! We recently had a build failing on circleci with:
22:43:57.74 [INFO] Starting: installing node project dependencies
The pantsd process was killed during the run.

If this was not intentionally done by you, Pants may have been killed by the operating system due to memory overconsumption (i.e. OOM-killed). You can set the global option `--pantsd-max-memory-usage` to reduce Pantsd's memory consumption by retaining less in its in-memory cache (run `./pants help-advanced global`). You can also disable pantsd with the global option `--no-pantsd` to avoid persisting memory across Pants runs, although you will miss out on additional caching.

If neither of those help, please consider filing a GitHub issue or reaching out on Slack so that we can investigate the possible memory overconsumption (<https://www.pantsbuild.org/docs/getting-help>).
Traceback (most recent call last):
  File "/home/circleci/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 95, in run_default_entrypoint
    exit_code = runner.run(start_time)
  File "/home/circleci/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/lib/python3.8/site-packages/pants/bin/pants_runner.py", line 86, in run
    return remote_runner.run()
  File "/home/circleci/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/lib/python3.8/site-packages/pants/bin/remote_pants_runner.py", line 99, in run
    return self._connect_and_execute(pantsd_handle)
  File "/home/circleci/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/lib/python3.8/site-packages/pants/bin/remote_pants_runner.py", line 131, in _connect_and_execute
    return PyNailgunClient(port, executor).execute(command, args, modified_env)
native_engine_pyo3.PantsdClientException: The pantsd process was killed during the run.
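For anyone hitting the same message: the two global options it names can be applied straight from the command line. A minimal sketch (the goal and the byte value are placeholders, not recommendations; `./pants help-advanced global` shows the exact option formats for your version):
# Reduce how much pantsd keeps in its in-memory cache (value shown in bytes).
./pants --pantsd-max-memory-usage=1073741824 test ::
# Or disable the daemon entirely, at the cost of cross-run caching.
./pants --no-pantsd test ::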
Memory usage stayed reasonable throughout the build, so it wasn't being killed by our OOM killer. The core dump shows that it was instead killed with SIGABRT, so… I assume the call was coming from inside the house somewhere
circleci@3b6b9430e214:~/project$ eu-readelf --notes  core.1609.\!home\!circleci\!.cache\!pants\!setup\!bootstrap-Linux-x86_64\!pants.y9Lzwt\!install\!bin\!python3.8  | head

Note segment of 151880 bytes at offset 0x97d8:
  Owner          Data size  Type
  CORE                 336  PRSTATUS
    info.si_signo: 6, info.si_code: 0, info.si_errno: 0, cursig: 6
    sigpend: <>
    sighold: <>
    pid: 1779, ppid: 1, pgrp: 1608, sid: 1608
    utime: 0.004000, stime: 0.004000, cutime: 1.475434, cstime: 0.262632
    orig_rax: 14, fpvalid: 1
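For context on reading that note: signal 6 in the PRSTATUS entry is SIGABRT, and the same core can be opened directly in gdb. A rough sketch, with paths taken from the core file name above (point gdb at the exact python3.8 binary that produced the dump):
kill -l 6    # prints ABRT, i.e. cursig 6 above is SIGABRT
gdb /home/circleci/.cache/pants/setup/bootstrap-Linux-x86_64/pants.y9Lzwt/install/bin/python3.8 core.1609.*
# then `bt` at the (gdb) prompt for a backtrace like the one further down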
Removing the pants cache & re-running allowed the build to go through, but I'm curious as to what would be causing the process to abort?
Wondering if anybody has any ideas about this? Should I just wait for it to happen again? (I think at least the error message could be better)
c
I think they've had a holiday; that could explain the lack of response thus far…
a
ah thanks
e
That OOM paragraph is unfortunate for anything other than a SIGKILL. I'd guess you may be hitting https://github.com/pantsbuild/pants/issues/12831, but the only common bit is the SIGABRT, so it's unclear. Now that we've had another report I need to dig and get to the bottom of it, since I think I'm still the only maintainer who can repro.
a
oooo thank you! let me know if I can provide any more info that would help
@enough-analyst-54434 FYI this just happened again! super weird. Let me know if I can help at all
e
I apologize, but I dropped my investigation after sinking several days into it with no results. What's needed is for me or someone else who can reproduce the error in some way (I can, with normal runs on my Linux laptop) to pick this back up and be ready not to drop it; i.e., potentially tough out multiple days in a row of doing nothing but debugging this.
I was basically a bottleneck on too many things.
a
yeah, totally understandable. And the weird part is... we can't really reproduce it either, since it only appears on our CI builds & clears up if the cache changes. I could possibly get you a core dump if... that would be useful?
e
It could be if it's different from the one in https://github.com/pantsbuild/pants/issues/12831. I was not able to wrest the cause from that one even after using custom versions of libraries on that backtrace to drill into causes.
Better safe than sorry though. If you've got 'em, please attach them.
a
I am going through a process with our CTO to check whether core dumps are safe to release! I'll attach here if allowed
(it is 800 megs though; idk if you have a different place I should upload it)
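If it helps with the size, cores like that often compress quite a bit. A quick sketch (filename illustrative):
xz -T0 -9 core.1609.*    # multi-threaded compression; writes a .xz next to the core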
e
Excellent. Thanks for your efforts here. Let's see if the upload works here (DM upload if sensitive). I'm open to any alternative that works for you and your company though.
Ok, @average-australia-85137 provided a core dump and the Docker image it was generated in. His case looks like so:
(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  <signal handler called>
#2  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#3  0x00007f2702c73859 in __GI_abort () at abort.c:79
#4  0x00007f270113008c in mdb_assert_fail (env=0x7ee3ae47a930, expr_txt=expr_txt@entry=0x7f2701199393 "rc == 0", func=func@entry=0x7f2701199728 <__func__.16> "mdb_page_dirty", line=line@entry=2121, 
    file=0x7f2701198e20 "/github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c") at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:1536
#5  0x00007f2700eaae65 in mdb_page_dirty (txn=<optimized out>, txn=<optimized out>, mp=<optimized out>) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:2108
#6  mdb_page_dirty (txn=0x7ee3ae77bbb0, mp=<optimized out>) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:2108
#7  0x00007f2700ead365 in mdb_page_alloc (num=num@entry=1, mp=mp@entry=0x7ee3b518cdf8, mc=<optimized out>, mc=<optimized out>) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:2302
#8  0x00007f2700ead582 in mdb_page_touch (mc=mc@entry=0x7ee3b518d320) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:2489
#9  0x00007f2700eaf14c in mdb_cursor_touch (mc=mc@entry=0x7ee3b518d320) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:6481
#10 0x00007f2700eb267f in mdb_cursor_put (mc=mc@entry=0x7ee3b518d320, key=key@entry=0x7ee3b518d790, data=data@entry=0x7ee3b518dc70, flags=<optimized out>, flags@entry=65552)
    at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:6615
#11 0x00007f2700eb52a2 in mdb_put (txn=0x7ee3ae77bbb0, dbi=2, key=0x7ee3b518d790, data=0x7ee3b518dc70, flags=65552) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/lmdb-sys/lmdb/libraries/liblmdb/mdb.c:8985
#12 0x00007f27009b8a3a in lmdb::transaction::RwTransaction::reserve (self=0x7ee3b518d788, database=..., key=..., len=12232, flags=...) at /github/home/.cargo/git/checkouts/lmdb-rs-369bfd26153a2575/06bdfbf/src/transaction.rs:303
#13 sharded_lmdb::ShardedLmdb::store::{{closure}}::{{closure}}::{{closure}} (txn=...) at /__w/pants/pants/src/rust/engine/sharded_lmdb/src/lib.rs:424
#14 core::result::Result<T,E>::and_then (self=..., op=...) at /rustc/a178d0322ce20e33eac124758e837cbd80a6f633/library/core/src/result.rs:704
#15 sharded_lmdb::ShardedLmdb::store::{{closure}}::{{closure}} () at /__w/pants/pants/src/rust/engine/sharded_lmdb/src/lib.rs:419
#16 task_executor::Executor::spawn_blocking::{{closure}} () at /__w/pants/pants/src/rust/engine/task_executor/src/lib.rs:166
...
And that looks like this: https://www.openldap.org/lists/openldap-devel/201710/msg00019.html
The fix for that came in LMDB 0.9.23: https://git.openldap.org/openldap/openldap/-/blob/LMDB_0.9.23/libraries/liblmdb/CHANGES#L4
We're using a version of Rust lmdb-sys that vendors LMDB 0.9.21: https://github.com/pantsbuild/lmdb-rs/tree/06bdfbfc6348f6804127176e561843f214fc17f8/lmdb-sys
We already have an issue to move off of lmdb and onto lmdb-rkv, which would get us to LMDB 0.9.24 and past this particular LMDB bug: https://github.com/pantsbuild/pants/issues/14115 cc @witty-crayon-22786
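For anyone who wants to double-check the vendored version, a rough sketch against the pinned checkout linked above (the header path is assumed to mirror the mdb.c path in the backtrace):
git clone https://github.com/pantsbuild/lmdb-rs && cd lmdb-rs
git checkout 06bdfbfc6348f6804127176e561843f214fc17f8
grep -E 'MDB_VERSION_(MAJOR|MINOR|PATCH)' lmdb-sys/lmdb/libraries/liblmdb/lmdb.h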
So, thanks Nate. It's not at all clear that this one core dump is representative of all the SIGABRTs you see. We've seen other backtraces, as outlined in https://github.com/pantsbuild/pants/issues/12831, and you could be hitting a mix over time.
w
mm. yea, even more reason to fix that. sorry for the trouble Nate.
i’ll scope that one quickly.
a
Rad! Thanks so much for the really quick response - I'm pretty sure (at least on our end) this is the main SIGABRT we've been seeing (not under load & clearing the cache reliably fixes it)