# general
f
It looks like after upgrading to scie-pants, I'm also experiencing this issue: https://github.com/pantsbuild/pants/issues/18135. I'm going to try disabling the daemon, but I'm not sure that's the best approach?
That issue is closed, but maybe the real fix needs to happen here? https://github.com/pantsbuild/pants/issues/18159
e
That issue ends up having nothing to do with scie-pants (it just more reliably surfaces the underlying issue) and is fixed with modern Pants. What Pants version do you use?
f
currently:
pants_version = "2.14.0"
e
By that issue I mean your 1st link.
Yeah, you are not on the most modern 2.14.
You should upgrade to the latest 2.14.
f
Cool - I'll give that a shot
Is there a specific fix on 2.14.x that should address this?
e
f
(asking so I can update the first linked issue)
e
Yes.
f
Cool, yeah 2.14.2 fixed my problem. Thanks!
Actually @enough-analyst-54434, I'm still encountering this issue in my monorepo even with 2.14.2:
FATAL: exception not rethrown
./pants: line 22: 40681 Aborted                 (core dumped) pants "$@"
Anything else to try? Disable the pants daemon?
e
Disabling the pants daemon is only useful as a diagnostic. Pants really is unusable without pantsd in practice. If you get a core dump, then that is useful to inspect in gdb for the thread backtraces to see what the issue is. Perhaps it's known and fixed, perhaps unknown.
f
Hmm, ok. It's transient and takes a long time to get the error to show up.
I think I need to go back to the pants wrapper vs. `scie-pants`. I never had this problem before I switched over.
e
I assure you that scie-pants is absolutely not the issue, it only surfaces underlying Pants issues.
This is a real Pants bug that aborts.
So, gdbing the core dump is ~the only useful path forward.
f
We've been operating without this error for about 8 months and it just showed up when we moved to `scie-pants`. So, I hear you, but I'm also going to try going back to not using it. If it doesn't happen then, I have to feel like it's related.
In a parallel thread, I can try to get the core dump. I'm not super familiar with Python - any tips/docs you can point me at?
e
This is not a Python thing, it's a ~C thing. Just a sec...
The Pants core is written in Rust.
My Google search: "get thread backtraces using gdb from a core dump". Actually, let me leave it there first. I have no clue what OS you're on. Sometimes locating the core dump file is a battle of its own, and that totally varies by OS / OS configuration.
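For the "locating the core dump" part on Linux specifically, here is a minimal sketch, assuming a Linux host with `/proc` mounted. The kernel's `core_pattern` setting says whether cores are written to a file-name template or piped (leading `|`) to a helper program such as apport, and `RLIMIT_CORE` is the per-process core size limit. The function name `core_destination` is hypothetical, for illustration only.

```python
from __future__ import annotations

import resource
from pathlib import Path

# Linux-only: where the kernel exposes the core-dump naming/piping rule.
CORE_PATTERN = Path("/proc/sys/kernel/core_pattern")


def core_destination(pattern: str) -> tuple[str, str]:
    """Classify a kernel.core_pattern value as ("pipe", helper) or ("file", template)."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # Cores are piped to a helper program; its path is the first word.
        words = pattern[1:].split()
        return ("pipe", words[0] if words else "")
    # Otherwise it is a file-name template (relative names land in the
    # crashing process's working directory).
    return ("file", pattern)


def main() -> None:
    if not CORE_PATTERN.exists():
        print("no /proc/sys/kernel/core_pattern (not Linux?)")
        return
    kind, target = core_destination(CORE_PATTERN.read_text())
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    print(f"core_pattern: {kind} -> {target}")
    # With a file template, a soft limit of 0 usually means no core file is written.
    print(f"RLIMIT_CORE: soft={soft} hard={hard}")


if __name__ == "__main__":
    main()
```

Run it on the CI machine itself (e.g. over SSH), since both values are read from the machine and process running the script.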
f
k - it's a CircleCI linux machine executor
It's going to be interesting to figure out how to attach, because this is the command I'm running:
echo "Previous deployment found: $most_recent_deployed_tag. Deploying only stacks that have changed since then."
pants --changed-since="$most_recent_deployed_tag" --changed-dependees=transitive --filter-tag-regex='^cdk_deploy$' list | parallel -L1 |' -P 1 pants run
e
Ok, so no attach.
A coredump leaves a file.
You gdb the file.
The process is long gone.
f
CCI has an "artifact upload" process that I could use to save the file(s)
e
Sure. So, there is a fork in the road here. I could ask you 20 questions (what OS? Ubuntu. Which version? 22.04. Do you have systemd-coredump set up in that image, or...?), but it's probably most fruitful for you to just dig in and, with the proper keywords now in hand, Google about: 1. getting a coredump file from whatever OS / version / config you use in CircleCI, 2. how to snap the coredump file to some storage, 3. how to use gdb with a coredump file to get a full backtrace for all threads. What say you to these paths?
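For step 3, a sketch of a non-interactive gdb invocation that prints every thread's backtrace from a core file, assuming gdb is installed. `-batch`, `-ex`, and `thread apply all bt` are standard gdb options and commands; the helper names are hypothetical, and the executable and core paths are placeholders you would fill in.

```python
from __future__ import annotations

import subprocess


def gdb_all_threads_argv(executable: str, core: str) -> list[str]:
    """gdb command line that loads `executable` + `core` and backtraces every thread."""
    return [
        "gdb",
        "-batch",                      # run the -ex commands, then exit
        "-ex", "thread apply all bt",  # full backtrace for all threads
        executable,
        core,
    ]


def dump_backtraces(executable: str, core: str) -> str:
    result = subprocess.run(
        gdb_all_threads_argv(executable, core),
        capture_output=True,
        text=True,
        check=False,  # don't raise on a non-zero exit; inspect the output instead
    )
    return result.stdout
```

The executable matters: it should be the program that actually crashed, not a launcher in front of it.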
f
Sure, I can try to take this independent. I appreciate your thoughts on it!
I'm going to first go back away from scie-pants to try to get my client working again
(I know you say it won't fix it, but I'd be remiss not to try it)
e
Sounds good. And, to be super clear on scie-pants: it's just an independent binary that execs the `pants` console script in a venv; IOW it does the equivalent of the bash command `pants.venv/bin/pants $@` and then is gone (since it execs). This is why the issue can never be scie-pants if `pants` runs at all. It can only be an underlying Pants bug.
So if `./pants` works for you but `scie-pants` gives a core dump, the underlying core-dump issue persists and it's just that you got lucky; a real bug is papered over. This was exactly the case with the bug you initially pointed at. An old bug brought to light by scie-pants and finally squashed.
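To make the exec point concrete, here is a Python analogue of what's described (scie-pants itself is a separate binary, so this is not its code). `os.execv` replaces the current process image with the venv's `pants` script, so the launcher is gone once Pants starts. The `pants.venv` path comes from the bash analogy above; the function names are hypothetical.

```python
from __future__ import annotations

import os


def pants_exec_args(venv_dir: str, args: list[str]) -> tuple[str, list[str]]:
    """Return the program path and argv for the venv's `pants` console script."""
    program = os.path.join(venv_dir, "bin", "pants")
    # argv[0] is conventionally the program path; user args pass through untouched.
    return program, [program, *args]


def exec_pants(venv_dir: str, args: list[str]) -> None:
    program, argv = pants_exec_args(venv_dir, args)
    # os.execv replaces this process image in place (same PID). On success it
    # never returns, so no launcher code keeps running alongside Pants.
    os.execv(program, argv)
```

Calling `exec_pants("pants.venv", sys.argv[1:])` would be the counterpart of `pants.venv/bin/pants $@`: whatever crashes afterwards is Pants (or its native engine), not the launcher.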
f
I'm working on trying to stabilize the current state and then I can see if I can pull a coredump
An old bug brought to light by scie-pants and finally squashed.
Sure. It sounds like I'm using the wrong terminology here. From an end-user perspective "scie-pants isn't working" but in reality "maybe pants has been broken this whole time and scie-pants surfaces the issue".
e
Yeah. Unfortunately it's a pretty key distinction for actually debugging the right thing.
f
I tried to get the coredump during an SSH session, but I couldn't find the files. I'll need to do some more digging.
I went back to the script and it's working again 😬 (not erroring / aborting / the "issue" is hidden)
But I think I found a way to reproduce it decently easily, so next week when I'm working with this client again, I'll try to get a coredump to figure out what's up.
Thanks for the pointers! I appreciate your time ❤️
@enough-analyst-54434, I managed to get a coredump from my CI machine; trying to figure out how to load it up in `gdb` now, but making progress!
_home_circleci_.cache_nce_c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d_bindings_venvs_2.14.2_bin_pants.1001.crash
I'm struggling to get gdb to read it in to give me a backtrace, googling some more on it.
Hmm, yeah, even on the machine itself `gdb` can't seem to read it...
Fatal Python error: FATAL: exception not rethrown
./pants: line 22: 10507 Aborted                 (core dumped) pants "$@"
produced
/var/crash/_home_circleci_.cache_nce_c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d_bindings_venvs_2.14.2_bin_pants.1001.crash
But then GDB can't seem to read it. I tried:
gdb /home/circleci/bin/pants -c /var/crash/_home_circleci_.cache_nce_c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d_bindings_venvs_2.14.2_bin_pants.1001.crash

Reading symbols from /home/circleci/bin/pants...
(No debugging symbols found in /home/circleci/bin/pants)
"/var/crash/_home_circleci_.cache_nce_c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d_bindings_venvs_2.14.2_bin_pants.1001.crash" is not a core dump: file format not recognized
I also tried the downstream executable:
gdb /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/bin/pants /var/crash/_home_circleci_.cache_nce_c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d_bindings_venvs_2.14.2_bin_pants.1001.crash
But same error 🤔
any pointers? based on the github issues, it seems like you've worked with this kinda thing a lot 🙂
e
Yeah, definitely not a core dump:
$ file ~/support/pants/BenLimmer/_home_circleci_.cache_nce_c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d_bindings_venvs_2.14.2_bin_pants.1001.crash
/home/jsirois/support/pants/BenLimmer/_home_circleci_.cache_nce_c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d_bindings_venvs_2.14.2_bin_pants.1001.crash: ASCII text, with very long lines (43232)
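What `file` is doing here can be approximated in a few lines: a real core dump on Linux is an ELF file whose `e_type` header field is `ET_CORE` (4), while this `.crash` report starts with plain text. A minimal sketch, assuming an ELF-based Linux core; the function names are hypothetical.

```python
import struct

ELF_MAGIC = b"\x7fELF"
ET_CORE = 4  # e_type value for core files in the ELF spec


def is_elf_core(header: bytes) -> bool:
    """Return True if `header` (the first bytes of a file) starts an ELF core file."""
    if len(header) < 18 or header[:4] != ELF_MAGIC:
        return False
    # e_ident[EI_DATA] at offset 5: 1 = little-endian, 2 = big-endian.
    byte_order = {1: "<", 2: ">"}.get(header[5])
    if byte_order is None:
        return False
    # e_type is a 16-bit field right after the 16-byte e_ident block.
    (e_type,) = struct.unpack_from(byte_order + "H", header, 16)
    return e_type == ET_CORE


def file_is_core_dump(path: str) -> bool:
    with open(path, "rb") as f:
        return is_elf_core(f.read(18))
```

An ELF executable or shared library fails the `ET_CORE` check too, which is why gdb reports "not a core dump" for anything else.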
f
Hmm, it's what `apport` was writing out. Maybe that's not what I wanted?
e
Has a line:
CoreDump: base64
 H4sICAAAAAAC/0NvcmVEdW1wAA==...
So I think you know what you need to do.
That's probably the actual core dump.
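`apport-unpack` (used later in this thread) is the supported way to extract the core. To show what that `CoreDump: base64` field holds, here is a sketch assuming the layout visible in the excerpt: the first indented line, `H4sICAAAAAAC/0NvcmVEdW1wAA==`, decodes to a gzip header carrying the file name `CoreDump`, and the following indented lines are treated as separately base64-encoded pieces of that same gzip stream. The function names are hypothetical.

```python
import base64
import gzip


def extract_apport_field(report: str, key: str = "CoreDump") -> bytes:
    """Decode a "Key: base64" field from an apport .crash report's text."""
    lines = report.split("\n")
    header = f"{key}: base64"
    start = next((i for i, line in enumerate(lines) if line.rstrip() == header), None)
    if start is None:
        raise KeyError(f"no base64 field named {key!r}")
    compressed = bytearray()
    for line in lines[start + 1:]:
        if not line.startswith(" "):
            break  # a non-indented line starts the next field
        # Decode each line on its own: every piece carries its own "=" padding,
        # which is only valid at the end of that piece.
        compressed += base64.b64decode(line.strip())
    # The concatenated pieces form one gzip stream (header, deflate data, trailer).
    return gzip.decompress(bytes(compressed))


def unpack_core(report_path: str, out_path: str) -> None:
    # Base64 lines are ASCII; replace any undecodable bytes in other fields.
    with open(report_path, encoding="utf-8", errors="replace") as src:
        data = extract_apport_field(src.read())
    with open(out_path, "wb") as dst:
        dst.write(data)
```

This reads the whole report into memory, which is fine for a sketch but can be heavy for multi-gigabyte cores.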
f
ah, so maybe some fiddling with the file itself is necessary
e
Basically what you have is most definitely not a core dump.
It's some other wrapped-up thing.
I know 0 about apport.
f
that's a helpful pointer - that helped me figure out what to google
e
Hopefully `file` was even more helpful. Great tool.
f
🙂 cool - I'm just about out of time, but I'll pick this back up next week. Thanks again for your continued help. Hopefully I'll have a backtrace to submit next week
have a good weekend
Well, I'm a bit further...
apport-unpack _home_circleci_.cache_nce_c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d_bindings_venvs_2.14.2_bin_pants.1001.crash unpack
That unpacked it and gave me a coredump, but the backtrace isn't very interesting:
Reading symbols from /home/circleci/bin/pants...
(No debugging symbols found in /home/circleci/bin/pants)
[New LWP 9799]
[New LWP 9770]
[New LWP 9798]
[New LWP 9795]
[New LWP 9764]
[New LWP 9794]
[New LWP 9797]
[New LWP 9796]
[New LWP 9771]
Core was generated by `/home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f4518c96a7c in ?? ()
[Current thread is 1 (LWP 9799)]
(gdb) bt
#0  0x00007f4518c96a7c in ?? ()
#1  0x00007f45151f6c10 in ?? ()
#2  0x00007f45151f6c18 in ?? ()
#3  0x00007f45151f6c28 in ?? ()
#4  0x0000000000000000 in ?? ()
Maybe I need to use a different version of `pants` that has the debugging symbols?
e
You should have a ton of output above the fold - maybe 10-20 items showing file not found - ~all .sos, right?
I did this:
gdb -iex "set solib-search-path /home/jsirois/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/pants/engine/internals:/home/jsirois/.cache/nce/2b6e146234a4ef2a8946081fc3fbfffe0765b80b690425a49ebe40b47c33445b/cpython-3.9.16+20230507-x86_64-unknown-linux-gnu-install_only.tar.gz/python/lib/" ~/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/bin/python3.9 CoreDump
But mind you - I had never done it before, I just read up.
So I just remapped 2 of the ~10 missing .so files, the 2 key ones: python3.9 and the Pants Rust native engine.
@famous-river-94971, the other key is to make sure you're using the scie-pants Python, which I did there.
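The remapping step can be mechanized. A sketch, assuming you copy the `.so` paths from gdb's "Can't open file" warnings and that the crashed host's cache lives under a different home directory than your local copy (e.g. `/home/circleci/` vs. your own): it rewrites the prefix and collects the unique parent directories into a `set solib-search-path` command to hand to gdb's `-iex`. The helper names are hypothetical.

```python
from __future__ import annotations

import os
from typing import Iterable


def remap_solib_dirs(missing_libs: Iterable[str], old_prefix: str, new_prefix: str) -> list[str]:
    """Rewrite crashed-host library paths to local ones; return unique parent dirs in order."""
    dirs: list[str] = []
    for lib in missing_libs:
        if lib.startswith(old_prefix):
            lib = new_prefix + lib[len(old_prefix):]
        parent = os.path.dirname(lib)
        if parent not in dirs:
            dirs.append(parent)
    return dirs


def solib_search_path_command(dirs: Iterable[str]) -> str:
    # gdb takes a colon-separated directory list, suitable for passing via -iex.
    return "set solib-search-path " + ":".join(dirs)
```

Include a trailing slash in `old_prefix` so that, say, `/home/circleci2/` is not rewritten by accident.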
I get, for the main thread:
$ gdb -iex "set solib-search-path /home/jsirois/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/pants/engine/internals:/home/jsirois/.cache/nce/2b6e146234a4ef2a8946081fc3fbfffe0765b80b690425a49ebe40b47c33445b/cpython-3.9.16+20230507-x86_64-unknown-linux-gnu-install_only.tar.gz/python/lib/" ~/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/bin/python3.9 CoreDump
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/jsirois/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/bin/python3.9...

warning: Can't open file /home/circleci/.cache/nce/2b6e146234a4ef2a8946081fc3fbfffe0765b80b690425a49ebe40b47c33445b/cpython-3.9.16+20230507-x86_64-unknown-linux-gnu-install_only.tar.gz/python/bin/python3.9 during file-backed mapping note processing

warning: Can't open file /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/yaml/_yaml.cpython-39-x86_64-linux-gnu.so during file-backed mapping note processing

warning: Can't open file /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/pants/engine/internals/native_engine.so during file-backed mapping note processing

warning: Can't open file /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/ujson.cpython-39-x86_64-linux-gnu.so during file-backed mapping note processing

warning: Can't open file /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/setproctitle.cpython-39-x86_64-linux-gnu.so during file-backed mapping note processing

warning: Can't open file /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/psutil/_psutil_posix.cpython-39-x86_64-linux-gnu.so during file-backed mapping note processing

warning: Can't open file /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/psutil/_psutil_linux.cpython-39-x86_64-linux-gnu.so during file-backed mapping note processing

warning: Can't open file /home/circleci/.cache/nce/2b6e146234a4ef2a8946081fc3fbfffe0765b80b690425a49ebe40b47c33445b/cpython-3.9.16+20230507-x86_64-unknown-linux-gnu-install_only.tar.gz/python/lib/libpython3.9.so.1.0 during file-backed mapping note processing

warning: core file may not match specified executable file.
[New LWP 11737]
[New LWP 11720]
[New LWP 11734]
[New LWP 11721]
[New LWP 11732]
[New LWP 11738]
[New LWP 11736]
[New LWP 11714]
[New LWP 11735]

warning: Could not load shared library symbols for 5 libraries, e.g. /home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b2552903834d/bindings/venvs/2.14.2/lib/python3.9/site-packages/psutil/_psutil_linux.cpython-39-x86_64-linux-gnu.so.
Use the "info sharedlibrary" command to see the complete listing.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/circleci/.cache/nce/c55ee58a557d20bd4b109870e5a01b264c0d501ce817cce29502b'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139738483492416) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f1766ab8640 (LWP 11737))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139738483492416) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139738483492416) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139738483492416, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f176a242476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f176a2287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f176a28945c in __libc_message (action=do_abort, fmt=0x7f176a3db7b1 "%s", fmt=0x7f176a3db7b1 "%s", action=do_abort) at ../sysdeps/posix/libc_fatal.c:155
#6  0x00007f176a289770 in __GI___libc_fatal (message=<optimized out>) at ../sysdeps/posix/libc_fatal.c:164
#7  0x00007f176a29d476 in unwind_cleanup (reason=<optimized out>, exc=<optimized out>) at ./nptl/unwind.c:114
#8  0x00007f176973a5cf in panic_unwind::real_imp::cleanup () at library/panic_unwind/src/gcc.rs:78
#9  panic_unwind::__rust_panic_cleanup () at library/panic_unwind/src/lib.rs:100
#10 0x00007f1768d398da in std::panicking::try::cleanup () at library/std/src/panicking.rs:473
#11 0x00007f1769629f06 in std::panicking::try::do_catch<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()>>, ()> (payload=0x0, data=<optimized out>) at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:517
#12 std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()>>> (f=<error reading variable: Cannot access memory at address 0x0>) at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:456
#13 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()>>, ()> (f=<error reading variable: Cannot access memory at address 0x0>) at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panic.rs:137
#14 std::thread::{impl#0}::spawn_unchecked_::{closure#1}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()> ()
    at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/thread/mod.rs:504
#15 core::ops::function::FnOnce::call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#1}<tokio::runtime::blocking::pool::{impl#4}::spawn_thread::{closure_env#0}, ()>, ()> ()
    at /rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248
#16 0x00007f176972fb33 in alloc::boxed::{impl#44}::call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1951
#17 alloc::boxed::{impl#44}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global>, alloc::alloc::Global> () at library/alloc/src/boxed.rs:1951
#18 std::sys::unix::thread::{impl#2}::new::thread_start () at library/std/src/sys/unix/thread.rs:108
#19 0x00007f176a294b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#20 0x00007f176a326a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)
f
Awesome, thank you for all your help with this! I'll open a ticket on the `pants` repo.
Here's the issue I filed: https://github.com/pantsbuild/pants/issues/19269. Also, I left instructions under the "CircleCI CrashDump Parsing Steps" heading on that bug report in case you need to help future folks using CircleCI in getting backtraces ❤️
e
@famous-river-94971 I think you went on a wild goose chase for an already solved problem unfortunately. The fix you thought was in 2.14.2 never was. It's only in 2.15.1+.
Thanks for being game to dig into the gnarly world of getting a core from CI and using it though.
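For anyone checking their own `pants_version` against that, a small sketch assuming the fix first shipped in 2.15.1, as stated above. The suffix rule is a simplification (not full PEP 440 ordering): anything after `X.Y.Z`, such as `rc1`, is treated as older than the plain `X.Y.Z` release. `includes_fix` is a hypothetical name.

```python
from __future__ import annotations

import re

_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)(.*)")


def includes_fix(version: str, fixed_in: tuple[int, int, int] = (2, 15, 1)) -> bool:
    """True if `version` (e.g. the pants.toml pants_version) is at or past `fixed_in`."""
    match = _VERSION.fullmatch(version.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    numbers = tuple(int(part) for part in match.group(1, 2, 3))
    # Simplification: any suffix (rc1, .dev0, ...) ranks below the plain release.
    is_final = match.group(4) == ""
    return (numbers, is_final) >= (fixed_in, True)
```

By this check, both `2.14.0` and `2.14.2` from this thread predate the fix.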
f
noooooooooo
haha, just kidding 🙂
I learned a lot and really appreciate the time you spent holding my hand through the process.
Plus, silver lining: we've got decent steps documented for future CircleCI users.
e
Well, I will say the latter is not true. It's image-dependent. But, presumably, most folks use whatever image you use (most folks don't fiddle much). The core dump process is not CI-specific; it's OS-specific and varies from macOS versions to Linux distros, and even between versions within the same distro if they switch, say, from something to systemd, etc.
FWIW most of my life is doing what you just did, but with no one holding my hand and when my partner asks me what I did at work I've shortened it now to "dumb stuff" after hearing what she did in the ICU to help a very sick patient.
I spent 3 days doing dumb stuff is so common for me!
f
Well, I will say the latter is not true. Its image dependent.
It's not entirely untrue. Many people will be using the Ubuntu machine images for CircleCI.
But, point taken, it won't work in every case.
e
Yeah:
But, presumably, most folks use whatever image you use (most folks don't fiddle much).
I just finished 4 days of dumb stuff myself learning, re-learning, ... how Sphinx works to write an extension that can generate new doc files. Dumb stuff!
f
Always a good time.