# general
f
Getting an odd error when trying to launch pants (inside a toolbox container):
Failed to launch child `/home/.../.cache/bootstrap-Linux-x86_64/2.6.1rc3_py39/bin/pants`: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
Even going to the resolved bin directory in my shell and running
./pants
fails with the same error. Any idea what might be going on? What's it trying to launch internally there?
w
that is the
pants
script trying to use the venv that it created containing pants
i… am surprised it made it to the point of trying to run if that directory isn’t accessible.
but there isn’t any per-repo state pointing out into that directory, so the pants script would have had to actually make it fairly far.
can add
set -x
to the top of the
pants
script for more info.
f
it happens even in the directory itself
that's not the
./pants
shell script, that's the little python launcher script
❯ ./python -c 'from pants.bin.pants_loader import main; main()'
Failed to launch child `-c`: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
h
Wondering if this might be related to low ulimit, try
ulimit -n 10000
in the container?
f
already higher than that
❯ ulimit -n
524288
btw, it works outside the container; it might be related to how the container mounts work, i'll keep digging on my end
it would be useful to know, in summary, what
pants.bin.pants_loader:main
does
w
all it does is call the
main()
method… it seems like there is likely an issue with the python in the virtualenv.
that
python
is a symlink to some other python
can confirm by trying to run other things with it, and see whether there is anything fishy about how it is installed?
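A minimal sketch of that check, assuming the goal is only to see where the virtualenv's `python` symlink ends up and whether the target can be executed from inside the container. The helper name `describe_interpreter` is made up for illustration and is not part of Pants.

```python
import os
import stat
import sys


def describe_interpreter(path):
    """Report where an interpreter path really points and whether it is executable."""
    resolved = os.path.realpath(path)  # follow every symlink in the chain
    info = {
        "path": path,
        "is_symlink": os.path.islink(path),
        "resolved": resolved,
        "exists": os.path.exists(resolved),
        "executable": os.access(resolved, os.X_OK),
    }
    if info["exists"]:
        # Human-readable mode string, e.g. "-rwxr-xr-x"
        info["mode"] = stat.filemode(os.stat(resolved).st_mode)
    return info


def print_report(path=sys.executable):
    for key, value in describe_interpreter(path).items():
        print(f"{key}: {value}")
```

Calling `print_report("./python")` from the virtualenv's `bin` directory would show whether the symlink resolves to a file that exists and is executable in the container's view of the filesystem.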
f
ah that makes sense...it may have found a python that isn't in the container
well idk,
./python -m pex --help
works fine (in the $VIRTUALENV/bin directory)
seems like it hits during the pants loader
❯ ./python (realpath pants)
Failed to launch child `/home/pants-home/.cache/bootstrap-Linux-x86_64/pants.fRK9AN/install/bin/pants`: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
❯ ./python -m pants
Failed to launch child `/home/pants-home/.cache/bootstrap-Linux-x86_64/pants.fRK9AN/install/lib64/python3.9/site-packages/pants/__main__.py`: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
i tried to strace this but it was way too much info
hmm it's definitely related to launching pantsd
w
interesting. if you disable?
f
./pants --version --no-pantsd
works fine
w
interesting…
f
it stalls waiting for pantsd to start if i try to launch pants more manually...
❯ ps axu | grep pantsd
jreed    2819075  0.0  0.0   6176  2260 pts/0    S+   15:48   0:00 grep --color=auto pantsd
░▒▓    /home/pants-home/.cache/bootstrap-Linux-x86_64/2.6.1rc3_py39/bin ·············································· 15:48:13 ▓▒░
❯ ./pants --version 
Failed to launch child `./pants`: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
░▒▓    /home/pants-home/.cache/bootstrap-Linux-x86_64/2.6.1rc3_py39/bin ·············································· 15:48:24 ▓▒░
❯ ps axu | grep pantsd
jreed    2819188 21.7  0.1 282021188 63384 ?     Sl   15:48   0:00 pantsd [/home/pants-home/.cache/bootstrap-Linux-x86_64/pants.fRK9AN/install/bin]
jreed    2819254  0.0  0.0   6176  2196 pts/0    S+   15:48   0:00 grep --color=auto pantsd
░▒▓    /home/pants-home/.cache/bootstrap-Linux-x86_64/2.6.1rc3_py39/bin ·············································· 15:48:27 ▓▒░
❯ pkill pantsd
░▒▓    /home/pants-home/.cache/bootstrap-Linux-x86_64/2.6.1rc3_py39/bin ·············································· 15:48:35 ▓▒░
❯ ./python -c 'from pants.bin.pants_loader import main; main()'
Argument expected for the -c option
usage: /home/pants-home/.cache/bootstrap-Linux-x86_64/pants.fRK9AN/install/bin/python [option] ... [-c cmd | -m mod | file | -] [arg] ...
Try `python -h' for more information.
15:48:47.92 [INFO] waiting for pantsd to start...

... snip ...

    raise cls.Timeout(
pants.pantsd.process_manager.ProcessManager.Timeout: exceeded timeout of 60 seconds while waiting for pantsd to start
note that the "normal" startup does actually start pantsd... but it can't seem to get access to it or get informed that it's working
i'll disable pantsd in pants.toml for now but this is weird
f
yeah doesn't look too crazy
I remember having problems with pantsd in a container before though; it may be a matter of stuff getting shared between the container and host and not realizing it
toolbox does some funny stuff to make working in a container "seamless" for a good chunk of work, but it definitely has its problems and limitations, especially when you try to share things via mounts
it shares your whole home dir with the container, for example, which certainly makes a few things easier, but at the expense of making some things really confusing, insecure, and contrary to some of its design intentions but ¯\_(ツ)_/¯
w
mmm… yea. that’s pretty trusting of apps to have the right behavior. unclear who is to blame in this case, but the double fork that
pantsd
does tries to determine which args to use on the second fork, and that might be failing: https://github.com/pantsbuild/pants/blob/bfb11d765c566c97d7b90e9bfede291cdc526da1/src/python/pants/pantsd/process_manager.py#L540-L559
in general,
sys.executable
should be the right thing to do there though.
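A rough sketch of what "use `sys.executable` for the second fork" amounts to. This is not the Pants implementation; `build_daemon_command` is a hypothetical helper that only mirrors the shape of the `pantsd command is:` debug line that appears later in this thread (an entrypoint variable, a `PYTHONPATH`, then the interpreter followed by the script and its args).

```python
import os
import sys


def build_daemon_command(entrypoint, script, args):
    """Return (env, argv) for re-launching the current interpreter as a daemon.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ)
    env["PANTS_DAEMON_ENTRYPOINT"] = entrypoint
    # Hand the parent's import path to the child so it resolves the same modules.
    env["PYTHONPATH"] = os.pathsep.join(p for p in sys.path if p)
    # sys.executable is the interpreter running right now (here, the venv's python).
    argv = [sys.executable, script, *args]
    return env, argv
```

If `sys.executable` pointed at something that cannot be executed in the container, the relaunch would fail at this step, which is why it is worth ruling out.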
f
i'll dive deeper into this soon; i'm finally doing a PoC on this repo, and with its size, we will definitely need pantsd
❤️ 2
w
thanks a lot.
f
Trying to run this in a debugger seems almost impossible because of how reliant this launch script is on the args... but debug logs suggest that it fails later, as it seems to succeed after logging the line in that function from
process_manager
❯ ./pants --pantsd -ldebug
11:33:54.37 [DEBUG] acquiring lock: <pants.pantsd.lock.OwnerPrintingInterProcessFileLock object at 0x7ff71e358ac0>
11:33:54.37 [DEBUG] purging metadata directory: /home/jreed/devel/aiven/aiven-core/.pids/3dcc53364e8c/pantsd
11:33:54.37 [DEBUG] Launching pantsd
11:33:54.37 [DEBUG] purging metadata directory: /home/jreed/devel/aiven/aiven-core/.pids/3dcc53364e8c/pantsd
11:33:54.37 [DEBUG] pantsd command is: PANTS_DAEMON_ENTRYPOINT=pants.pantsd.pants_daemon:launch_new_pantsd_instance PYTHONPATH=/home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/pants.GlUYnI/install/bin:/usr/lib64/python39.zip:/usr/lib64/python3.9:/usr/lib64/python3.9/lib-dynload:/home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.6.1rc3_py39/lib64/python3.9/site-packages:/home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.6.1rc3_py39/lib/python3.9/site-packages /home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.6.1rc3_py39/bin/python /home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.6.1rc3_py39/bin/pants --pants-bin-name=./pants --pants-version=2.6.1rc3 --pantsd -ldebug
11:33:55.37 [DEBUG] pantsd is running at pid 2968100, pailgun port is 42157
11:33:55.37 [DEBUG] releasing lock: <pants.pantsd.lock.OwnerPrintingInterProcessFileLock object at 0x7ff71e358ac0>
11:33:55.37 [DEBUG] Connecting to pantsd on port 42157
11:33:55.37 [DEBUG] Connecting to pantsd on port 42157 attempt 1/3
Failed to launch child `/home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.6.1rc3_py39/bin/pants`: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
pants.d/pantsd.log shows a connection:
11:38:23.03 [DEBUG] pantsd running with PID: 2971133
11:38:23.03 [DEBUG] Accepted connection: PollEvented { io: Some(TcpStream { addr: 127.0.0.1:33795, peer: 127.0.0.1:51744, fd: 161 }) }
w
Does the server stay up between runs?
If so, you might be able to attach a debugger or strace to it?
Also, obviously wouldn't recommend it in general but: you can modify pants's code inside of its virtualenv to add debug output...
f
yeah it stays up... but it isn't what fails... it's something in the launcher
and yeah i figure i'll have to modify the launcher to figure out what's going on
didn't get a whole lot of digging done today, but it's not just rootless containers or mounts of the cache file where this isn't working; seems like even with
sudo podman
with no mounts it's a problem. Might be something related to podman or crun or cgroups2 or some other element of my container stack. Next i'll try to reproduce with a systemd-nspawn container, if it's a problem with that then it's probably an issue with cgroups2 or fundamental container privileges
I suspect we're hitting some capability-related issue with the user namespace mechanism my system is using... I can broadly test that by trying things in a privileged container
--privileged seems to have nothing to do with it, but whether installation happens in a dockerfile or not does... ?
FROM ubuntu:latest

RUN useradd -Um --shell /bin/bash jreed

RUN apt-get update \
    && apt-get install -y \
        curl \
        sudo \
        python3-pip \
        python3-virtualenv \
        aptitude


WORKDIR /home/jreed
USER jreed
RUN mkdir test \
    && cd test \
    && printf '[GLOBAL]\npants_version = "2.6.1rc3"\n' > pants.toml \
    && curl -L -o ./pants https://static.pantsbuild.org/setup/pants \
    && chmod +x ./pants
results in a working pants install (doesn't throw the permission denied error). However the same commands run in a container launched via
podman run -it ubuntu bash
(literally copy-pasting these commands, except replacing WORKDIR and USER with
sudo -u jreed -i
) results in the error described above... Any ideas? This is getting weird
hmmm looks like it's definitely the
sudo
itself here that's at issue (same with su or gksu)
okay enough for today, will revisit later; if anyone has any ideas related to this, would appreciate any inputs I can get!
w
…oh, yikes. yes, sudo is a known issue i think…?
pantsd tries to connect directly to the client’s TTY
the error message has apparently gotten worse, but i think that this is: https://github.com/pantsbuild/pants/issues/5664
if that’s the case, i think that we could do a better job here of either falling back if we fail to open it, or … something else. commented on https://github.com/pantsbuild/pants/issues/5664
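To test the TTY theory directly, something like the following could be run as the affected user. It is a sketch under an assumption that the thread does not confirm: that after `sudo`/`su` the terminal device still belongs to the original user, so reopening it by path fails with a permission error even though the inherited descriptors keep working. `tty_status` and `report` are illustrative names.

```python
import os


def tty_status(fd):
    """Return (tty_path, reopen_error) for a file descriptor.

    tty_path is None when fd is not a terminal. reopen_error is None when the
    terminal device can be reopened by path, else the OSError that blocked it.
    """
    if not os.isatty(fd):
        return None, None
    path = "<unknown tty>"
    try:
        path = os.ttyname(fd)
        # Reopening by path is checked against the device's file permissions,
        # unlike the already-open descriptor inherited from the parent.
        os.close(os.open(path, os.O_RDWR | os.O_NOCTTY))
    except OSError as err:
        return path, err
    return path, None


def report():
    lines = []
    for name, fd in (("stdin", 0), ("stdout", 1), ("stderr", 2)):
        path, err = tty_status(fd)
        if path is None:
            lines.append(f"{name}: not a tty")
        elif err is None:
            lines.append(f"{name}: {path} (reopen ok)")
        else:
            lines.append(f"{name}: {path} (reopen failed: {err.strerror})")
    return lines
```

A "reopen failed: Permission denied" line for any of the three streams would match the error code 13 seen in the client output.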
f
It's not sudo in my original area but it may be something with the TTY. Toolbox does some funny things to get user stuff to work, so there may be some issue with that. I'll look at the issue when I get a chance. Thanks for the info!
w
a thing to check is the workaround on that ticket
✔️ 1
f
the
./pants | cat
workaround does not work
jreed@7b07981cb3a6:~/test$ ./pants --version | cat
Failed to launch child `/home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.6.1rc3_py38/bin/pants`: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
I don't think it's the same issue. Pantsd seems to start up, and was logging connections even in one of my cases. Seems to be the main pants process that fails. I still haven't determined exactly where this error happens in the code, mostly because I can't reproduce it consistently in my debugger
I came back to this today and had the same issue with gosu , so I really don't think it's TTY related
also I explored the process tree where I was having trouble with this in my "toolbox" container, and it doesn't seem to have any of this in it, but it's possible that it's simply the way the container runtime is being invoked that causes... whatever this issue is
It's gotta be an issue with my container stack... I'm not encountering this at all outside containers. On the host machine I can run multiple levels of sudo with no problem. I just wish I had more insight as to what exactly the problem was so I could look for a better solution.
Do y'all have any ideas on how to pinpoint exactly where that error message is being generated? I don't see anything in the pants code that looks like it's producing that error message.
w
it looks like an error message from the rust code, which is what is responsible for managing the server’s pantsd socket: https://github.com/pantsbuild/pants/blob/6e044e9816a544a3df229b2cc8ac6e83d6c379a8/src/rust/engine/nailgun/src/server.rs#L270-L328
(there are lots of other potentially relevant codepaths, but that’s an important spot related to connection establishment)
@flat-zoo-31952: if a repro is possible without oodles of framework, i’d be happy to take a look at a ticket?
f
oh that actually makes a lot of sense... the permission is most likely about opening the socket to pantsd (pantsd does actually start running)
w
either on the client or the server, yea… i suspect the server, since it connects back to the TTY of the client
the client only ever opens an actual socket to the server
and i don’t think that there should be permissions errors there.
f
it's the client that dies though, pantsd keeps running
w
yea. the server should keep running after a failed connection attempt in the codepath i linked.
f
ah okay... as for a repro... what are you looking for? it's hard to nail down a minimal case for it, and like I said, it seems related to my container stack (which is pretty different from docker in the way it's implemented; I think it's mostly podman, conmon & crun that are implicated in the path here)
w
mm. yea, if the repro wouldn’t use OSS components, then nevermind.
f
it's all OSS, it's just fedora-33 +
dnf install podman
, it's just that the crun OCI runtime ecosystem is a lot more modular than typical docker setups, and podman is daemonless and rootless, and it's all based on systemd containers, idk it makes my head spin a bit
w
are you able to build custom Pants binaries to use here?
f
yeah I should be able to, might need some pointers, but I suppose I can modify that rust code you linked and rebuild
w
ok… the fact that this doesn’t mention a filename, or any other context makes this a very whack-a-mole situation (there is a very old ticket open for rust about including the path by default, and it’s one of those situations where they came down on the side of performance rather than usability, unfortunately.)
f
it mentions a filename...
jreed@aa110e26cc53:~/test$ ./pants -ldebug
20:28:48.87 [DEBUG] acquiring lock: <pants.pantsd.lock.OwnerPrintingInterProcessFileLock object at 0x7f91398edd00>
20:28:48.87 [DEBUG] purging metadata directory: /home/jreed/test/.pids/7ac24caa98f2/pantsd
20:28:48.87 [DEBUG] Launching pantsd
20:28:48.87 [DEBUG] purging metadata directory: /home/jreed/test/.pids/7ac24caa98f2/pantsd
20:28:48.87 [DEBUG] pantsd command is: PANTS_DAEMON_ENTRYPOINT=pants.pantsd.pants_daemon:launch_new_pantsd_instance PYTHONPATH=/home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/pants.ZVUioG/install/bin:/usr/lib/python38.zip:/usr/lib/python3.8:/usr/lib/python3.8/lib-dynload:/home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/lib/python3.8/site-packages /home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/bin/python /home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/bin/pants --pants-bin-name=./pants --pants-version=2.7.0 -ldebug
20:28:50.07 [DEBUG] pantsd is running at pid 46, pailgun port is 46277
20:28:50.07 [DEBUG] releasing lock: <pants.pantsd.lock.OwnerPrintingInterProcessFileLock object at 0x7f91398edd00>
20:28:50.07 [DEBUG] Connecting to pantsd on port 46277
20:28:50.07 [DEBUG] Connecting to pantsd on port 46277 attempt 1/3
Failed to launch child `/home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/bin/pants`: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
but that file is just the python entrypoint script for the pants module
it looks more like a fork failing
w
the
Failed to launch child
error is generic… i’m not sure where that is coming from
🤔 1
oh: i think i asked this before, but: if you’re able to attach
strace
to
pantsd
before the client connects, that might shine some light
f
how would i do that.... by the time the PID gets output I think the connect is already happening
w
if
pantsd
stays up, you can launch it, attach, then try again
AH. ‘Failed to launch child’ is not generic. i know where that is coming from.
it comes from the server: and it has nothing to do with forking in this case… it’s generic to the nailgun server library that we use: https://github.com/stuhood/nails/blob/ef93c3ffd701c0fc9e33916d31a44006f6ce51eb/nails/src/server.rs#L106-L116
f
also this is interesting...
jreed@aa110e26cc53:~/test$ strace -p 46
strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Operation not permitted
strace: Could not attach to process. If your uid matches the uid of the target process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf: Operation not permitted
strace: attach: ptrace(PTRACE_ATTACH, 46): Operation not permitted
w
yea, you would need to change your strace/ptrace settings.
…but yea, this is almost certainly this code: https://github.com/pantsbuild/pants/blob/6e044e9816a544a3df229b2cc8ac6e83d6c379a8/src/rust/engine/nailgun/src/server.rs#L270-L328 i can open an issue to try and fall back in that case.
f
my ptrace setting is 0
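For reference, the Yama `ptrace_scope` values can be read and decoded as below. With a value of 0, Yama itself should not be what blocks the attach; inside a container the denial may instead come from the runtime's seccomp profile or a missing `CAP_SYS_PTRACE` capability, though that is a guess and not something established in this thread.

```python
from pathlib import Path

YAMA_SCOPES = {
    0: "classic: a process may trace other processes running under the same uid",
    1: "restricted: only descendants (or explicitly permitted tracers) may be traced",
    2: "admin-only: tracing requires CAP_SYS_PTRACE",
    3: "disabled: no process may attach with ptrace",
}


def read_ptrace_scope(path="/proc/sys/kernel/yama/ptrace_scope"):
    """Return (value, description), or (None, reason) if the setting cannot be read."""
    try:
        value = int(Path(path).read_text().strip())
    except (OSError, ValueError) as err:
        return None, f"unavailable: {err}"
    return value, YAMA_SCOPES.get(value, "unknown value")
```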
w
i think that you can skip that: it’s definitely this codepath.
f
strace doesn't show anything trying to talk to pantsd...
strace: Process 1828522 attached
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=880102}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=1, tv_usec=0}) = 0 (Timeout)
...
That same empty select loop goes on forever. Subsequent child pants processes don't ever get to connecting to the socket it seems:
{ code: 13, kind: PermissionDenied, message: "Permission denied" }
jreed@aa110e26cc53:~/test$ ./pants -ldebug --version
23:48:56.42 [DEBUG] acquiring lock: <pants.pantsd.lock.OwnerPrintingInterProcessFileLock object at 0x7fb7eab99b80>
23:48:56.42 [DEBUG] releasing lock: <pants.pantsd.lock.OwnerPrintingInterProcessFileLock object at 0x7fb7eab99b80>
23:48:56.42 [DEBUG] Connecting to pantsd on port 42355
23:48:56.42 [DEBUG] Connecting to pantsd on port 42355 attempt 1/3
Failed to launch child `/home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/bin/pants`: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
what kind of socket does the client open to the server? seems like some kinda unix domain socket? there are some weird permissions things that go on with my containers, and i'm beginning to suspect more and more that there's something bizarre happening there
no it looks like it's TCP port...which i can connect to just with
nc -v localhost 42355
and even with that the strace output of the pantsd process doesn't change. Seems really surprising. I guess tomorrow I'll open a formal ticket with the full repro steps and environment. I'm also open to rebuilding pants as a debugging step
w
Just TCP.
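Since the client-to-server link is plain TCP, the `nc -v localhost <port>` check can also be scripted. A small sketch (the function name `can_connect` is illustrative); a successful connect here only shows that the port is reachable, not that the rest of the nailgun exchange would succeed.

```python
import socket


def can_connect(port, host="127.0.0.1", timeout=2.0):
    """Return (True, None) if a TCP connection succeeds, else (False, error)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as err:
        return False, err
```

For example, `can_connect(42355)` would probe the pailgun port printed in the debug log above.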
f
i'm no expert at reading strace output, but shouldn't the select call have a reference to the fds it's reading/writing?
int select(int nfds, fd_set *readfds, fd_set *writefds,
                  fd_set *exceptfds, struct timeval *timeout);
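On that question: `select(0, NULL, NULL, NULL, timeout)` passes no descriptor sets at all, so nothing can become ready and the call can only time out. It is a common idiom for a sleep with sub-second precision. The traced thread was therefore most likely just sleeping; `strace -p` without `-f` follows only the one thread, so socket activity on other threads would not show up. A POSIX-only sketch of the same idiom:

```python
import select
import time


def select_sleep(seconds):
    """Sleep by calling select() with no file descriptors, as in the strace output."""
    start = time.monotonic()
    # Empty read/write/except sets: nothing can become ready, so this only times out.
    ready = select.select([], [], [], seconds)
    return ready, time.monotonic() - start
```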
w
I don't think you need to: this is absolutely the code I linked above. I can have it fall back when it fails to open the TTY
😊 1
f
If it's this code, do you have an idea why the
| cat
workaround doesn't work for me?
w
might need
2>&1
to also pipe stderr to stdout
f
no dice
jreed@aa110e26cc53:~/test$ ./pants --version 2>&1 | cat
Failed to launch child `/home/jreed/.cache/pants/setup/bootstrap-Linux-x86_64/2.7.0_py38/bin/pants`: Os { code: 13, kind: PermissionDenied, message: "Permission denied" }
hey, it works with stdin, apparently 😄
jreed@aa110e26cc53:~/test$ echo | ./pants --version 2>&1 | cat
16:18:25.21 [INFO] Initializing scheduler...
16:18:25.25 [INFO] Scheduler initialized.
16:18:25.27 [WARN] Please either set `enabled = true` in the [anonymous-telemetry] section of pants.toml to enable sending anonymous stats to the Pants project to aid development, or set `enabled = false` to disable it. No telemetry sent for this run. An explicit setting will get rid of this message. See https://www.pantsbuild.org/v2.7/docs/anonymous-telemetry for details.
2.7.0
w
huzzah 😃
f
so i just have to shield it from anything that might be a tty 🛡️
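A sketch of that shielding as a small wrapper, for anyone who wants it in script form. It detaches all three standard streams from the terminal, roughly what `echo | ./pants ... 2>&1 | cat` does. One difference: stdin here is `/dev/null` (immediate EOF) rather than a pipe carrying a newline, and this variant has not been tried against Pants itself. `run_without_tty` is an illustrative name.

```python
import subprocess
import sys


def run_without_tty(argv):
    """Run argv with stdin, stdout and stderr all detached from any terminal."""
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,   # stands in for `echo |`
        stdout=subprocess.PIPE,     # stands in for `| cat`
        stderr=subprocess.STDOUT,   # stands in for `2>&1`
        text=True,
        check=False,
    )
```

Usage would look like `print(run_without_tty(["./pants", "--version"]).stdout)`.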
w
yea, and i think that this codepath can definitely have fallback. sorry for all of the trouble there
f
meh y'all kick ass, I don't know any other dev teams I can bug about this kinda stuff for weeks and get them to care about it
👏 1
😊 1
Seriously, the level of transparency and communication here gives me immense confidence in pants as a tool, because I have confidence in the team behind it
h
Can we quote you on that? 🙂
f
haha sure
b
@flat-zoo-31952 if you would, tweet it. As you mentioned the other day, there's nothing so compelling as an experience report by an actual peer.
f
Sadly I don't have the constitution to be active on Twitter 😢
b
lol. Understandable.
w
@flat-zoo-31952: are you likely to be on 2.7.x for a while? can cherry-pick this.
f
Yeah probably. But I also have a decent workaround now
w
there is a 2.7.1 release in the pipeline anyway: will get it picked.
🙏🏻 1
🍒 1