PR for 2.16.0rc2 prep: <https://github.com/pantsbu...
# development
h
b
h
It can wait for rc3?
b
I was hoping to get my devs on rc2 😕
That's ok, I can put a filter in the log for this line
h
Gonna cut rc3 pretty soon I hope
I have to fix some stuff around tool lockfiles
b
okie dokie
w
https://github.com/pantsbuild/pants/pull/18979 is also bound for `2.16.x`, but seems to be reproducibly hitting a `git` failure on ARM: https://github.com/pantsbuild/pants/actions/runs/4954228969/jobs/8862538892?pr=18979 … not sure if that’s something we have a fix for on `main`: i haven’t seen it before.
…oh. your release prep is too. sigh.
the other wheel shard failures are due to https://pantsbuild.slack.com/archives/C0D7TNJHL/p1683832854711259 … i’ll get a fix out for those.
h
Yeah, I don't know what that github issue is, but since it's consistent, and only on the aarch64 shard, it must be that machine. I'll ssh into it later and poke around.
w
i’ve just fixed the macOS shards.
but yea, “tag you’re it” on ARM
h
It's a mystery so far
But likely something to do with networking inside the container
but that doesn't affect the x86 linux wheel shard, which also builds in a container...
I can at least now repro this
👍 1
Have isolated the problem to the bridge network:
$ docker run -ti ghcr.io/pantsbuild/wheel_build_aarch64:v3-8384c5cf bash
> curl -L https://cnn.com  # fetches the CNN home page
$ docker network create mynet
$ docker run --network mynet -ti ghcr.io/pantsbuild/wheel_build_aarch64:v3-8384c5cf bash
> curl -L https://cnn.com  # blocks forever
(github creates and uses a bridge network in the same way when you run an action in a container)
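The repro above can be wrapped so that the hanging case fails fast instead of blocking. This is a minimal sketch, assuming `curl` is available in the image (it is used above); the `check_net` helper name and the 10-second timeout are illustrative and not from the thread.

```shell
# check_net NETWORK IMAGE
# Hypothetical helper: runs curl inside a container attached to NETWORK and
# reports whether an outbound HTTPS fetch completes within 10 seconds.
check_net() {
  if docker run --rm --network "$1" "$2" \
      curl -sS -L --max-time 10 -o /dev/null https://cnn.com; then
    echo "ok: $1"
  else
    echo "FAILED: $1"
    return 1
  fi
}
```

Something like `check_net bridge ghcr.io/pantsbuild/wheel_build_aarch64:v3-8384c5cf` followed by the same call with `mynet` would then compare the default bridge against the user-defined one.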
w
hm. so, `docker` was recently upgraded on that machine, which was what caused it to lose `docker-init`, and need another package installed to be able to access `docker.io`
h
seems related
did you reboot after that installation?
w
i did not, no.
h
I'll give that a try
Hopefully the changes I made a few days ago will allow it to come back clean from the reboot
it did come up, but the reboot did not fix
Looks like the docker daemon is not creating the relevant iptables entries for the network, even though it should be doing so by default and its config file has not turned that off afaict
w
hm… actually. it looks like whatever you did recently resulted in the `docker.io` package getting removed again as well.
h
Ugh, maybe it's more than that: manually creating the MASQUERADE nat rule doesn't fix it
WTF?
All I did was reboot
how did you install docker.io originally?
w
`apt install docker.io`
h
It is installed, AFAICT
Did you just reinstall it?
w
hm. a shard that i just ran failed with the error we were getting before it was installed: https://github.com/pantsbuild/pants/actions/runs/4961310945/jobs/8877946260?pr=18995
`apt-file search docker-init` suggested that that file came from `docker.io`, but perhaps they moved it again…
apt-file update && apt-file search docker-init
h
that file is present
w
hm. yes.
and there is only one ARM worker?
h
Yep, it's that huge beefy machine with 80-something cores
w
i’ve managed to repro the `docker-init` issue using the docker CLI, so i think that it might be a docker daemon configuration issue
@happy-kitchen-89482: have you been editing the daemon’s config?
h
Nope
OK, I think the iptables thing is a red herring. Looks like when using the custom bridge network, the `nameserver` in the container's `/etc/resolv.conf` is wrong
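One way to check this is to pull the `nameserver` entries out of the container's `/etc/resolv.conf` and compare them across networks. A small sketch; the `nameservers` helper is hypothetical. On a user-defined bridge network Docker normally points containers at its embedded DNS server, 127.0.0.11, so any other value there is worth a closer look.

```shell
# nameservers [FILE]
# Hypothetical helper: prints each nameserver address listed in a
# resolv.conf-style file (defaults to /etc/resolv.conf), one per line.
nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

For example, the output of `docker run --rm --network mynet <image> cat /etc/resolv.conf` could be saved to a file and passed to `nameservers`.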
w
@happy-kitchen-89482: i think that i need to edit `/lib/systemd/system/docker.service` and restart `dockerd` to fix this… is now a reasonable time?
h
To fix which part?
the docker-init part?
Sure, go ahead
I've filed a ticket with equinix, because something is very off with the docker daemon and networking
w
the `docker-init` issue. basically: the daemon is expecting `docker-init` on its PATH, but it isn’t there… it defaults to looking for it at `/usr/libexec/docker-init`, but `dockerd` can be instructed to look for it elsewhere with `dockerd --init-path`. but… wth.
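Before pointing `--init-path` anywhere, it helps to see which candidate location actually holds an executable. A rough sketch; `first_executable` is a hypothetical helper, and the candidate paths are only the ones mentioned in this thread.

```shell
# first_executable PATH...
# Hypothetical helper: prints the first argument that is an executable
# regular file (not a directory); returns non-zero if none qualifies.
first_executable() {
  for candidate in "$@"; do
    if [ -x "$candidate" ] && [ ! -d "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}
```

For instance: `first_executable /usr/libexec/docker-init /usr/bin/docker-init`.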
h
it's not modifying the iptables AND it's not setting the nameserver correctly in the container
I think the docker daemon on that machine is just effed
w
yea. seems like a botched package or OS upgrade
h
so go ahead, I'm waiting for their support to chime in, I'm out of ideas
So that was a fun 5 hours of linux network debugging, takes me back to my Checkpoint days
😓 1
w
yea. back to the time before “immutable infrastructure”
uh… there are two `dockerd`s:
# ps -ef | grep dockerd
root        2505       1  0 16:28 ?        00:00:24 dockerd --group docker --exec-root=/run/snap.docker --data-root=/var/snap/docker/common/var-lib-docker --pidfile=/run/snap.docker/docker.pid --config-file=/var/snap/docker/2857/config/daemon.json
root       97141       1  0 18:16 ?        00:00:01 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --init-path /usr/bin/docker-init
… that doesn’t seem good?
🤯 1
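A quick guard against this situation is to count the `dockerd` processes in `ps -ef` output. A minimal sketch; `count_dockerd` is a hypothetical helper that reads `ps -ef` text on stdin and assumes the command is the eighth column, as in the listing above.

```shell
# count_dockerd
# Hypothetical helper: reads `ps -ef` output on stdin and prints how many
# processes have an executable whose basename is exactly "dockerd".
count_dockerd() {
  awk '{
    n = split($8, parts, "/")      # $8 is the CMD column in ps -ef output
    if (parts[n] == "dockerd") count++
  } END { print count + 0 }'
}
```

Run as `ps -ef | count_dockerd`; a result above 1 suggests that two installations (for example snap and apt) are both running a daemon.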
h
yeesh
That definitely seems bad
I guess we kill the snap one?
w
yea… we have two copies. i wonder if that’s my fault due to attempting to install `docker.io` with `apt` yesterday.
h
Oh, maybe
grrr
Have you uninstalled /usr/bin/docker{d}?
w
i ran `apt remove docker.io`, so only the `snap` version should be installed now.
i don’t know where it actually lives, or how it launches.
but the `docker` cli is still there and still connects
h
Different docker cli: `/snap/bin/docker` instead of `/usr/bin/docker`
Still seeing the network issues, but it might be because configs got tangled up or whatever
I will reboot again, just to reset some state
👍 1
w
and i’m still seeing the `docker-init` issue, but now i know that my issue is with `snap` and not with `apt` =(
sure.
h
woohoo!
networking issue appears to be solved
gonna try something
OK, https://github.com/pantsbuild/pants/actions/runs/4954317612/jobs/8879527672 is now past the network problems and actually building wheels
w
huzzah
i’m losing my mind on this `docker-init` issue though. have no idea what should be installing it, as searching inside of `snap` packages is apparently not a thing
@happy-kitchen-89482: if i were to remove the `snap` version and swap to the `apt` version, would it re-break your networking fix?
i’ve burned enough time on this today to consider just disabling those tests on ARM. looking at how to do that too.
h
I didn't do anything to fix networking other than reboot after killing the apt dockerd
The two dockerds were stepping on each other's toes in the iptables, looks like
so I think switching to apt is probably fine
w
ok.
that would be my preference, vs cherry-picking test skips everywhere. will try it.
yea, `snap remove docker && apt install docker.io` worked. ffs.
i’m sure that it broke networking, but.
omg. now that everything else is working i’m seeing a very frequent “bus error” on the ARM shards. this is driving me nuts.
@happy-kitchen-89482: do you do anything in particular to restart the machine, or just `reboot`?
h
I reboot from the equinix web console, but I assume `reboot` works just as well
w
yea, it came back up fine afaict. didn’t help with the bus error unfortunately.
i’m done trying for the day.
h
Yeah, I'm done debugging docker on ARM for today