#development
h

happy-kitchen-89482

05/12/2023, 1:51 AM
b

bitter-ability-32190

05/12/2023, 1:55 AM
h

happy-kitchen-89482

05/12/2023, 2:04 AM
It can wait for rc3?
b

bitter-ability-32190

05/12/2023, 2:04 AM
I was hoping to get my devs on rc2 😕
That's ok, I can put a filter in the log for this line
h

happy-kitchen-89482

05/12/2023, 2:12 AM
Gonna cut rc3 pretty soon I hope
I have to fix some stuff around tool lockfiles
b

bitter-ability-32190

05/12/2023, 2:12 AM
okie dokie
w

witty-crayon-22786

05/12/2023, 3:24 AM
https://github.com/pantsbuild/pants/pull/18979 is also bound for `2.16.x`, but seems to be reproducibly hitting a `git` failure on ARM: https://github.com/pantsbuild/pants/actions/runs/4954228969/jobs/8862538892?pr=18979 … not sure if that’s something we have a fix for on `main`: i haven’t seen it before.
…oh. your release prep is too. sigh.
the other wheel shard failures are due to https://pantsbuild.slack.com/archives/C0D7TNJHL/p1683832854711259 … i’ll get a fix out for those.
h

happy-kitchen-89482

05/12/2023, 3:51 AM
Yeah, I don't know what that github issue is, but since it's consistent, and only on the aarch64 shard, it must be that machine. I'll ssh into it later and poke around.
w

witty-crayon-22786

05/12/2023, 3:52 AM
i’ve just fixed the macOS shards.
but yea, “tag you’re it” on ARM
h

happy-kitchen-89482

05/12/2023, 3:15 PM
It's a mystery so far
But likely something to do with networking inside the container
but that doesn't affect the x86 linux wheel shard, which also builds in a container...
I can at least now repro this
👍 1
Have isolated the problem to the bridge network:
$ docker run -ti ghcr.io/pantsbuild/wheel_build_aarch64:v3-8384c5cf bash
> curl -L https://cnn.com  # fetches the CNN home page
$ docker network create mynet
$ docker run --network mynet -ti ghcr.io/pantsbuild/wheel_build_aarch64:v3-8384c5cf bash
> curl -L https://cnn.com  # blocks forever
(github creates and uses a bridge network in the same way when you run an action in a container)
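A repro like the one above is easier to script if the probe is bounded by a timeout, so that a blocked `curl` shows up as a result instead of stalling the shell. This is a minimal sketch, assuming GNU coreutils `timeout` is available on the host; `probe_with_timeout` is a hypothetical helper name, not something from Pants or Docker.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command under a time limit and report the outcome.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the limit is hit.
probe_with_timeout() {
  local seconds="$1"
  shift
  timeout "$seconds" "$@" >/dev/null 2>&1
  local rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "ok"
  elif [ "$rc" -eq 124 ]; then
    echo "timed out"
  else
    echo "failed ($rc)"
  fi
}
```

For the case in this thread, something like `probe_with_timeout 15 docker run --rm --network mynet ghcr.io/pantsbuild/wheel_build_aarch64:v3-8384c5cf curl -sSL https://cnn.com` would be expected to print `timed out` on the broken machine, assuming the image's `curl` behaves as shown above.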
w

witty-crayon-22786

05/12/2023, 4:23 PM
hm. so, `docker` was recently upgraded on that machine, which was what caused it to lose `docker-init`, and need another package installed to be able to access docker.io
h

happy-kitchen-89482

05/12/2023, 4:23 PM
seems related
did you reboot after that installation?
w

witty-crayon-22786

05/12/2023, 4:23 PM
i did not, no.
h

happy-kitchen-89482

05/12/2023, 4:24 PM
I'll give that a try
Hopefully the changes I made a few days ago will allow it to come back clean from the reboot
it did come up, but the reboot did not fix it
Looks like the docker daemon is not creating the relevant iptables entries for the network, even though it should be doing so by default and its config file has not turned that off afaict
w

witty-crayon-22786

05/12/2023, 5:19 PM
hm… actually. it looks like whatever you did recently resulted in the `docker.io` package getting removed again as well.
h

happy-kitchen-89482

05/12/2023, 5:19 PM
Ugh, maybe it's more than that: manually creating the MASQUERADE nat rule doesn't fix it
WTF?
All I did was reboot
how did you install docker.io originally?
w

witty-crayon-22786

05/12/2023, 5:20 PM
`apt install docker.io`
h

happy-kitchen-89482

05/12/2023, 5:21 PM
It is installed, AFAICT
Did you just reinstall it?
w

witty-crayon-22786

05/12/2023, 5:21 PM
hm. a shard that i just ran failed with the error we were getting before it was installed: https://github.com/pantsbuild/pants/actions/runs/4961310945/jobs/8877946260?pr=18995
`apt-file search docker-init` suggested that that file came from `docker.io` but perhaps they moved it again…
apt-file update && apt-file search docker-init
h

happy-kitchen-89482

05/12/2023, 5:23 PM
that file is present
w

witty-crayon-22786

05/12/2023, 5:28 PM
hm. yes.
and there is only one ARM worker?
h

happy-kitchen-89482

05/12/2023, 5:28 PM
Yep, it's that huge beefy machine with 80-something cores
w

witty-crayon-22786

05/12/2023, 5:49 PM
i’ve managed to repro the `docker-init` issue using the docker CLI, so i think that it might be a docker daemon configuration issue
@happy-kitchen-89482: have you been editing the daemon’s config?
h

happy-kitchen-89482

05/12/2023, 5:52 PM
Nope
OK, I think the iptables thing is a red herring, looks like when using the custom bridge network the `nameserver` in the container's `/etc/resolv.conf` is wrong
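One way to check that observation is to pull the `nameserver` entries out of the container's `/etc/resolv.conf` and compare them with what is expected. On a user-defined bridge network Docker normally points containers at its embedded DNS server, `127.0.0.11`, so anything else there is worth a second look. A minimal sketch; `list_nameservers` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read resolv.conf text on stdin and print
# the address from each "nameserver" line, one per line.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }'
}
```

It could be fed from a container on the custom network, for example `docker run --rm --network mynet ghcr.io/pantsbuild/wheel_build_aarch64:v3-8384c5cf cat /etc/resolv.conf | list_nameservers`, assuming the image ships `cat`.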
w

witty-crayon-22786

05/12/2023, 6:09 PM
@happy-kitchen-89482: i think that i need to edit `/lib/systemd/system/docker.service` and restart `dockerd` to fix this… is now a reasonable time?
h

happy-kitchen-89482

05/12/2023, 6:10 PM
To fix which part?
the docker-init part?
Sure, go ahead
I've filed a ticket with equinix, because something is very off with the docker daemon and networking
w

witty-crayon-22786

05/12/2023, 6:11 PM
the `docker-init` issue. basically: the daemon is expecting `docker-init` on its PATH, but it isn’t there… it defaults to looking for it at `/usr/libexec/docker-init`, but `dockerd` can be instructed to look for it elsewhere with `dockerd --init-path`. but… wth.
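To find out which `docker-init` the daemon could be pointed at, it can help to check a few candidate locations for an executable file. A minimal sketch; `first_executable` is a hypothetical helper name, and the candidate paths in the usage note are only the two mentioned in this thread, not an exhaustive list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first argument that is a regular,
# executable file. Returns non-zero if none of the candidates qualify.
first_executable() {
  local candidate
  for candidate in "$@"; do
    if [ -f "$candidate" ] && [ -x "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}
```

For example, `first_executable /usr/libexec/docker-init /usr/bin/docker-init` prints whichever of the two exists, and that path is what `dockerd --init-path` would take.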
h

happy-kitchen-89482

05/12/2023, 6:11 PM
it's not modifying the iptables AND it's not setting the nameserver correctly in the container
I think the docker daemon on that machine is just effed
w

witty-crayon-22786

05/12/2023, 6:12 PM
yea. seems like a botched package or OS upgrade
h

happy-kitchen-89482

05/12/2023, 6:12 PM
so go ahead, I'm waiting for their support to chime in, I'm out of ideas
So that was a fun 5 hours of linux network debugging, takes me back to my Checkpoint days
😓 1
w

witty-crayon-22786

05/12/2023, 6:13 PM
yea. back to the time before “immutable infrastructure”
uh… there are two `dockerd`s:
# ps -ef | grep dockerd
root        2505       1  0 16:28 ?        00:00:24 dockerd --group docker --exec-root=/run/snap.docker --data-root=/var/snap/docker/common/var-lib-docker --pidfile=/run/snap.docker/docker.pid --config-file=/var/snap/docker/2857/config/daemon.json
root       97141       1  0 18:16 ?        00:00:01 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --init-path /usr/bin/docker-init
… that doesn’t seem good?
🤯 1
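A check like the one above can be scripted so that a second daemon is noticed early. This is a minimal sketch, assuming the usual `ps -ef` layout where the command starts in the eighth column; `list_dockerd_processes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ps -ef` output on stdin and print "PID COMMAND"
# for every process whose executable name is exactly dockerd.
# Matching on the executable name skips lines such as `grep dockerd`.
list_dockerd_processes() {
  awk '{
    n = split($8, parts, "/")
    if (parts[n] == "dockerd") print $2, $8
  }'
}
```

Running `ps -ef | list_dockerd_processes` and seeing more than one line would suggest two daemons are up at once, as with the snap and apt copies here.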
h

happy-kitchen-89482

05/12/2023, 6:25 PM
yeesh
That definitely seems bad
I guess we kill the snap one?
w

witty-crayon-22786

05/12/2023, 6:26 PM
yea… we have two copies. i wonder if that’s my fault due to attempting to install `docker.io` with `apt` yesterday.
h

happy-kitchen-89482

05/12/2023, 6:27 PM
Oh, maybe
grrr
Have you uninstalled /usr/bin/docker{d}?
w

witty-crayon-22786

05/12/2023, 6:30 PM
i ran `apt remove docker.io`, so only the `snap` version should be installed now.
i don’t know where it actually lives, or how it launches.
but the `docker` cli is still there and still connects
h

happy-kitchen-89482

05/12/2023, 6:31 PM
Different docker cli
`/snap/bin/docker` instead of `/usr/bin/docker`
Still seeing the network issues, but it might be because configs got tangled up or whatever
I will reboot again, just to reset some state
👍 1
w

witty-crayon-22786

05/12/2023, 6:34 PM
and i’m still seeing the `docker-init` issue, but now i know that my issue is with `snap` and not with `apt` =(
sure.
h

happy-kitchen-89482

05/12/2023, 6:38 PM
woohoo!
networking issue appears to be solved
gonna try something
OK, https://github.com/pantsbuild/pants/actions/runs/4954317612/jobs/8879527672 is now past the network problems and actually building wheels
w

witty-crayon-22786

05/12/2023, 7:25 PM
huzzah
i’m losing my mind on this `docker-init` issue though. have no idea what should be installing it, as searching inside of `snap` packages is apparently not a thing
@happy-kitchen-89482: if i were to remove the `snap` version and swap to the `apt` version, would it re-break your networking fix?
i’ve burned enough time on this today to consider just disabling those tests on ARM. looking at how to do that too.
h

happy-kitchen-89482

05/12/2023, 8:00 PM
I didn't do anything to fix networking other than reboot after killing the apt dockerd
The two dockerds were stepping on each other's toes in the iptables, looks like
so I think switching to apt is probably fine
w

witty-crayon-22786

05/12/2023, 8:01 PM
ok.
that would be my preference, vs cherry-picking test skips everywhere. will try it.
yea, `snap remove docker && apt install docker.io` worked. ffs.
i’m sure that it broke networking, but.
omg. now that everything else is working i’m seeing a very frequent “bus error” on the ARM shards. this is driving me nuts.
@happy-kitchen-89482: do you do anything in particular to restart the machine, or just `reboot`?
h

happy-kitchen-89482

05/12/2023, 10:28 PM
I reboot from the equinix web console, but I assume `reboot` works just as well
w

witty-crayon-22786

05/12/2023, 10:29 PM
yea, it came back up fine afaict. didn’t help with the bus error unfortunately.
i’m done trying for the day.
h

happy-kitchen-89482

05/12/2023, 11:56 PM
Yeah, I'm done debugging docker on ARM for today