# development
w
thread for the docker mount point race: https://github.com/pantsbuild/pants/issues/18162
the summary of the problem is essentially that processes being `exec`’d inside an existing container may not be able to observe all of the inputs in their sandboxes (the parent directory of which is mounted once per-container)
a few potential mitigations:
• don’t (re)use cached containers
  ◦ on my older macOS machine, it takes 750ms to start a relatively slim image, so this is probably a non-starter.
• dynamically add / remove bind mounts on the running container
  ◦ probably feasible, but only helps address the problem if the filesystem consistency issue would not exhibit for bind mounts created after all of the input files exist… which might not be the case if the issue is actually in some code running outside of the container not being synchronized
• tar-piping process inputs into the container, rather than using bind mounts
  ◦ would prevent use of immutable inputs, which would be annoying, and probably not be quite as efficient…?
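the tar-pipe idea can be sketched roughly as below. this is a sketch only: `$CONTAINER` and `/sandbox` are hypothetical names, and the docker invocation is shown as a comment, with the same pipe demonstrated locally between two directories:

```shell
# Sketch of the tar-pipe mitigation (names/paths hypothetical). Instead
# of relying on a bind mount, the sandbox inputs would be streamed into
# the running container as a tar archive, roughly:
#
#   tar -C "$SANDBOX" -cf - . | docker exec -i "$CONTAINER" tar -C /sandbox -xf -
#
# The copy semantics are the same as this local pipe between two dirs:
src=$(mktemp -d)
dst=$(mktemp -d)
echo "input data" > "$src/input.txt"
tar -C "$src" -cf - . | tar -C "$dst" -xf -
cat "$dst/input.txt"   # prints "input data" once the pipe completes
```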
f
there is also copying files in to and out of a docker volume
and mount the docker volume into the cached container
(versus using a bind mount)
I saw that as a potential solution; unclear whether it would work for this use case though without some further thought.
longer term, mount an NFSv4 volume into the container and then have pantsd export a NFSv4 server hosting input roots.
w
there is also copying files in to and out of a docker volume
this is essentially the tar-pipe idea… more complicated “copy files into a container” cases seem to suggest tar-pipes. there doesn’t appear to be a way to create a volume “from scratch” containing content.
longer term, mount an NFSv4 volume into the container and then have pantsd export a NFSv4 server hosting input roots.
yea… would be generally useful too.
f
well you could spin up a busybox container to send files in/out of the volume, busybox should spin up pretty fast
and there could be an interesting volume driver in this list: https://docs.docker.com/engine/extend/legacy_plugins/#volume-plugins
w
i think volumes are only useful if we can mount them dynamically on a running container, which sounds like maybe a thing?
f
I don't recall
as an aside, buildbarn comes with a NFSv4 server backed by CAS now
👍 1
w
cc @curved-television-6568, @enough-analyst-54434: anything else jump out at you?
e
Well, IIUC the problem is not even understood yet. The ticket mentions races, but that requires - I don't know, async filesystems? Windows WSL uses plan9fs, which is a network filesystem, very slow to transfer files / edits, and does appear async. It would really help to have the theory of what the problem actually is nailed down in the ticket. At that point I could take a look. I don't even have a clue how we use the exec. Is it write to a volume mount, then serially exec into a container that can see that volume mount? I'd guess so, but I'm just guessing, etc.
If you knew more about the filesystems and what they actually guaranteed you could write the files, then write a token file, then in the exec, use a shim binary that blocked on appearance of the token. If the FS was async but guaranteed order, that would work.
There are just a ton of things it seems like you could do, but knowing the real problem would really help pick.
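the token idea above could be sketched as a small wait loop. the token path and timeout are hypothetical, and this only helps if the filesystem is async but preserves write ordering:

```shell
# Hypothetical shim for the token approach: the host writes all input
# files first and a sentinel token file last; the in-container shim
# blocks until the token is visible, then runs the real process.
wait_for_token() {
  token="$1"
  tries="${2:-50}"          # give up after tries * 0.1s
  i=0
  until [ -e "$token" ]; do
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && return 1
    sleep 0.1
  done
}

# In-container usage would be roughly:
#   wait_for_token /sandbox/.inputs-complete && exec "$@"
```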
w
Is it write to a volume mount, then serially exec into a container that can see that volume mount? I’d guess so, but I’m just guessing, etc.
write to a directory that is bind-mounted into the container. on Linux the mounts are synchronous since there is no virtualization involved (uses overlay2, etc), but on macOS it has used a variety of strategies in the last few years (osxfs, grpc-fuse, virtiofs)
e
Yeah, the latter part you mention in the ticket, but with no strong conclusion that those ~3 are async.
w
it is entirely possible that `virtiofs` plus macOS’s new virtualization framework (which is still an experimental Docker feature) will resolve this
@enough-analyst-54434: well, it stems from osxfs having had lots of settings around consistency levels, which disappeared in grpc-fuse because “they’re not necessary”, but…
e
I guess we have infini-slack now, but this seems like a real bad format for gathering a shared holistic view of the question to even begin to answer.
Plowing on though: your bullets above mention 750ms on Mac. I did spend a good bit of effort not using Docker and getting 100ms starts in ~2018/2019 or so. I used crun + jq.
Is ditching Docker an option?
w
If you knew more about the filesystems and what they actually guaranteed you could write the files, then write a token file, then in the exec, use a shim binary that blocked on appearance of the token. If the FS was async but guaranteed order, that would work.
mm, yea: this is a good thought too. a particularly annoying bit about both `osxfs` and `grpc-fuse` is that they are closed source. and grpc-fuse is deprecating `osxfs`, but has drastically less documentation than `osxfs` did.
e
The whole Mac Windows thing is just super dumb - I agree on that. But I guess we have to support these super weird hoops for non macOS / Windows devs who use them anyhow.
w
Is ditching Docker an option?
maybe, yea. but afaict, all of the macOS-containers systems have this same issue, and have been bouncing between implementations. they seem to be trending toward virtiofs. the first half of this article is good: https://www.cncf.io/blog/2023/02/02/docker-on-macos-is-slow-and-how-to-fix-it/
Plowing on though, your bullets above mention 750ms mac, I did spend a good bit of effort not using Docker and getting 100ms starts in ~2018/2019 or so. I used crun + jq
i fully expect that this is macOS doing virtualization rather than docker itself… latency is much lower on Linux, afaik
e
Well that's future so we can't just wait. AFAICT your send a tar by network idea - presumably you need a shim binary to coordinate? - sounds like the most sane option when not fully understanding the problem.
Because clearly that can work
w
re: your token and filesystem ordering idea… it’s pretty cheap to try. i just don’t have a bulletproof repro, so it might mean more 2.15.x rcs.
e
I personally would not try until I understood the problem.
w
Well that’s future so we can’t just wait. AFAICT your send a tar by network idea - presumably you need a shim binary to coordinate? - sounds like the most sane option when not fully understanding the problem.
no, no need for a shim binary: you can pipe directly into tar. the docker API allows for streaming stdin to a process
e
It banks on a guess, and passing the guess would itself be a guess.
w
I personally would not try until I understood the problem.
see above re: grpc-fuse being closed source… i’m not sure i will actually get an answer there
e
Exactly
i fully expect that this is macOS doing virtualization rather than docker itself… latency is much lower on Linux, afaik
I was bringing - wild variability - 500ms - 1s down to 100ms constant FWIW.
So it helped on Linux too.
👍 1
w
virtiofs is open source (or at least portions of it are), so if it eventually becomes the default (which doesn’t sound guaranteed based on the trend of this thread: https://github.com/docker/roadmap/issues/7), then studying its implementation is an option
e
It sounds to me like 2.15.x maybe needs to be made independent of this stuff? Releases have been dragging out lately - maybe holiday bias - but whatever path here it sounds like ~major surgery for an rc5
w
it’s major surgery, but in the brand new headline feature of the release. i’m less worried about the magnitude of the change, and more worried about delaying things
…or not delaying them enough, i suppose
e
"brand new headline feature of the release"
Ok, that sounds like marketing speak to me.
w
yea, agreed.
e
We don't really do feature releases IIUC. Although we also do - we now blog about things, it's a bit markety
w
yea, it’s tricky.
the feature is marked experimental, which is supposed to be another way to disconnect releases from feature stability
but … i still want to try to meet some quality bar
e
I guess as long as we're honest in the marketing, i.e. don't for this release, or do but note it's broken for Mac
h
We can change the headline, FWIW
w
tar pipe approach is still feasible, but more challenging than i initially thought, since without the bind mount it would need to be bidirectional: a pipe in for inputs, and a pipe out for outputs.
c
cc @curved-television-6568, @enough-analyst-54434: anything else jump out at you?
nothing obvious, no.
w
https://github.com/pantsbuild/pants/pull/18225 … using a tar pipe is about 30% slower than using a mount.
that makes me wonder whether a better approach would be to conditionally disable container caching instead… because at least container start time is a relatively fixed overhead.
…i think that i’m going to bang out a flag to disable the container cache as well, and then we can land the docker-strategy flag as a trinary option.
nevermind… not worth the effort yet. but i’ll leave space in the option name for additional strategies in the future.
argh! symlink support in Digests doesn’t exist in 2.15.x, so a significant portion of the infrastructure for https://github.com/pantsbuild/pants/pull/18225 isn’t available, and it can’t be picked cleanly. i’m going to revert it from `main`, and land something to disable the container cache instead. what a mess.
i’ve gone down the path of introspecting the files in the container to try and wait until they have been created. the “wait for a single file written after all other inputs were written” approach (the token approach that John suggested) was not successful, so i proceeded to exec’ing `stat` for all of the inputs.
(Tom’s repro was invaluable: thanks Tom)
the wild thing right now is that `stat` shows that all input files for a task exist, including files which the process claims don’t exist: https://gist.github.com/stuhood/6c78ab4a9e511b14df2c3247962604ee
i.e. `__pkgs__/internal_reflectlite/__pkg__.a` is claimed missing, but `stat` from within the container shows it as existing.
one unknown from looking at this though is that not all of these `.a` files have the same permissions… which is fishy, but i don’t know how it could result in a “no such file or directory” error
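the retry loop behind that probe is cheap to sketch. container plumbing is elided here: in the real case each `stat` would run via `docker exec "$CONTAINER" stat …` (container name hypothetical), but the loop itself is plain shell:

```shell
# Hypothetical probe: stat every expected input, succeeding only once
# all of them are visible. stat consults inode metadata only.
all_inputs_visible() {
  dir="$1"; shift
  for f in "$@"; do
    stat "$dir/$f" > /dev/null 2>&1 || return 1
  done
  return 0
}
```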
e
Adding to the fishiness is the access / modify / change timestamp set. Very different from all others.
Seems like a good thing to drill on.
Also a super big file.
w
yea… that seems like it could have something to do with the Link count. large files are now using hardlinks on `main` (but not on `2.15.x`… sigh)
that probably explains the permissions difference too.
e
Is this just a dangling symlink in a tar then? Not fully following. The size stat says no.
w
no tars here anymore: this is a bind mount, with a real file behind it
e
Also claims regular file.
Ah, gotcha.
I mean ... stat is just inode IIUC, don't know impl. If you had a perverse file system that flushed inode metadata before data blocks all hit disk and further reported not all data blocks there as missing file ... crazy - not it.
w
…true. sheesh. i suppose i could try opening everything.
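a read-probe variant is easy to try: unlike `stat`, `head -c1` actually opens each file and reads a byte, which would also catch a filesystem surfacing inode metadata before data blocks (a guess, per the above; names hypothetical):

```shell
# Hypothetical read-probe: open each input and read one byte, rather
# than consulting inode metadata only.
all_inputs_readable() {
  dir="$1"; shift
  for f in "$@"; do
    head -c1 "$dir/$f" > /dev/null 2>&1 || return 1
  done
  return 0
}
```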
e
So ... can you remind me whether or not injecting a shim binary in the image is acceptable / whether local remexec can be used instead - with internal remexec binaries / mini cluster?
Obviously big change of mechanism.
w
it would be acceptable… but i’m not sure it is in scope
e
Well, everything is broken so scope may have to creep to ship unbroken software for Mac - but yeah - definitely bigger scope.
f
You could also consider a shim binary not for REAPI but just for un-tar'ing an input root sent over
an in-container Pants supervisor process
w
that’s what the `tar-pipe` change did.
supervisor was `tar`, heh
f
right, but you can run the in-container process all the time to avoid having to `docker exec` it
more of an optimization on the technique really
separately, https://virtio-fs.gitlab.io/ is experimental in docker for macOS 4.6 as a replacement for gRPC-FUSE
w
yea. unfortunately, i’d need to upgrade my macOS to use it. @fast-nail-55400: are you able to try it and see whether it repros?
e
Totally left field is a custom volume driver. But I don't know the API or if it could be warped for this sort of thing.
Say that did work, @witty-crayon-22786: is asking others to do so on the table?
w
absolutely: for an experimental feature, absolutely.
e
Well, think ahead
How long must that block?
experimental forever?
w
the tar-pipe solution will be viable in 2.16.x (although still slower presumably)
e
I thought that was not a solution though? output files?
f
I'll give the virtiofs a try
w
thank you
@enough-analyst-54434: the reason tar-pipe was reverted is that it can’t be cherry-picked to 2.15.x… not (necessarily) because it didn’t work. i didn’t have Tom’s repro at the point when i merged it
e
Ok. We did have a known output files bug waiting though also IIUC
w
unconfirmed though… gedanken, if you will.
because Tom’s repro case is very good… you get a repro basically every time. but never for outputs
e
Ok. I am super uncomfortable with gedanken + promo-flash-sale but I can back off that. I'm a bit alone there I think.
f
virtiofs seems to help. I didn't see a failure from missing files. (Although now I see a Go-specific error from the specific test which happens on subsequent invocations so probably not the race condition.)
ProcessExecutionFailure: Process 'Link Go binary: ./package_analyzer' failed with exit code 1.
stdout:
loadinternal: cannot find runtime/cgo

stderr:
2023/02/15 22:10:13 reference to undefined builtin "runtime.duffzero" from package "runtime"
w
hmmmm
can you … try a few times? i.e. with `--no-local-cache`?
f
sure
ok just happens less frequently
got:
/usr/local/go/src/syscall/env_unix.go:12:2: could not import runtime (open __pkgs__/runtime/__pkg__.a: no such file or directory)
w
thank you. and this is most recent macOS with virtiofs?
f
macOS 13.2.1 (22D68)
Docker 4.16.2 (95914)
x86
ran 3 times, 1 failed due to missing files, the other two failed due to a Go backend bug with environments likely
(`./pants_from_sources --no-local-cache package race:racy_docker`)
👍 1
scie-pants 0.5.1
w
an unfortunate update from the filesystem probing front is that John’s hunch about `stat` vs `open` was partially correct: `wc -c` will report “No such file or directory” in some cases, and then eventually stabilize on agreeing that all files exist. but then `go` will still fail to find some inputs.
e
My god.
w
yea. i’m about out of rope here.
…actually. two more things to guess and check at: 1. disabling hardlinking, 2. trying mounting sandboxes in a non-`tmp` filesystem.
@fast-nail-55400: thanks again for trying that.
👍 1
f
an alternative to Docker Desktop on MacOS is Lima: https://github.com/lima-vm/lima
(and it includes containerd support)
another alternative is https://github.com/abiosoft/colima
w
afaik, they use virtiofs as well now…? but who knows where the consistency disconnect is
sonofagun. no repro with hardlinking disabled. but WTH… we don’t hardlink on `2.15.x`, so i don’t understand how folks were seeing this there.
!? … @enough-analyst-54434: can you imagine there having been hardlinks involved in the original report: https://github.com/pantsbuild/pants/issues/18162 …? perhaps something to do with how PEX invokes itself?
yea, confirmed… no repro across 3 or 4 runs of the `go` case with hardlinks disabled (by raising this limit), and no extra filesystem synchronization. so the main question now is: how did someone observe this in 2.15.x, given that we hadn’t started hardlinking things at that point
even if we can’t really imagine a case for PEX with hardlinks, i had been considering moving the docker `named_caches` into a volume anyway, so might be willing to guess and check there
e
Pex will hardlink files under certain conditions from the Pex cache, but not the venv PEX `pex` script. The original issue does not have enough info to determine what `./pex` is afaict. Is that the Pex PEX? If so, that’s a PEX file materialized by the engine.
w
yes: it’s the pex-pex: there is an expander in the description with more info
e
Ok, well what concoction do you have in mind? PEX hasn't even executed yet, right?
`./pex` is not found
w
i’m looking in `src/python/pants/backend/python/util_rules/pex_cli.py` now, but… we must be invoking it as `python ./pex` …?
e
Yup
So, I think that's a fish and a miss
w
i’m not sure yet… the error looks pretty different on my machine, so i wonder whether there are wrappers in there
$ python3 ./does-not-exist
/Users/stuhood/.pyenv/versions/3.6.10/bin/python3: can't open file './does-not-exist': [Errno 2] No such file or directory
i mean, PEX is itself a venv PEX, right? and it will re-exec?
e
I honestly don't remember.
But re-exec happens for zipapp too. So not the right question really.
I'm headed out for a few hours.
w
see you later
thanks for talking this through
…you know what, the original file would have to exist. otherwise, how would it get into a “<frozen zipimport>” codepath?
this is not definitive, but it feels like enough for me to justify making the named cache a volume rather than a bind mount, since i had already been looking at that.
…ah!!! the original report was in `2.16.0.dev5`! damnit
let me look at the other one.
it was on `main` as well! so that’s it then. holy crap.
e
One thing I'm lost on still is what it is. Turning hard links off fixes, but why does having them break? Have you sussed that @witty-crayon-22786?
w
My guess is that the virtual filesystems in play here don't handle them well. But no real idea why. Will probably open an issue with docker if I can isolate it further.
e
Ok. One thing to note about PEX hardlinking is it's all small stuff save for the odd mega-.so