# development
e
has anyone figured out a fix for this issue beside restarting your machine? I’ve run into it a few times in the last two days:
Exception: Could not initialize store for process cache: "Error making env for store at \"/private/var/folders/cf/z3ktw6dn467gvm24fgx8ft6c0000gp/T/tmpk1bj57ql/processes/7\": No space left on device"
Which is more frequent than usual
h
yes @hundreds-breakfast-49010 has been working on a fix, but is blocked by us having no idea what’s causing this one specific CI failure. https://github.com/pantsbuild/pants/pull/8892 We’d love your eyes on it if you have some time
e
thanks for the context. I’ll take a look if I have some cycles this week.
h
Yeah, I would love to merge this commit, but I'm not sure what's causing CI to fail on it
w
as daniel pointed out on that PR, it doesn't seem like a fix... the bug appears to be around not cleaning things up in tmp
(certainly the need to restart your machine is)
(sorry: i know i'm not contributing in a useful way here)
h
@witty-crayon-22786 yeah it's a mitigation, not a fix
the problem with temp cleanup not happening is that we can't reliably make sure rust destructors get run if the program fails when in python code
which is a separate issue that in principle affects more things besides tmp cleanup
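For context, a minimal sketch of why that matters (not Pants code; TempStore is a stand-in type): Drop only runs when a value is dropped via normal control flow or unwinding, so an abrupt process exit while control is on the Python side skips cleanup entirely.
```rust
// Minimal sketch (not Pants code) of why cleanup gets skipped: `Drop`
// only runs on normal scope exit or unwinding, so an abrupt process
// exit while control is elsewhere (e.g. on the Python side of the FFI)
// never reaches the destructor.

struct TempStore;

impl Drop for TempStore {
    fn drop(&mut self) {
        // In the real engine this would be where tmp state gets cleaned up.
        println!("cleaning up temp store");
    }
}

fn main() {
    let _store = TempStore;

    // Simulates the program dying while outside Rust's control flow:
    // `exit` terminates immediately, and `drop` above is never called.
    std::process::exit(1);
}
```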
w
yea, true. but that is only if we segfault
and i don't think folks are segfaulting left and right? ... if we are, that would be important to figure out
e
Nothing abnormal happened to me while testing between now and the last time I fixed this by restarting
well nothing that looked abnormal on the console
h
@witty-crayon-22786 I don't think it's just segfaults, it's any time execution crosses FFI into python between when a rust object is created and when the program exits
I'm not 100% sure about this, but that seems consistent with the behavior I've been seeing
w
that... doesn't seem likely.
it would have to somehow prevent Drop in a completely separate part of the code
and in a very particular part of the code: the process execution codepath, which doesn't touch python objects
h
hm, this reminds me that in my own replications of the running out of disk space thing, I saw an error that looked like it was coming from pthreads based on googling
so, maybe what's happening is that if python throws an exception in the main thread (which is what happens in the out of disk space case), a separate thread doing process execution stuff gets aborted
and in that case destructors would not be run
so maybe that's the real underlying issue
but of course here the reason pants is throwing an exception is precisely because it's run out of disk space
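A hedged sketch of that thread theory (stand-in names, not the engine's actual types): when the main thread exits, remaining threads are killed without unwinding, so any Drop impls they own never run.
```rust
// Hypothetical sketch of the thread-abort theory above (stand-in names,
// not the engine's actual types): when the main thread exits, remaining
// threads are killed without unwinding, so their `Drop` impls never run.

use std::{thread, time::Duration};

struct ProcessCacheGuard; // pretend this owns on-disk state needing cleanup

impl Drop for ProcessCacheGuard {
    fn drop(&mut self) {
        println!("worker cleanup ran"); // never printed in this sketch
    }
}

fn main() {
    thread::spawn(|| {
        let _guard = ProcessCacheGuard;
        // Pretend this is long-running process-execution work.
        thread::sleep(Duration::from_secs(10));
    });

    // Main returns almost immediately, analogous to an uncaught exception
    // ending the run; the worker thread is torn down mid-sleep and the
    // guard's destructor is skipped.
    thread::sleep(Duration::from_millis(50));
}
```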
w
> so, maybe what's happening is that if python throws an exception in the main thread (which is what happens in the out of disk space case), a separate thread doing process execution stuff gets aborted
python is not producing that exception: rust is
h
I don't recall seeing that error message when I was testing locally on my system
what I was seeing was what looked like an io error message string from rust
about running out of disk space
getting propagated up to
w
sure
my point is: python isn't encountering the exception: rust is
h
I wonder if maybe what you're seeing on your machine and what I'm seeing on my machine are different
it's been a few days since I was replicating this, but I don't remember ever seeing that error message from sharded_lmdb
e
possibly. That error was coming from python when it was trying to init the scheduler
I restarted so I think I lost the rest of the log
h
oh wait, that's the error @early-needle-54791 was seeing
it's possible that there are multiple places where rust could try to do some io operation and run into the same out of disk space error, which would yield the same error message string in an Err result
but that's not going to cause the program itself to crash until that Err result gets passed over the FFI boundary as a python exception and then actually raised in python without being caught in python
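To illustrate that flow (hedged, with illustrative names rather than the engine's real API): an out-of-disk failure just produces an Err value that is carried along as data, and nothing crashes until the Python side receives it and raises it.
```rust
// Hedged sketch of that flow (illustrative names, not the engine's real
// API): an ENOSPC failure just produces an `Err` value that is carried
// along as data; nothing "crashes" until the embedding Python layer
// receives the error over the FFI and raises it as an exception.

use std::{fs, io};

fn make_store_env(path: &str) -> Result<(), io::Error> {
    // Several different io operations could fail here with the same
    // "No space left on device" message in the error string.
    fs::create_dir_all(path)?;
    fs::write(format!("{}/data.mdb", path), b"")?;
    Ok(())
}

fn init_scheduler(path: &str) -> Result<(), String> {
    // The error stays a plain value while it propagates through Rust...
    make_store_env(path)
        .map_err(|e| format!("Could not initialize store for process cache: {}", e))
    // ...and only becomes a visible failure once Python raises it.
}
```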
e
from what I can tell, python is initializing the engine, which fails, and then python raises that exception, which comes from an error that has been cast into a python exception type, or maybe just a string handle?
Thats how it showed up this time.
I could only find one instance of native_engine.so in the temp directory that it was complaining had no space.
I removed it but the issue didn’t go away, so I restarted
disk usage didn’t seem particularly high, but I’m not all that familiar with osx partitioning
w
so, the "out of space" error is separate from whatever is actually causing us to run out of space: especially since it requires a restart to resolve.
once you're out, all sorts of things could fail.
h
on linux I'm able to recover from the error by clearing out a bunch of files in /tmp
but that doesn't seem to be possible on a mac
e
agreed. Not sure how to do the equivalent on mac
g
FWIW this is something I’ve come across regularly as of late, typically hitting the (out of space) error after 3-5 runs of ./pants --no-v1 --v2 test, and it’s only resolved after a reboot. If there is any further digging / context I can provide to help diagnose this (for macs - I’m on an MBA with the latest macOS release) please let me know.
w
and the space is all taken up inside /tmp?
i can try to take a look tonight. but want to be sure of the repro and what to look for
@gentle-wolf-58752: which target in particular were you iterating on?
g
it’s using /private/var/folders/fv/<...>
on mac, errors look like
Exception: Could not initialize store for process cache: "Error making env for store at \"/private/var/folders/fv/9ly0y0dj1g946plb36nm4s1m0000gn/T/tmpv8o2rfya/processes/8\": No space left on device"
Checking that dir after the run fails shows that the referenced dirs aren’t there
I just repro’d using:
./pants --no-v1 --v2 test src/python/pants/rules/core:tests --test-debug
w
ok. yea, i repro after a few runs of roughly that command. but there is no change in tmp size before and after.
[screenshots: tmp size before and after the repro]
e
Which is what I was seeing also
w
which makes me suspect virtual memory rather than actual files.
and lmdb itself.
would folks mind cherry-picking https://github.com/pantsbuild/pants/pull/8933 and seeing whether it fixes this issue for them? (after a restart)
❤️ 2
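For context on the lmdb / virtual-memory suspicion, an illustrative sketch using the lmdb crate (this is an assumption about the mechanism, not necessarily what #8933 actually changes): LMDB maps map_size bytes of address space per environment up front, so the reservation, rather than the bytes actually written, can be what runs out when many store shards are opened under tmp.
```rust
// Illustrative sketch of the virtual-memory suspicion using the `lmdb`
// crate (not necessarily what #8933 changes): LMDB maps `map_size`
// bytes of address space per environment up front, so the reserved
// size, rather than the bytes actually written, can be what runs out
// when many store shards are opened under tmp.

use lmdb::Environment;
use std::path::Path;

fn open_store(path: &Path, map_size: usize) -> Result<Environment, lmdb::Error> {
    Environment::new()
        // Shrinking this per-shard reservation is one plausible
        // mitigation along the lines being discussed.
        .set_map_size(map_size)
        .set_max_dbs(1)
        .open(path)
}
```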
h
if this fixes the problem, do we still need to figure out the problem with https://github.com/pantsbuild/pants/pull/8892 ?
w
Imo, no... unless we think that the whl size improvement is worth it
g
#8933 looks to fix the issue for me! i’ve done ~10 test runs so far without issue
💯 2
thanks!