# development
e
has anyone figured out a fix for this issue beside restarting your machine? I’ve run into it a few times in the last two days:
Exception: Could not initialize store for process cache: "Error making env for store at \"/private/var/folders/cf/z3ktw6dn467gvm24fgx8ft6c0000gp/T/tmpk1bj57ql/processes/7\": No space left on device"
Which is more frequent than usual
h
yes @hundreds-breakfast-49010 has been working on a fix, but is blocked by us having no idea what’s causing this one specific CI failure. https://github.com/pantsbuild/pants/pull/8892 We’d love your eyes on it if you have some time
e
thanks for the context. I’ll take a look if I have some cycles this week.
h
Yeah, I would love to merge this commit, but I'm not sure what's causing CI to fail on it
w
as daniel pointed out on that PR, it doesn't seem like a fix... the bug appears to be around not cleaning things up in tmp
(certainly the need to restart your machine is)
(sorry: i know i'm not contributing in a useful way here)
h
@witty-crayon-22786 yeah it's a mitigation, not a fix
the problem with temp cleanup not happening is that we can't reliably make sure rust destructors get run if the program fails when in python code
which is a separate issue that in principle affects more things besides tmp cleanup
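For context, a minimal sketch of why that matters (not Pants code; TempStore is a stand-in type): Drop only runs when a value is dropped via normal control flow or unwinding, so an abrupt process exit while control is on the Python side skips cleanup entirely.
```rust
// Minimal sketch (not Pants code) of why cleanup gets skipped: `Drop`
// only runs on normal scope exit or unwinding, so an abrupt process
// exit while control is elsewhere (e.g. on the Python side of the FFI)
// never reaches the destructor.

struct TempStore;

impl Drop for TempStore {
    fn drop(&mut self) {
        // In the real engine this would be where tmp state gets cleaned up.
        println!("cleaning up temp store");
    }
}

fn main() {
    let _store = TempStore;

    // Simulates the program dying while outside Rust's control flow:
    // `exit` terminates immediately, and `drop` above is never called.
    std::process::exit(1);
}
```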
w
yea, true. but that is only if we segfault
and i don't think folks are segfaulting left and right? ... if we are, that would be important to figure out
e
Nothing abnormal happened to me while testing between now and the last time I fixed this by restarting
well nothing that looked abnormal on the console
h
@witty-crayon-22786 I don't think it's just segfaults, it's any time execution crosses FFI into python between when a rust object is created and when the program exits
I'm not 100% sure about this, but that seems consistent with the behavior I've been seeing
w
that... doesn't seem likely.
it would have to somehow prevent Drop in a completely separate part of the code
and in a very particular part of the code: the process execution codepath, which doesn't touch python objects
h
hm, this reminds me that in my own replications of the running out of disk space thing, I saw an error that looked like it was coming from pthreads based on googling
so, maybe what's happening is that if python throws an exception in the main thread (which is what happens in the out of disk space case), a separate thread doing process execution stuff gets aborted
and in that case destructors would not be run
so maybe that's the real underlying issue
but of course here the reason pants is throwing an exception is precisely because it's run out of disk space
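A hedged sketch of that thread theory (stand-in names, not the engine's actual types): when the main thread exits, remaining threads are killed without unwinding, so any Drop impls they own never run.
```rust
// Hypothetical sketch of the thread-abort theory above (stand-in names,
// not the engine's actual types): when the main thread exits, remaining
// threads are killed without unwinding, so their `Drop` impls never run.

use std::{thread, time::Duration};

struct ProcessCacheGuard; // pretend this owns on-disk state needing cleanup

impl Drop for ProcessCacheGuard {
    fn drop(&mut self) {
        println!("worker cleanup ran"); // never printed in this sketch
    }
}

fn main() {
    thread::spawn(|| {
        let _guard = ProcessCacheGuard;
        // Pretend this is long-running process-execution work.
        thread::sleep(Duration::from_secs(10));
    });

    // Main returns almost immediately, analogous to an uncaught exception
    // ending the run; the worker thread is torn down mid-sleep and the
    // guard's destructor is skipped.
    thread::sleep(Duration::from_millis(50));
}
```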
w
> so, maybe what's happening is that if python throws an exception in the main thread (which is what happens in the out of disk space case), a separate thread doing process execution stuff gets aborted
python is not producing that exception: rust is
h
I don't recall seeing that error message when I was testing locally on my system
what I was seeing was what looked like an io error message string from rust
about running out of disk space
getting propagated up to
w
sure
my point is: python isn't encountering the exception: rust is
h
I wonder if maybe what you're seeing on your machine and what I'm seeing on my machine are different
it's been a few days since I was replicating this, but I don't remember ever seeing that error message from sharded_lmdb
e
possibly. That error was coming from python when it was trying to init the scheduler
I restarted so I think I lost the rest of the log
h
oh wait, that's the error @early-needle-54791 was seeing
it's possible that there are multiple places where rust could try to do some io operation and run into the same out of disk space error, which would yield the same error message string in an Err result
but that's not going to cause the program itself to crash until that Err result gets passed over the FFI boundary as a python exception and then actually raised in python without being caught in python
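To illustrate that flow (hedged, with illustrative names rather than the engine's real API): an out-of-disk failure just produces an Err value that is carried along as data, and nothing crashes until the Python side receives it and raises it.
```rust
// Hedged sketch of that flow (illustrative names, not the engine's real
// API): an ENOSPC failure just produces an `Err` value that is carried
// along as data; nothing "crashes" until the embedding Python layer
// receives the error over the FFI and raises it as an exception.

use std::{fs, io};

fn make_store_env(path: &str) -> Result<(), io::Error> {
    // Several different io operations could fail here with the same
    // "No space left on device" message in the error string.
    fs::create_dir_all(path)?;
    fs::write(format!("{}/data.mdb", path), b"")?;
    Ok(())
}

fn init_scheduler(path: &str) -> Result<(), String> {
    // The error stays a plain value while it propagates through Rust...
    make_store_env(path)
        .map_err(|e| format!("Could not initialize store for process cache: {}", e))
    // ...and only becomes a visible failure once Python raises it.
}
```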
e
from what I can tell, python is initializing the engine, which fails, and then python raises that exception, which comes from an error that has been cast into a python exception type, or maybe just a string handle?
Thats how it showed up this time.
I could only find one instance of native_engine.so in the temp directory that it was complaining had no space.
I removed it but the issue didn’t go away, so I restarted
disk usage didn’t seem particularly high, but I’m not all that familiar with osx partitioning
w
so, the "out of space" error is separate from whatever is actually causing us to run out of space: especially since it requires a restart to resolve.
once you're out, all sorts of things could fail.
h
on linux I'm able to recover from the error by clearing out a bunch of files in /tmp
but that doesn't seem to be possible on a mac
e
agreed. Not sure how to do the equivalent on mac
g
FWIW this is something I’ve come across regularly as of late, typically hitting the (out of space) error after 3-5 runs of ./pants --no-v1 --v2 test, and it’s only resolved after a reboot. If there is any further digging / context I can provide to help diagnose this (for macs - I’m on an MBA with the latest macOS release) please let me know.
w
and the space is all taken up inside /tmp?
i can try to take a look tonight. but want to be sure of the repro and what to look for
@gentle-wolf-58752: which target in particular were you iterating on?
g
it’s using /private/var/folders/fv/<...>
on mac, errors look like
Exception: Could not initialize store for process cache: "Error making env for store at \"/private/var/folders/fv/9ly0y0dj1g946plb36nm4s1m0000gn/T/tmpv8o2rfya/processes/8\": No space left on device"
Checking that dir after the run fails shows that the referenced dirs aren’t there
I just repro’d using:
./pants --no-v1 --v2 test src/python/pants/rules/core:tests --test-debug
w
ok. yea, i repro after a few runs of roughly that command. but there is no change in tmp size before and after.
[screenshots: tmp size before and after the repro]
e
Which is what I was seeing also
w
which makes me suspect virtual memory rather than actual files.
and lmdb itself.
would folks mind cherry-picking https://github.com/pantsbuild/pants/pull/8933 and seeing whether it fixes this issue for them? (after a restart)
❤️ 2
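For context on the lmdb / virtual-memory suspicion, an illustrative sketch using the lmdb crate (this is an assumption about the mechanism, not necessarily what #8933 actually changes): LMDB maps map_size bytes of address space per environment up front, so the reservation, rather than the bytes actually written, can be what runs out when many store shards are opened under tmp.
```rust
// Illustrative sketch of the virtual-memory suspicion using the `lmdb`
// crate (not necessarily what #8933 changes): LMDB maps `map_size`
// bytes of address space per environment up front, so the reserved
// size, rather than the bytes actually written, can be what runs out
// when many store shards are opened under tmp.

use lmdb::Environment;
use std::path::Path;

fn open_store(path: &Path, map_size: usize) -> Result<Environment, lmdb::Error> {
    Environment::new()
        // Shrinking this per-shard reservation is one plausible
        // mitigation along the lines being discussed.
        .set_map_size(map_size)
        .set_max_dbs(1)
        .open(path)
}
```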
h
if this fixes the problem, do we still need to figure out the problem with https://github.com/pantsbuild/pants/pull/8892 ?
w
Imo, no... unless we think that the whl size improvement is worth it
g
#8933 looks to fix the issue for me! i’ve done ~10 test runs so far without issue
💯 2
thanks!