# development
r
Hello! I’m constantly running into this issue for the last couple of days, if anyone has any insight: When I run
./pants test tests/python/pants_test/backend/project_info/tasks:test_export_dep_as_jar
which has 10-15 unit tests, at some point I start getting
Exception: Could not initialize store for process cache: "Error making env for store at \"/private/var/folders/rm/zwytnntn5013gslf3xlv31rc0000gp/T/tmp6y9cz45d/processes/7\": No space left on device
in all the tests, which until now could only be solved by restarting. This time I restarted, but the problem persisted. Things that I’ve checked:
• I have plenty of memory left, and I’m not swapping.
• The volume has enough space and inodes left.
• There are no deleted files held by processes.
Any ideas?
😢 1
h
@dry-analyst-73584 has been running into this problem frequently too
Could it be a space leak by the engine?
r
But I just restarted, and the disk has enough space left
🤔 1
w
"space" here being virtual memory.
👍 1
LMDB is very virtual memory heavy.
but unless there are orphaned processes, it shouldn't be held.
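To make the virtual-memory point concrete, here is a minimal sketch using the Python `lmdb` bindings (illustration only; the Pants store opens LMDB from Rust, but the mapping behaviour is the same): each open environment memory-maps `map_size` bytes of address space up front, so N leaked stores reserve N × map_size of virtual memory even while the data files on disk stay tiny.

```python
# Illustration with py-lmdb (assumed installed); the real store is opened from Rust.
import lmdb

# Opening an environment reserves ~map_size bytes of *virtual* address space,
# regardless of how little data is actually written.
env = lmdb.open("/tmp/example-lmdb-store", map_size=1 * 1024 ** 3)  # 1 GiB mapping
with env.begin(write=True) as txn:
    txn.put(b"key", b"value")  # on-disk usage stays in the KB range
env.close()  # closing releases the mapping; a leaked environment does not
```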
r
That would be checked by a good ol’ `ps aux | grep pants`, right?
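A slightly richer check than `ps aux` (sketch only; assumes the third-party `psutil` package) lists any surviving pants processes together with their virtual memory size, which is the "space" those LMDB mappings consume:

```python
# Sketch: list lingering pants/pantsd processes and their virtual memory footprint.
import psutil

for proc in psutil.process_iter(attrs=["pid", "name", "cmdline"]):
    try:
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "pants" in (proc.info["name"] or "") or "pants" in cmdline:
            vms_gib = proc.memory_info().vms / 1024 ** 3
            print(f"{proc.info['pid']:>7}  {vms_gib:6.1f} GiB virtual  {cmdline[:80]}")
    except psutil.Error:
        continue  # process exited or access was denied; skip it
```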
w
@red-balloon-89377: are you able to try with 998ffbb7c86ca33a2156b002a9f18eb6f1e425b8 reverted? ... might not be a clean revert, unfortunately.
👍 1
h
Also you’re using the V1 test runner and getting this? Huh. I think I’ve only gotten it using the V2 test runner (although this may be availability bias - I only ever use V2 now)
w
my guess is that within a single run schedulers are leaking, such that we have N stores open
👍 1
r
@hundreds-father-404 This is the command line, so I think I’m using v1:
./pants test tests/python/pants_test/backend/project_info/tasks:test_export_dep_as_jar
@witty-crayon-22786 trying now…
h
+1 that that PR is likely culpable
w
@hundreds-father-404: did that change end up being necessary...?
h
Yes, it was. It was the only way we could land the Pytest upgrade, due to that `ZipError` issue John ran into last year
We needed the Pytest upgrade so that we could use `pytest-cov` to retry flakes
w
oof. k.
r
Still breaks
I’m going to try to restart my computer and try then
👍 1
h
Another thing you can try is using passthrough args via `--pytest-args='-k test_foo'`
r
But I want to run every test in that file (I’m doing refactorings, so I want to constantly check that I’m not breaking anything)
Okay, reverting 998ffbb7c86ca33a2156b002a9f18eb6f1e425b8 and restarting works for now; will report if things break
It works even without that commit reverted, so I’m truly baffled now
😕 1
This has been happening again (without the revert) every 3rd or 4th run of the suite. Going to try with the revert
h
Revert locally or revert in CI? Reverting in CI won’t work cleanly because we would have to roll back the Pytest upgrade
r
Locally
👍 1
It’s a pain to maintain the revert in the branch, but if it unblocks things, it’s worth it 🙂
h
I wonder what it will take to fix the issues. Stu’s intuition sounds right that we’re not cleaning up the scheduler after every test properly
r
Yeah, that correlates with the observations. However, I don’t know where we would keep them around between runs, because currently I’m not using pantsd at all, so all the memory should be freed
h
An important detail is that Pytest runs tests sequentially. Once a single test finishes, it’s supposed to be cleaned up and only then does the next test start. This implies that the issue is cleanup, rather than too many tests setting up schedulers at the same time
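If that cleanup hypothesis is right, the fix amounts to guaranteeing a teardown per test. A hedged sketch of the pattern (the `Scheduler` class and its `close()` method here are hypothetical stand-ins, not the real Pants test-base API):

```python
# Hypothetical sketch of per-test scheduler cleanup; `Scheduler`/`close()` are
# stand-ins for whatever the Pants test base actually constructs per test.
import pytest


class Scheduler:
    """Stand-in for a native scheduler that opens an LMDB-backed store."""

    def __init__(self) -> None:
        self.closed = False  # pretend this opened a store and mapped memory

    def close(self) -> None:
        self.closed = True  # pretend this releases the store and its mappings


@pytest.fixture
def scheduler():
    s = Scheduler()
    try:
        yield s
    finally:
        # Without this teardown, each sequential test leaks one open store,
        # which matches the "N stores open" guess above.
        s.close()


def test_uses_scheduler(scheduler):
    assert not scheduler.closed
```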
r
But then the behaviour of “once it fails, it fails until restart” is not explained; it should either always fail or not fail at all
As in, it seems pretty deterministic
👍 1
h
Yes I think the bad state is being persisted across pants tests and even pants runs, until you force a clean via restart
r
Yeah, so we are back at “where do we keep that state”. I don’t think it’s lmdb, and also probably not a process in the OS. Actually, let me check that last thing
👍 1
yeap, no, we don’t keep any process named “pants” around
h
I've been noticing this as well
what I'm seeing on my linux machine is that my /tmp dir is getting filled up with pants-related files
and also the amount of space allocated to /tmp by default is actually fairly small, 4 GB on my system
but yeah I have a bunch of `process-execution-<random>` directories in /tmp
if I wipe them out it goes away, but this is the 2nd time today I've done that
👍 1
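A quick way to see how much of /tmp those leftover directories are taking (sketch only; assumes they all match the `process-execution-*` naming, and that deleting them is only safe while no pants run is active):

```python
# Sum the size of leftover process-execution-* workdirs under /tmp.
from pathlib import Path

TMP = Path("/tmp")
total = 0
for d in sorted(TMP.glob("process-execution-*")):
    size = sum(f.stat().st_size for f in d.rglob("*") if f.is_file())
    total += size
    print(f"{size / 1024 ** 2:8.1f} MiB  {d}")
print(f"{total / 1024 ** 3:8.2f} GiB total")
```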
r
It just happened again, tried wiping /tmp, but no such luck.
h
are you using Pantsd?
d
I ran into this again yesterday 😞
r
are you using Pantsd?
No 😞
h
I've just had this happen when running integration tests in the pants repo: the outer test run is fine, but the underlying pants runs that the tests invoke get this error, and so the tests fail. But then running some non-integration tests continues to be fine.
Also, running pants in another repo continues to be fine.
h
so, I'm looking into this now, and I can't seem to replicate the "no space left on device" error that we were all seeing ~2 weeks ago
I am seeing my `/tmp` directory fill up with tmpdirs as tests run, to a maximum of 15% or 25% full depending on which tests I'm running (my /tmp file system is 4 GB)
I'm also not seeing those `process-execution-<random>` directories show up, so maybe that specifically was the problem
and it looks like pants is cleaning up /tmp dir files even if I kill the tests with ctrl-C
it looks like some nailgun code is invoking a method that creates tmpdirs with `process-execution` as its prefix; maybe the problem was only ever nailgun-specific tests?
h
I think @dry-analyst-73584 encountered it when running Python tests that didn’t use nailgun. Do we only use nailgun for JVM-related tests?
h
that's just a guess on my part. I'm looking at the code that creates tmpdirs called `process-execution-<random>`, which isn't actually just nailgun, my mistake
(specifically in `process_execution/src/lib.rs`, in the function `run_and_capture_workdir`)
👍 1
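For illustration only, a Python analogue of what that Rust helper does (the real implementation lives in `process_execution/src/lib.rs`): it creates a workdir named `process-execution-<random>` under the system temp dir, and those directories accumulate unless something explicitly removes them.

```python
# Python analogue of the Rust workdir creation, purely for illustration.
import shutil
import tempfile

workdir = tempfile.mkdtemp(prefix="process-execution-")
try:
    print(f"running a hypothetical process in {workdir}")
finally:
    # Skipping this cleanup on every execution is exactly what fills up /tmp.
    shutil.rmtree(workdir, ignore_errors=True)
```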
h
https://github.com/pantsbuild/pants/pull/8621 is likely very relevant if you haven’t already looked at it. I think it results in setting up far more schedulers than we did before. Unfortunately, we can’t revert this PR because we need it to use modern Pytest
h
I'll take a look
but either way, I can't seem to replicate the problem on my own machine right now
and I definitely was seeing it a while ago
h
fwiw, I haven’t encountered it in the past 3 days
h
oh I remember this commit, yeah
this was definitely a problem ~2 weeks ago, and if neither of us has seen it in the past 3 days, maybe it got fixed by some recent commit? Of course, it would be good to know which commit that was, if that's the case