# development
r
Hello! I’m constantly running into this issue for the last couple of days, if anyone has any insight: When I run
./pants test tests/python/pants_test/backend/project_info/tasks:test_export_dep_as_jar
which has 10-15 unit tests, at some point I start getting
Exception: Could not initialize store for process cache: "Error making env for store at \"/private/var/folders/rm/zwytnntn5013gslf3xlv31rc0000gp/T/tmp6y9cz45d/processes/7\": No space left on device
in all the tests, which until now could only be solved by restarting. This time I restarted, but the problem persisted. Things that I’ve checked:
• I have plenty of memory left, and I’m not swapping.
• The volume has enough space and inodes left.
• There are no deleted files held by processes.
Any ideas?
😢 1
h
@dry-analyst-73584 has been running into this problem frequently too
Could it be a space leak by the engine?
r
But I just restarted, and the disk has enough space left
🤔 1
w
"space" here being virtual memory.
👍 1
LMDB is very virtual memory heavy.
but unless there are orphaned processes, it shouldn't be held.
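To make the virtual-memory point concrete, here is a minimal sketch using the Python `lmdb` bindings (illustration only; the Pants store opens LMDB from Rust, but the mapping behaviour is the same): each open environment memory-maps `map_size` bytes of address space up front, so N leaked stores reserve N × map_size of virtual memory even while the data files on disk stay tiny.

```python
# Illustration with py-lmdb (assumed installed); the real store is opened from Rust.
import lmdb

# Opening an environment reserves ~map_size bytes of *virtual* address space,
# regardless of how little data is actually written.
env = lmdb.open("/tmp/example-lmdb-store", map_size=1 * 1024 ** 3)  # 1 GiB mapping
with env.begin(write=True) as txn:
    txn.put(b"key", b"value")  # on-disk usage stays in the KB range
env.close()  # closing releases the mapping; a leaked environment does not
```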
r
That would be checked by a good ol’ `ps aux | grep pants`, right?
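A slightly richer check than `ps aux` (sketch only; assumes the third-party `psutil` package) lists any surviving pants processes together with their virtual memory size, which is the "space" those LMDB mappings consume:

```python
# Sketch: list lingering pants/pantsd processes and their virtual memory footprint.
import psutil

for proc in psutil.process_iter(attrs=["pid", "name", "cmdline"]):
    try:
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "pants" in (proc.info["name"] or "") or "pants" in cmdline:
            vms_gib = proc.memory_info().vms / 1024 ** 3
            print(f"{proc.info['pid']:>7}  {vms_gib:6.1f} GiB virtual  {cmdline[:80]}")
    except psutil.Error:
        continue  # process exited or access was denied; skip it
```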
w
@red-balloon-89377: are you able to try with 998ffbb7c86ca33a2156b002a9f18eb6f1e425b8 reverted? ... might not be a clean revert, unfortunately.
👍 1
h
Also you’re using the V1 test runner and getting this? Huh. I think I’ve only gotten it using the V2 test runner (although this may be availability bias - I only ever use V2 now)
w
my guess is that within a single run schedulers are leaking, such that we have N stores open
👍 1
r
@hundreds-father-404 This is the command line, so I think I’m using v1:
./pants test tests/python/pants_test/backend/project_info/tasks:test_export_dep_as_jar
@witty-crayon-22786 trying now…
h
+1 that that PR is likely culpable
w
@hundreds-father-404: did that change end up being necessary...?
h
Yes, it was. It was the only way we could land the Pytest upgrade, due to that `ZipError` issue John ran into last year
We needed the Pytest upgrade so that we could use `pytest-cov` to retry flakes
w
oof. k.
r
Still breaks
I’m going to try to restart my computer and try then
👍 1
h
Another thing you can try is using passthrough args via `--pytest-args='-k test_foo'`
r
But I want to run every test in that file (I’m doing refactorings, so I want to constantly check that I’m not breaking anything)
Okay, reverting 998ffbb7c86ca33a2156b002a9f18eb6f1e425b8 and restarting works for now; will report if things break
It works even without that commit reverted, so I’m truly baffled now
😕 1
This has been happening again (without the revert) every 3rd or 4th run of the suite. Going to try with the revert
h
Revert locally or revert in CI? Reverting in CI won’t work cleanly because we would have to roll back the Pytest upgrade
r
Locally
👍 1
It’s a pain to maintain the revert in the branch, but if it unblocks things, it’s worth it 🙂
h
I wonder what it will take to fix the issues. Stu’s intuition sounds right that we’re not cleaning up the scheduler after every test properly
r
Yeah, that correlates with the observations. However, I don’t know where we would keep them around between runs, because currently I’m not using pantsd at all, so all the memory should be freed
h
An important detail is that Pytest runs tests sequentially. Once a single test finishes, it’s supposed to be cleaned up and only then does the next test start. This implies that the issue is cleanup, rather than too many tests setting up schedulers at the same time
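If that cleanup hypothesis is right, the fix amounts to guaranteeing a teardown per test. A hedged sketch of the pattern (the `Scheduler` class and its `close()` method here are hypothetical stand-ins, not the real Pants test-base API):

```python
# Hypothetical sketch of per-test scheduler cleanup; `Scheduler`/`close()` are
# stand-ins for whatever the Pants test base actually constructs per test.
import pytest


class Scheduler:
    """Stand-in for a native scheduler that opens an LMDB-backed store."""

    def __init__(self) -> None:
        self.closed = False  # pretend this opened a store and mapped memory

    def close(self) -> None:
        self.closed = True  # pretend this releases the store and its mappings


@pytest.fixture
def scheduler():
    s = Scheduler()
    try:
        yield s
    finally:
        # Without this teardown, each sequential test leaks one open store,
        # which matches the "N stores open" guess above.
        s.close()


def test_uses_scheduler(scheduler):
    assert not scheduler.closed
```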
r
But then the behaviour of “once it fails, it fails until restart” is not explained; it should either always fail or not fail at all
As in, it seems pretty deterministic
👍 1
h
Yes I think the bad state is being persisted across pants tests and even pants runs, until you force a clean via restart
r
Yeah, so we are back at “where do we keep that state”. I don’t think it’s lmdb, and also probably not a process in the OS. Actually, let me check that last thing
👍 1
yeap, no, we don’t keep any process named “pants” around
h
I've been noticing this as well
what I'm seeing on my linux machine is that my /tmp dir is getting filled up with pants-related files
and also the amount of space allocated to /tmp by default is actually fairly small, 4 GB on my system
but yeah I have a bunch of `process-execution-<random>` directories in /tmp
if I wipe them out it goes away, but this is the 2nd time today I've done that
👍 1
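A quick way to see how much of /tmp those leftover directories are taking (sketch only; assumes they all match the `process-execution-*` naming, and that deleting them is only safe while no pants run is active):

```python
# Sum the size of leftover process-execution-* workdirs under /tmp.
from pathlib import Path

TMP = Path("/tmp")
total = 0
for d in sorted(TMP.glob("process-execution-*")):
    size = sum(f.stat().st_size for f in d.rglob("*") if f.is_file())
    total += size
    print(f"{size / 1024 ** 2:8.1f} MiB  {d}")
print(f"{total / 1024 ** 3:8.2f} GiB total")
```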
r
It just happened again, tried wiping /tmp, but no such luck.
h
are you using Pantsd?
d
I ran into this again yesterday 😞
r
are you using Pantsd?
No 😞
h
I've just had this happen when running integration tests in the pants repo: the outer test run is fine, but the underlying pants runs that the tests invoke get this error, and so the tests fail. But then running some non-integration tests continues to be fine.
Also, running pants in another repo continues to be fine.
h
so, I'm looking into this now, and I can't seem to replicate the "no space left on device" error that we were all seeing ~2 weeks ago
I am seeing my `/tmp` directory fill up with tmpdirs as tests run, to a maximum of 15% or 25% full depending on which tests I'm running (my /tmp file system is 4 GB)
I'm also not seeing those `process-execution-<random>` directories show up, so maybe that specifically was the problem
and it looks like pants is cleaning up /tmp dir files even if I kill the tests with ctrl-C
it looks like some nailgun code is invoking a method that creates tmpdirs with `process-execution` as its prefix; maybe the problem was only ever nailgun-specific tests?
h
I think @dry-analyst-73584 encountered it when running Python tests that didn’t use nailgun. Do we only use nailgun for JVM-related tests?
h
that's just a guess on my part. I'm looking at the code that creates tmpdirs called `process-execution-<random>`, which isn't actually just nailgun, my mistake
(specifically in `process_execution/src/lib.rs`, in the function `run_and_capture_workdir`)
👍 1
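For illustration only, a Python analogue of what that Rust helper does (the real implementation lives in `process_execution/src/lib.rs`): it creates a workdir named `process-execution-<random>` under the system temp dir, and those directories accumulate unless something explicitly removes them.

```python
# Python analogue of the Rust workdir creation, purely for illustration.
import shutil
import tempfile

workdir = tempfile.mkdtemp(prefix="process-execution-")
try:
    print(f"running a hypothetical process in {workdir}")
finally:
    # Skipping this cleanup on every execution is exactly what fills up /tmp.
    shutil.rmtree(workdir, ignore_errors=True)
```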
h
https://github.com/pantsbuild/pants/pull/8621 is likely very relevant if you haven’t already looked at it. I think it results in setting up far more schedulers than we did before. Unfortunately, we can’t revert this PR because we need it to use modern Pytest
h
I'll take a look
but either way, I can't seem to replicate the problem on my own machine right now
and I definitely was seeing it a while ago
h
fwiw, I haven’t encountered it in the past 3 days
h
oh I remember this commit, yeah
this was definitely a problem ~2 weeks ago, and if neither of us has seen it in the past 3 days, maybe it got fixed by some recent commit? Of course, it would be good to know which commit that was, if that's the case