# general
a
Ugh. Is there any magic bullet for reducing pants' memory footprint? Doing a
pants --changed-since=$(git merge-base HEAD origin/master) --changed-dependents=transitive list
takes 23GB of memory in our repo 😞
😮 1
Smart solutions, like reducing the number of targets, aren't particularly useful, unfortunately; at least short term we're stuck here.
c
I’m not experienced in this area, but just out of curiosity how many targets do you have, roughly?
a
Many 😞
c
Also, this is a topic @flat-zoo-31952 has been wrestling with, I believe...
a
```
$ pants list :: | wc -l
28785
```
👍 1
So we're trying to upgrade to 2.21, and we have to move to lockfiles/resolves. But, we have to have basically the same resolve stuff 6 times. 😞
💯 1
Until we can either merge the dependencies or take the time to split them better...
It might be only tests that get one target per resolve, though, hm
Before adding the lockfiles, we had 10943 targets, and the same command peaked at 7GB of memory.
On 2.17
And, yeah, Josh and I talked about large graphs and other similar stuff in the past 🙂 back then, there wasn't really any solution to this.
f
I'm not really sure. I don't think the Pants project is in a position to fix these things tbh. I took a look at what it would require, and while it's probably fixable, it would require a ton of work in Pants core, and most of the devs who have the biggest knowledge there are focused on other projects at this point.
a
It's a bit sad to say "oh, go monorepo. but not too big" 😞
f
To be really honest, I've essentially decided that Pants' fine-grained dependency model is probably bad for large monorepos. Even if the perf/memory issues were fixed, I'm not sure it maps to how humans think about projects anyway. The number of vertices in the graph just doesn't fit in any normal person's head, which can make your code architecture really difficult to reason about. In our current work monorepo we're trying to adopt a coarse-grained dependency model that explicitly pushes developers to think about libraries, applications, artifacts and boundaries, but without forcing a split to a separate repo. Seems to be more like what JS monorepo tools do.
Pants is still used and will probably continue to be used, but probably not for every project, and certainly not as a single Pants install in the repo root
a
The problem is that we do define deps at a library level, but pants expands this to "oh, library X? you mean, files X-Y from library X, amirite?"
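(For context, a minimal sketch of the expansion being described; the paths and target names here are made up for illustration. A dependency declared on a library's `python_sources` generator is still expanded to its per-file generated targets internally:)
```python
# libs/X/BUILD — the "library": one target generator owning many files
python_sources(name="X")

# apps/app/BUILD — the dependency is declared at library level...
python_sources(
    name="app",
    dependencies=["libs/X:X"],
)
# ...but Pants materializes one generated target per source file
# (libs/X/foo.py:X, libs/X/bar.py:X, ...) and builds its internal graph
# at that per-file granularity — the fine-grained expansion being discussed.
```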
f
Yeah this is what I mean by fine-grained dependencies
It's a fundamental design decision in Pants that I'm pretty sure just does not scale well
a
Yeah, but what I mean is we don't use inference or anything, we actually do define these correctly.
f
Or if it does scale, it only does so in specific circumstances
It doesn't matter whether you use inference or not, Pants needs to build a graph for every single target
And it uses its own execution graph engine to build that dependency graph, and that's where the memory usage is
a
Yeah, but for inference, I'd understand the per-file thing. And it would definitely help us in many places, to prune the graph
but it'd only reduce the number of edges, not nodes
f
Pants wasn't built to not use inference
The option exists, but it doesn't result in any performance improvement
If your deps are all spelled out correctly manually, it seems like you could probably move to Bazel easier than most (although it is my understanding that Bazel hits this memory problem at some point too)
a
I think we're quite locked in to pants, even if we need 25GB of memory to run a fmt 🙂
If for no other reason than nobody in management will approve of us switching to another build system, wasting months of developer time on something that can be fixed by throwing more money at hardware
f
There's also Polylith https://pantsbuild.slack.com/archives/C046T6TA4/p1715508080981669 to look at, which I think uses the operating model I describe
Yeah for us, we have always been in this weird position of not really using Pants to its potential because we consume deps from RPM repos, not from PIP repos
So it's a lot easier to phase it out as we work on new tooling
a
Obviously, I couldn't even get my company to sponsor pants (we got acquired just as we were about to sign, and then... the mothership is ignoring us when we ask if we can get the budget for it 😞 ), so it's not like I can advocate for any change to this, but yeah
😕 1
f
We're finally doing a long-awaited great refactoring, so doing a bit of homespun build tooling is part of that
a
I know this is completely unrelated, but I'm not sure at what point people started thinking creating the virtualenv in your project's root is okay, but I hate it 😛
And, heh, I'm not sure if my alcohol fuelled yak shaving drove me to too much alcohol, but that combo won't actually fix any of the issues we're having
f
I yak-shaved on this myself quite a bit, ended up:
• trying to rewrite a bit of https://github.com/pantsbuild/pants/blob/1d62bbaf7ef8c5b2a8821a9af0ca519d590010b1/src/python/pants/engine/internals/graph.py#L4 but too much depends on it
• adding GC to the internal graph but I don't know rust well enough
• trying to create a benchmark repo for reproducing these issues but it really was never as bad as what I was seeing at work
What it comes down to is not having the time or expertise to solve or even really debug these problems
w
```
$ pants list :: | wc -l
28785
```
So, just under 1MB per target? That sounds like… a lot... @flat-zoo-31952 Were you seeing numbers like that too?
Ah, wait, conflating two things - the "transitive" call was what brings you up to 23GB?
a
Yes
w
And how long does the call take via `time` or similar?
a
Hm, give me a few minutes, it takes 2 minutes on the 28k one
It's a bit misleading though, since the 28k one errors out
w
Okay, and how are you tracking memory usage of the call (so I can test on some repos I have)?
a
ah, well, I noticed this on circleci, it OOM'd on a 16GB docker container, then I looked at Activity Monitor on my mac and htop.
👍 1
I can try it on a linux laptop if it matters, but I doubt it'll be different.
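(For anyone wanting a number that's easier to compare than eyeballing htop, here's a minimal sketch, stdlib only, for Linux/macOS. It assumes `--no-pantsd` so the memory isn't hiding in the daemon; the pants arguments are just the example from this thread.)
```python
# Hedged sketch: report peak memory of a pants run (standard library only).
# ru_maxrss covers children the interpreter has waited on, so run pants
# with --no-pantsd or the daemon's memory won't be counted.
import resource
import subprocess
import sys

cmd = [
    "pants", "--no-pantsd",
    "--changed-since=origin/master",          # or $(git merge-base HEAD origin/master) as above
    "--changed-dependents=transitive", "list",
]
subprocess.run(cmd, check=False)

usage = resource.getrusage(resource.RUSAGE_CHILDREN)
scale = 1 if sys.platform == "darwin" else 1024   # macOS reports bytes, Linux KiB
print(f"peak child RSS: {usage.ru_maxrss * scale / 2**30:.2f} GiB")
```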
anyway, it took 1:20 for the 11k targets one
w
And silly question, but relevant only because I think someone else had a similar issue a couple weeks ago, are there any symlinks in the repo?
a
I doubt it
Nope, no symlinks
w
Dag... This was what I was referencing, was hoping you just caught a stray like this thread https://pantsbuild.slack.com/archives/C046T6T9U/p1718188313286379
a
I can tell you, over the past 2 years, as we've upgraded pants, the memory usage went up significantly. I think it was 2.9->2.14 when we saw it, but yeah. We did increase the number of targets by a lot, but the memory increase wasn't linear, that's for sure
Tomorrow, I can try to run this on the repo from 2 years ago; I'm 100% sure it won't go above 2-3 GB
w
👍 Trying to find some comparables, but I'm not going much past 3-4GB on a repo with about 1/3rd the targets
So, yeah, definitely feels non-linear
Hmm, interestingly though - when I'm running in WSL, I seem to get multiple pantsd instances for the same repo - will need to dig into that, as they are not multiplying memory usage - but there are multiple PIDs
a
the behaviour doesn't change with --no-pantsd
w
Yeah, I figured - I wonder if this is a pants or wsl artifact - multiple PIDs, but I kill one, and it kills all - and we seem to only be using 1 Pant's worth of memory
a
I can ask a coworker to run this in WSL, if you think it'd give you any useful info 🙂
w
nah, there is something deeper going on - I just had a WSL sitting around
How many resolves/deps are you grabbing as well?
a
Hm, we have.... 21 resolves, 315 lines in our main requirements.txt
Not sure how to answer that exactly
13 if we exclude the tools resolves, btw
w
👍 And if you're able to succeed in running the command (big oof here):
```
pants --stats-memory-summary {the command}
```
I think an anonymized version of that information might be useful?
a
Where does that write the info?
w
it should write it to the command line,
a
Untitled
w
Bah, exception? Should look something like:
```
pants.git/call-by-name-cue-deb-make-sql % pants --stats-memory-summary list :: | wc -l
17:38:20.85 [INFO] Memory summary (total size in bytes, count, name):
  48            1               pants.backend.javascript.package_json.AllPackageJsonNames
  48            1               pants.backend.project_info.filter_targets.FilterSubsystem
  48            1               pants.backend.project_info.list_targets.List
  48            1               pants.backend.project_info.list_targets.ListSubsystem
  48            1               pants.backend.python.dependency_inference.subsystem.PythonInferSubsystem
  48            1               pants.backend.python.goals.lockfile.PythonSyntheticLockfileTargetsRequest
  48            1               pants.backend.python.subsystems.setup.PythonSetup
  48            1               pants.backend.scala.subsystems.scala.ScalaSubsystem
  48            1               pants.backend.scala.subsystems.scala_infer.ScalaInferSubsystem
```
a
Let me try it on the 2.17 one, the exception... gonna take a while to fix
Untitled
w
Whoa That's... a lot of `AddressInput`s? I'll take a look at this trace (probably tomorrow) and compare against some of my repos, but it's good to see where a bunch of the memory goes.
```
88377576		1052114		builtins.AddressInput
```
Would you call this a regression in memory, then? As in, the same repo, same command works on 2.17, but falls over on 2.21?
f
It’s been a while since I looked at this last, but 0.5-1 MB per target in the repo sounds about right. IIRC it reduced to anything hitting `AllDependenciesRequest` or something like that. So `pants dependencies --transitive ::` was one of the simpler ways to trigger that
I’m fairly certain the memory overuse is a bug of some sort, but I was never able to isolate what triggers it
For us the memory usage just gets kinda steadily worse each Pants version (and steadily worse as our repo grows in number of targets)
w
I can understand it getting linearly worse with number of targets (and if they’re tightly coupled, maybe the dependency graph gets nutty), but getting worse each Pants version isn’t cool. That’s something we should be watching for regressions on (including perf).
```
ValueError: The explicit dependency requirements/python#connexion of the target at aio/aio/__init__.py:../lib does not provide enough address parameters to identify which parametrization of the dependency target should be used.
```
The first trace actually had what seems to be a genuine error in it?
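(For reference, a hedged sketch of what that ValueError is asking for: when a `python_requirement` exists under more than one resolve, an explicit dependency has to pick a parametrization via the address suffix. The target names and resolves below are hypothetical, not the actual fix for this repo.)
```python
# aio/BUILD — hypothetical example, not the real repo layout
python_sources(
    name="lib",
    resolve="global_2",
    dependencies=[
        # instead of the ambiguous "requirements/python#connexion":
        "requirements/python#connexion@resolve=global_2",
    ],
)
```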
h
The dep granularity feedback would be really valuable on the new python backend wishlist
s
Since you mentioned 21 resolves https://github.com/pantsbuild/pants/issues/20568
💯 1
h
Feels like the graph representation should be rewritten to be far more compact
I mean, that information is a strict subset of the information in the source files, which presumably are a lot smaller than 23GB...
So something is gratuitously wasteful here
💯 2
w
Without having dug too far into it yet, I'm curious if the problem is more related to what Gregory linked to - rather than strictly first-party sources. Even on some of the larger projects I've worked on, with lots of first-party code (only), I haven't noticed anywhere near 23GB. But, start throwing in some AI libs, and if those are pulled in process at any point, I can see everything hitting the fan.
a
We have inference turned off, so that shouldn't matter.
🤨 1
f
When I was investigating this, I remember looking at some code paths and noting that inference is not really the problem. The code for building the python dependency graph is quite complex, and inference is really a small part of it. IIRC adding resolves had a huge impact on memory, but it's been so long since I've done this analysis it's hard to remember.
a
Also, not only does it spike at 23GB, but it's sloooooooow:
```
cbirzan@GP3CMXYV9V:~/PycharmProjects/cr_python$ pants --tag=-no_mono --changed-since=6ed4c2c0edf22f1e796f56dd57f0196113cbad5d --changed-dependents=transitive list
17:51:40.99 [INFO] Reading /Users/cbirzan/PycharmProjects/cr_python/.python-version to determine desired version for [python-bootstrap].search_path.
17:51:41.05 [INFO] Reading /Users/cbirzan/PycharmProjects/cr_python/.python-version to determine desired version for [python-bootstrap].search_path.
17:56:19.47 [INFO] Reading /Users/cbirzan/PycharmProjects/cr_python/.python-version to determine desired version for [python-bootstrap].search_path.
⠓ 1154.28s Map all targets to their dependents
```
w
You mentioned 28785 total targets - how many python files are in src in total?
a
```
$ find . -name '*.py' | wc -l
    7309
```
A few have the `no_mono` tag, but I guess those still go in for the graph.
Now it's just taking the piss...
```
18:19:15.53 [INFO] Filesystem changed during run: retrying `@rule(pants.backend.project_info.dependents.map_addresses_to_dependents())` in 500ms...
```
almost 30 minutes in, heh
okay, so, is there a way to upgrade pants without using lockfiles? 😄
f
I've wondered if the retries are where the memory leak is coming from
a
it's at around 11-13GB now
h
This seems pathologically wrong
a
That's the feeling I've been getting working on this PR 😄
h
@ancient-france-42909 this debugging is golden! I will dive into improving this if you can help come up with a smoking gun
I have a feeling something very stupid is happening under the covers
w
^^ Exactly, this seems so wildly off-kilter that there must be some sort of redundant work, recursion, or something equally wild. I feel like some sort of 3rd party deps are coming into proc unnecessarily, and those are big enough to easily wipe out your process memory. I was testing on Josh's clam diggers repo, and while it took a lot of clock time, I didn't see any particular crazy memory spikes
Or like... over memoization - like somehow we're re-memoizing the same data multiple times
f
It would be nice to be able to trace requests through the engine
Just need to wire it all up to a Prometheus backend 😄
🔥 1
But seriously, the Pants execution engine is complex and asynchronous enough that it has a lot more in common with a distributed system than a conventional program, at least when you're trying to understand things at this level of detail. Reading `-ldebug` logs didn't really get me anywhere, just because so many log statements are really really generic, so you can't really trace what's happening with a particular request
I too have a feeling something dumb is happening under the hood, but I was never really able to isolate it, and without publicly available reproducers, none of the other contributors can either.
"Clam Diggers" is really simplistic. Maybe activating resolves or adding some real 3rd party dependencies might induce the behavior.
w
Yep, that's what I was trying. First get a baseline, then start making it more complicated piece by piece. And then I fell asleep waiting for it to finish
😩 1
a
ugh, to get the apple profiler you need an apple id
Well, that's awkward. I added another resolve and now it errors out in 3 minutes 😄
w
Was it seqeval?
a
nah... So, the way I'm going about this: we have, basically, 6 sets of resolves that matter. We have the base one (`global`), that has the overwhelming majority of packages, and then a few overrides for specific libs. We're in the process of upgrading to `connexion` 3, so we have a `connexion_3` resolve that has the connexion 3 version, and also includes `global`. We then have a `global_2` resolve, that has `connexion` 2 inside. And all apps that use `connexion` 2 use `global_2`. So we have a default `resolve=parametrize('global', 'global_2', 'connexion_3')` at the top level. Anyway, I had forgotten to add one of these resolves to the parametrize, and it was taking 30+ minutes to print out the error message that it can't find a library (which, btw, is very confusing).
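(A minimal sketch of the parametrized-resolve field being described, using the resolve names from this thread; the top-level default they mention would set the same field via a BUILD-file default rather than on each target.)
```python
# BUILD — hedged sketch; real targets and names will differ
python_sources(
    name="lib",
    # the library is built once per resolve listed here; leaving a resolve
    # out of the parametrize is what produced the slow, confusing
    # "can't find a library" failure described above
    resolve=parametrize("global", "global_2", "connexion_3"),
)
```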
oh, wow, it worked.
It might be related to trying to find deps that don't exist, will test; let me run without pantsd to see how long it takes.
🤔 1
Okay, about 3 minutes, that's not too far off from how much it took before.
f
a
I'm not saying that's generally the issue, but it's suspicious it went from 30+ minutes to 3 when I fixed that
f
Oh yeah... I was just wondering about the memory consumption
a
Also, weirdly... When I tried to generate the lockfiles for all the resolves, it took 15 minutes, but each individual one takes ~3-4 minutes
oh, let me check memory
Peaks at 13, now it's down to 9 and still going down; before, it would go to 25 within a few seconds.
this is getting ridiculous
okay, let me try to break it
Okay, so, I reverted to a version where it cannot find dependencies because of resolves not being correct, and it's slower and uses more memory.
but these runs always error out
f
@fresh-cat-90827 maybe this could be related to our fake 3rd party deps
f
Catching up on the thread
f
The idea is that it's possible that missing dependencies could be a reason for extra memory usage. In George-Cristian's case they result in an error, but it's possible that ours don't, since we never really use those faked deps for anything anyway, but perhaps resources go towards trying to figure them out anyways.
a
When I say missing deps, I mean they exist in a different resolve than the one attributed to a python_sources target.
This is really weirdly inconsistent in the time it takes... On circleci it was taking 45 seconds to fail, vs 2 on my laptop, now it's taking 3 minutes to finish on my laptop but took 10 on the CI
But, yeah, with missing deps (pic 1), and without missing deps (pic 2)
Yeah, hm. Changed something and now it cannot figure out which of the 10 resolves it must use, memory went back up...
w
So, trying to grok the last few messages - you get an OOM when Pants is unable to figure out which resolve? requirement? to use - but when everything is using the correct resolve, no OOM?
a
I mean, it doesn't OOM, both my laptop and the CI have 32GB of memory
or, well, I have 36 because macs.
w
OOM ==> "memory explosion"?
a
Yeah
And it's slower too
But, now at least I have something to profile 🙂
🎉 1
It's somewhat interesting, but also misleading. Most of the time is spent in python, which is not necessarily unexpected, but at least something. Also, misleading because half of the python time is spent waiting for the GIL, heh.
If py-spy is to be believed, this is kind of how it looks:
Recording gives me something else...
py-spy seems to be unable to keep up at even 50 snapshots per second towards the end... And it slows down things a loooot
I captured about 10 minutes from both, but since they didn't finish...
Gonna let it finish and go write some feedback for the new python backend :)
```
py-spy> 52.45s behind in sampling, results may be inaccurate. Try reducing the sampling rate
```
been running for more than 30 minutes, heh
Speedscope trace of the slow one, 30 samples per second.
I think the very low sample rate makes these kind of useless. And I also think the high number of threads screws it up, would setting the max parallelism variable limit that?
f
I wonder if one of the things we could try would be to eliminate our "faked" dependencies and then allow uninferrable deps again and see if that has an effect
a
I might try to reproduce this, but this is... not a priority for us right now, so can't really spend that much time on it 😞 Plus, it turns out moving to resolves is a pain in the ass, I have to basically annotate every lib by hand 😞
so it being slow is not that important, the fact that I have to spend days on this is a bigger issue, heh
😩 2
@happy-kitchen-89482 Btw, is there any timeline for a new python backend? We were discussing this today, talking about whether it's worth spending the time to fix the issues we have, if we're gonna have to upgrade again and maybe have to re-do a lot of the work then
h
No timeline that you can rely on, no
What are the issues you're looking at?
a
Our biggest one is resolves being stupid to work with, and that type stubs thing, but that's easy to fix with, mostly, some sed magic. My main problem now is migrating a repo from no resolves to resolves, and it's just a game of whackamole. Oh, this is from the generic resolve, this is the generic without what we have overrides for, and then doing this for literally every target myself.
💯 1
💢 1
90% of the problem would go away if the default resolve would mean that any target with that resolve would use it, if it works with all its dependencies.
h
Not quite grokking this, can you give an example?
a
Okay, imagine you have an app that needs a new version of lib X, but the rest of your codebase cannot use it. In non-resolves pants, you can just add an override and it Just Works. In the new one, you have to make 3 resolves: 'generic without X', 'X old version', 'X new version'. You can have more than one requirements per resolve, so that's okay, but it's still annoying that you have to refer to it as `whatever:something-else#X` and `whatever:something-else2#X` vs `whatever:python#X`, since you cannot have more than one requirements with the same name, even if they're in different resolves, but that's okay-ish. But now comes the hard part: let's say a library depends on `X`. All your libs have to have a resolve of `generic` if they don't depend on `X`, and then `generic-with-old-X` if they depend on the old one, `generic-with-new-X` if they depend on the new one.
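(A hypothetical requirements BUILD file illustrating the naming constraint just described: two pins of the same lib live in different resolves but still need differently named `python_requirements` targets, so the dependency addresses end up looking like `whatever:something-else#X`. File names and resolve names here are illustrative.)
```python
# requirements/BUILD — made-up names and files for illustration
python_requirements(
    name="python",                       # -> deps look like requirements:python#X
    source="requirements.txt",           # pins the old version of X
    resolve="generic-with-old-X",
)
python_requirements(
    name="something-else",               # same lib, newer pin, different resolve,
    source="requirements-new-x.txt",     # but the target name must still be unique:
    resolve="generic-with-new-X",        # -> requirements:something-else#X
)
```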
And on a large repo, this becomes tricky, you can't even use sed or something to do it, you have to parse the AST of `BUILD` files and write a new one, since formatting can sometimes put function calls on the next line, and that's just painful (not that I agree with very long lines 😄 )
Back when we upgraded through the version that made it so you have to put `conftest.py` in another target, I did just that, but it was painful
Was that okay, or should I try to give an example? 😄
s
I just read through this thread, and +1000 to @ancient-france-42909, I arrived at the same thought myself a bit ago. To have a "neutral" resolve or `None` resolve, so that a target could be shared between other resolves, would be so helpful. https://github.com/pantsbuild/pants/issues/21194#issuecomment-2253596097
https://pantsbuild.slack.com/archives/C046T6T9U/p1719940826887709?thread_ts=1719856232.324909&cid=C046T6T9U Is there any good way to debug something like this? After enabling resolves, `Map all targets to their dependents` now takes 20+ minutes 😕