# development
f
Hi everyone. I've got that performance scaling case up on GitHub now. https://github.com/NVIDIA/Dependency-Patterns/tree/main/scaling
pants list ::
takes about 2 minutes and reports 163,375 targets.
pants run :go
(which is just a verb I picked, it has nothing to do with the "go" language or tools) runs for days and grows to fill all available memory until IT kills the process for hogging farm resources.
👍 1
This is the case I was referencing on the other thread about free-threaded Python.
I suspect it might be an issue specific to the adhoc_tool backend, but it might also be in the transitive dependency logic that's used across the board.
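For anyone who hasn't opened the repo: the per-directory wiring is built on the adhoc_tool backend, and :go at the root is just the entry point that pulls on all of it. Very roughly, each directory looks something like this sketch (names, fields, and layout are made up for illustration, not copied from the actual BUILD files):
```python
# BUILD -- illustrative sketch only, not the repo's actual files
shell_source(name="script", source="build.sh")

adhoc_tool(
    name="step",
    runnable=":script",                       # assumes a shell_source is runnable here;
                                              # a python_source etc. would work the same way
    execution_dependencies=["../libA:step"],  # upstream directories' steps
    output_files=["out.txt"],
)
```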
w
That's.... That's a lot of targets. Is that the correct number of targets? Or is it adding spurious ones?
f
This is a real build DAG from our current system at the directory level. Obviously the work in each directory is synthetic and hugely simplified over the real thing.
w
👍
h
What is the hardware this is running on? I get 20-30s for
pants list ::
in that repo
still more than I would like
I would like to reimplement all the graph stuff in rust
f
I'm guessing the
pants list ::
might be IO bound and I'm on NFS storage.
h
I'm on macOS on a 2022 macbook pro, for reference
f
I'm on whatever VM IT feels like provisioning for me. 🙂
h
The
pants run
cmd is running for a few minutes now, I don't see a memory leak, but you said it takes days so maybe it's too slow to notice yet
f
it's a slow grow, but I'd expect it to kick in around 10-15 minutes in.
h
I'll keep watching it
OK, I'm seeing it
So there are two issues: 1) memleak, 2) it takes days?
f
Well, 2 symptoms, but I'm guessing they're a single underlying issue.
h
Is there an issue open for this? Probably best to discuss there? it's less ephemeral, and this is a fun one
f
Not yet. I'll do that this afternoon after my meetings.
h
I want to figure out what the real solution is. Even if we fix the memory leak, I'm not sure the execution time will improve much when running ~160K processes on a single machine
So let's discuss on an issue (and I also want to be sure I understand what real thing these dummy processes are modeling)
OK, that leak started to run away quite rapidly and then things crashed
This is after ~45 minutes
So this may not be a memory leak per se, so much as the fact that the in-memory state of the build graph grows as rules execute, and this is a very large build graph. It would be interesting to see which data structures are involved.
f
FWIW, Bazel does not fork-bomb the machine and it completes in 9-10 minutes on my system.
Changing from
shell_sources
to
files
and running with
--no-pantsd
has made it get way further than it ever did before. Still using a ton of memory, but I grabbed a big machine in the farm, so that's ok.
2426815 svenier   20   0  317.0g  48.1g 335500 S 100.0   2.1  81:23.92 python3.11
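For clarity, the swap is literally just changing the owning target type in each BUILD file, roughly like this (a sketch, not the repo's real files):
```python
# Before (sketch): scripts owned by shell targets, which get shell-specific
# handling such as dependency inference on `source`d files
# shell_sources(name="scripts", sources=["*.sh"])

# After (sketch): the same files owned by plain `files` targets
files(name="scripts", sources=["*.sh"])
```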
would it be reasonable to change https://github.com/pantsbuild/pants/blob/06e2167524db3406c56b90081b48bc350eec6721/src/python/pants/core/util_rules/adhoc_process_support.py#L342 to only look at direct dependencies rather than transitive (on an opt-in basis)? I suspect that would really speed up my case (with no evidence).
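To be concrete about what I mean, the shape of the change would be roughly this (a sketch of the two request types, not the actual code at that line; `tgt` stands for whatever target the rule is processing):
```python
from pants.engine.rules import Get
from pants.engine.target import (
    Dependencies,
    DependenciesRequest,
    Targets,
    TransitiveTargets,
    TransitiveTargetsRequest,
)

# Today (roughly): resolve the whole transitive closure and materialize its
# sources into the sandbox.
transitive = await Get(TransitiveTargets, TransitiveTargetsRequest([tgt.address]))

# Hypothetical opt-in alternative: only the direct dependencies.
direct = await Get(Targets, DependenciesRequest(tgt[Dependencies]))
```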
w
Not speaking to the adhoc thing, but Benjy updated the docs and the next version is here re: call-by-name: https://github.com/pantsbuild/pants/blob/main/docs/docs/writing-plugins/the-rules-api/rules-and-the-target-api.mdx
gratitude thank you 1
f
in that page the call to resolve_targets has the arguments passed inside the call to implicitly() instead of before it. Is that intentional?
hmm.. I see it elsewhere on the page too, so I'm going to assume it is correct. It just looks strange to see implicitly used differently in this one case.
w
There are different
implicitly
use cases depending on what’s being ported over. There are 4 modes of “Get” syntax that were ported over. But, it’s also very 80/20 on what gets used where. https://github.com/pantsbuild/pants/blob/06e2167524db3406c56b90081b48bc350eec6721/src/python/pants/goal/migrate_call_by_name.py#L382 I think Benjy might have mentioned it somewhere in the docs, but that reminds me, it should be front and centre as a doc comment
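Roughly, the two shapes in question look like this (from memory; the import path and argument types here are my assumptions, the docs page is authoritative):
```python
from pants.engine.rules import implicitly

# Shape 1: the request is passed explicitly, and implicitly() only fills in
# whatever else the callee needs from the rule graph.
targets = await resolve_targets(request, **implicitly())

# Shape 2 (the one that looks odd at first): the argument goes *inside*
# implicitly(), so the engine coerces it into whatever parameter type
# resolve_targets actually declares.
targets = await resolve_targets(**implicitly(request))
```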
h
Ah that is interesting. You could be spending unnecessary time materializing indirect dep sources into each sandbox, when you only need the sources of the immediate dep. In the general case we do need the transitive deps, as we don't know what the ad-hoc process intends to use, but it would make a lot of sense to optionally allow just direct dependencies if you know those are sufficient.
Although you do still need to run the codegen for the transitive deps.
f
Turns out that the system commands need the transitive deps. So we'd have to add some kind of flag for "stop here" to deps that we don't want to recurse on
Strike that. Turns out that the Cursor-generated anonymization script goofed up the path to the shell script. After fixing that, it looks like it's running just fine with direct dependencies only.
I'll be refreshing the branches once this finishes running.
ok, switching from transitive deps to direct deps took the runtime down from 7 hours to 1 hour, 11 minutes.
h
Nice!
I am going to hack on something today that I'm hoping will bring this down much further. I will provide more details once I have something promising.
w
@fast-school-44220 what’s the memory looking like by the end?
f
At the moment:
MAX MEM: 42945 Mbytes; AVG MEM: 40012 Mbytes;
but it hasn't been running for THAT long yet. I'll report back when it finishes.
oh, actually, maybe I have a run in a different directory... let me check...
This is with the files swap, but not the direct dependencies (in my non-anonymized directory, so I can't share it as is):
CPU time : 44110.98 sec.
Max Memory : 51362.64 MB
Average Memory : 49617.44 MB
w
It might cause everything to explode, but we have some options to emit stats to disk. I'd be curious to see that file. https://www.pantsbuild.org/stable/reference/subsystems/stats
I also can't imagine how big that file would be in the end
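Something like this in pants.toml turns them on; the option that writes the stats out to a file is on that reference page too (I won't guess its exact name from memory):
```toml
# pants.toml (sketch -- see the linked stats reference for the full option list)
[stats]
log = true             # log counters and histogram summaries at the end of the run
memory_summary = true  # log a memory usage summary at the end of the run
```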
f
heh. I've got to run for family stuff, but I'll try it out tomorrow.
w
👍
f
Yesterday's run did complete successfully: 3.25 hours, max memory 43922.92 MB. This was with files() rather than shell_sources(), and with only direct dependencies traversed and populated into the sandboxes. I'll push this to GitHub after lunch and then kick off the stats run.
one thing I'm not sure about now that I've read the stats page is how the cache has been impacting things. so that will be interesting to see.
when run against a fully populated cache, with stats enabled, it took 325 seconds
w
As in, the no-op (effectively) case was 325 seconds?
f
yup. Not sure how much of that was stats overhead
w
wow
That's something... It does hit upon something I was debugging yesterday, where pulling in the local cache was taking a bit longer than I expected (though, not sure exactly what numbers I expected - it just took "longer")
Granted, we're talking about orders of magnitude differences
f
looks like there's nothing proprietary in the stats file if you'd like to see it.
w
Yeah, if you can upload it, or if concerned, you can dm it to me
f
let me just get a quick 2nd set of eyeballs on it to make sure I didn't miss anything
👍 1
w
Out of curiosity - have you ever checked out what's in a sandbox when you run this? I don't know if it's relevant, but just wondering if there is extra materializing going on - other than all the other bottlenecks, obviously 🙂
f
certainly not exhaustively, but I did spot check a few when I was initially developing it. They're pretty minimal.
they should be even smaller, now that I've switched to direct dependencies only
w
👍 Figured, just checking. That was another thing I noticed in a very specific perf workflow I'm working on: it materializes more than we need
f
the shell_to_file branch of github.com:NVIDIA/Dependency-Patterns.git is now up to date with what I've actually been building.
c
(Just as a general perf tip that may or may not be applicable here: it helps when the sandbox dir, cache, and repo are on the same file system so that hardlinking is possible.)
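The relevant knobs live in pants.toml; pointing them all at the same filesystem as the repo is what makes hardlinking possible. A sketch (paths made up, option names as I recall them):
```toml
# pants.toml (sketch)
[GLOBAL]
local_execution_root_dir = "/fastdisk/pants/sandboxes"   # where sandboxes are created
local_store_dir = "/fastdisk/pants/lmdb_store"           # process results / file content
named_caches_dir = "/fastdisk/pants/named_caches"        # per-tool caches
```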
f
heh. well, I've taken that number down from 3 to 2, but I don't think it'll ever be 1
and really, if/when this hits deployment at scale it'll be running on a BB or similar farm.
what is the granularity of the cache entries? Just external process runs or are higher level things like entire rule invocations cached?
huh. I'm running again w/o the stats file and it's running longer. But this time I didn't run it to a logfile. Apparently the dynamic terminal I/O is causing some significant perf overhead?
w
Hard to say, but sometimes changing parts of the pants.toml or the command line args will trigger some re-work. I thought it was mostly related to startup, but I can't say crisply how limited that may be
f
I ran it again without stats and it was 336 seconds this time. So, the stats overhead is negligible; that difference is in the noise of our farm hosts.
h
FWIW I wrote a custom plugin that bypasses a lot of heavyweight Pants cruft that isn't really pertinent to this use case (e.g., fully materializing a BUILD graph), and I'm currently profiling it
More tomorrow or over the weekend
w
Did you have to monkey patch it in?
h
Not so far, it's a straightforward plugin.
w
Ah, I'm working on some perf plugins, and it's monkeypatch city over here
h
I am seeing some absolutely bonkers things in the CPU profiles. Will need to do this on a Linux machine to get native code in the py-spy profiles though; that isn't supported on macOS.
f
I'm happy to take a pre-alpha release and help with cleanup. Assuming you don't mind letting a perl guy at your code. 😁
h
f
Dude, rock and roll!
@happy-kitchen-89482 reading your custom_build plugin... the cache is doing some heavy lifting there. If two different directories depend on the same one, they're both going to try to build it and you're counting on the 2nd one hitting the cache. What if it's not done yet? Is there a race condition?
h
Just to clarify terms: there is the on-disk cache of process results, and there is in-RAM memoization of "tasks" (rule executions, more or less). The latter does not survive a pantsd restart; the former does. Which is why, as long as pantsd is alive, a no-op rebuild is 300ms, whereas after killing pantsd it takes 10 seconds to re-establish that in-memory state by re-executing all the rules but reading the process results from cache.
So anyway, what we're relying on is memoization. There is no race condition. If a task is requested multiple times while still in flight it is only executed once. If it's requested again after having been memoized, you get that memoized result.
👍 1
tokio futures FTW
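Conceptually it's the same trick as this toy asyncio version (illustrative only, nothing like the real Rust code):
```python
import asyncio

class Memo:
    """Deduplicates requests for a key: the computation runs at most once."""

    def __init__(self):
        self._tasks: dict[str, asyncio.Task] = {}

    async def get(self, key, compute):
        # The first requester creates the task; everyone else awaits the same
        # task, whether it's still in flight or already finished.
        if key not in self._tasks:
            self._tasks[key] = asyncio.create_task(compute(key))
        return await self._tasks[key]

async def build(key):
    await asyncio.sleep(0.1)  # stand-in for actually running a process
    return f"result of {key}"

async def main():
    memo = Memo()
    # Three concurrent requests for the same key: `build` only runs once.
    print(await asyncio.gather(*(memo.get("lib_a", build) for _ in range(3))))

asyncio.run(main())
```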
But presumably this is what you want?
Rules are supposed to be pure, with no side effects
f
Yes, exactly what I want
h
For example this means that if you run your codegen or preprocessor in order to do dep inference, it won't run again during the build
(if you had them be separate passes for some reason, but my point in that demo is that they probably shouldn't be)
f
hmm. My run crashed:
08:52:58.41 [ERROR] 1 Exception encountered:

Engine traceback:
  in `custom-build` goal

IntrinsicError: An error occurred when attempting to materialize a working directory at "/tmp/pants-sandbox-93xFX3": Was not present in the local store: Digest { hash: Fingerprint<83e9a74e8be0d776d3d741eb3c600beb6917365cec18fb4e3e6b20c0019ee778>, size_bytes: 32 }
That kind of reads like a race condition.
Though, I suppose it could be the result of running inside an NFS directory? (My workspace and ~/.cache, not /tmp)
I deleted ~/.cache/pants and now things work correctly. Must have just been corrupted from a previous killed run?
so the total from-scratch build was just under 14 minutes
h
Hmm, we see those in CI from time to time, they are not well-understood at present
Interesting that it is so much longer than on my local machine, but I'm guessing NFS is part of it?
f
Yeah, that's my assumption. I'm tempted to check it out into local storage in /tmp, but given that our real work has to happen in NFS land for Reasons, I'm not sure it's worth the effort.
@happy-kitchen-89482 could one conceivably make a
tailor
goal to generate your
build.json
files instead of
BUILD
as part of the
custom_build
plugin?
h
One could, but would one need to? What are the inputs from which these would be generated? In this contrived scaling example we have to get the dependencies from somewhere since there is nothing to infer them from, but in your real-world use case aren't the deps supposed to be inferred from the sources?
f
Yes, deps are generally inferred. But enumerating the list of sources could use the help.
h
I guess my point is that if the information needed to generate the build/BUILD files is already in the repo, then why take the trouble to write it out into a checked-in file? You could just infer it at build time
those auxiliary build files should be for things that can't be inferred and have to be checked in
tailor
is just an affordance for the fact that "traditional Pants" relied on the existence of some sort of stub BUILD file in various ways. But in a custom plugin you're not tied to that.
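i.e. inside the plugin you'd have something shaped like this, purely as an illustration (the `source`-statement convention here is invented, not taken from your setup):
```python
import re
from pathlib import Path

# Hypothetical convention: a line like `source ../libA/build.sh` in a
# directory's script declares a dependency on directory ../libA.
SOURCE_RE = re.compile(r"^\s*source\s+(\S+)", re.MULTILINE)

def infer_directory_deps(script: Path) -> set[Path]:
    deps = set()
    for match in SOURCE_RE.finditer(script.read_text()):
        deps.add((script.parent / match.group(1)).resolve().parent)
    return deps
```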
f
Hmm. That's a really good idea. Thanks.