# development
f
Hi everyone. I've got that performance scaling case up on GitHub now. https://github.com/NVIDIA/Dependency-Patterns/tree/main/scaling
pants list ::
takes about 2 minutes and reports 163,375 targets.
pants run :go
(which is just a verb I picked, it has nothing to do with the "go" language or tools) runs for days and grows to fill all available memory until IT kills the process for hogging farm resources.
👍 1
This is the case I was referencing on the other thread about free-threaded Python.
I suspect it might be an issue specific to the adhoc_tool backend, but it might also be in the transitive dependency logic that's used across the board.
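For anyone who hasn't opened the repo: the per-directory wiring is built on the adhoc_tool backend, and :go at the root is just the entry point that pulls on all of it. Very roughly, each directory looks something like this sketch (names, fields, and layout are made up for illustration, not copied from the actual BUILD files):
```python
# BUILD -- illustrative sketch only, not the repo's actual files
shell_source(name="script", source="build.sh")

adhoc_tool(
    name="step",
    runnable=":script",                       # assumes a shell_source is runnable here;
                                              # a python_source etc. would work the same way
    execution_dependencies=["../libA:step"],  # upstream directories' steps
    output_files=["out.txt"],
)
```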
w
That's.... That's a lot of targets. Is that the correct number of targets? Or is it adding spurious ones?
f
This is a real build DAG from our current system at the directory level. Obviously the work in each directory is synthetic and hugely simplified over the real thing.
w
👍
h
What is the hardware this is running on? I get 20-30s for
pants list ::
in that repo
still more than I would like
I would like to reimplement all the graph stuff in rust
f
I'm guessing the
pants list ::
might be IO bound and I'm on NFS storage.
h
I'm on macOS on a 2022 macbook pro, for reference
f
I'm on whatever VM IT feels like provisioning for me. 🙂
h
The
pants run
cmd is running for a few minutes now, I don't see a memory leak, but you said it takes days so maybe it's too slow to notice yet
f
it's a slow grow, but I'd expect it to kick in around 10-15 minutes in.
h
I'll keep watching it
OK, I'm seeing it
So there are two issues: 1) memleak, 2) it takes days?
f
Well, 2 symptoms, but I'm guessing they're a single underlying issue.
h
Is there an issue open for this? Probably best to discuss there? it's less ephemeral, and this is a fun one
f
Not yet. I'll do that this afternoon after my meetings.
h
I want to figure out what the real solution is. Even if we fix the memory leak, I'm not sure the execution time will improve much when running ~160K processes on a single machine
So let's discuss on an issue (and I also want to be sure I understand what real thing these dummy processes are modeling)
OK, that leak started to run away quite rapidly and then things crashed
This is after ~45 minutes
So this may not be a memory leak per se, so much as the fact that the in-memory state of the build graph grows as rules execute, and this is a very large build graph. It would be interesting to see which data structures are involved.
f
FWIW, Bazel does not fork-bomb the machine and it completes in 9-10 minutes on my system.
Changing from
shell_sources
to
files
and running with
--no-pantsd
has made it get way further than it ever did before. Still using a ton of memory, but I grabbed a big machine in the farm, so that's ok.
2426815 svenier   20   0  317.0g  48.1g 335500 S 100.0   2.1  81:23.92 python3.11
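For clarity, the swap is literally just changing the owning target type in each BUILD file, roughly like this (a sketch, not the repo's real files):
```python
# Before (sketch): scripts owned by shell targets, which get shell-specific
# handling such as dependency inference on `source`d files
# shell_sources(name="scripts", sources=["*.sh"])

# After (sketch): the same files owned by plain `files` targets
files(name="scripts", sources=["*.sh"])
```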
would it be reasonable to change https://github.com/pantsbuild/pants/blob/06e2167524db3406c56b90081b48bc350eec6721/src/python/pants/core/util_rules/adhoc_process_support.py#L342 to only look at direct dependencies rather than transitive (on an opt-in basis)? I suspect that would really speed up my case (with no evidence).
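To be concrete about what I mean, the shape of the change would be roughly this (a sketch of the two request types, not the actual code at that line; `tgt` stands for whatever target the rule is processing):
```python
from pants.engine.rules import Get
from pants.engine.target import (
    Dependencies,
    DependenciesRequest,
    Targets,
    TransitiveTargets,
    TransitiveTargetsRequest,
)

# Today (roughly): resolve the whole transitive closure and materialize its
# sources into the sandbox.
transitive = await Get(TransitiveTargets, TransitiveTargetsRequest([tgt.address]))

# Hypothetical opt-in alternative: only the direct dependencies.
direct = await Get(Targets, DependenciesRequest(tgt[Dependencies]))
```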
w
Not speaking to the adhoc thing, but Benjy updated the docs and the next version is here re: call-by-name: https://github.com/pantsbuild/pants/blob/main/docs/docs/writing-plugins/the-rules-api/rules-and-the-target-api.mdx
gratitude thank you 1
f
in that page the call to resolve_targets has the arguments passed inside the call to implicitly() instead of before it. Is that intentional?
hmm.. I see it elsewhere on the page too, so I'm going to assume it is correct. It just looks strange to see implicitly used differently in this one case.
w
There are different
implicitly
use cases depending on what’s being ported over. There are 4 modes of “Get” syntax that were ported over. But, it’s also very 80/20 on what gets used where. https://github.com/pantsbuild/pants/blob/06e2167524db3406c56b90081b48bc350eec6721/src/python/pants/goal/migrate_call_by_name.py#L382 I think Benjy might have mentioned it somewhere in the docs, but that reminds me, it should be front and centre as a doc comment
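Roughly, the two shapes in question look like this (from memory; the import path and argument types here are my assumptions, the docs page is authoritative):
```python
from pants.engine.rules import implicitly

# Shape 1: the request is passed explicitly, and implicitly() only fills in
# whatever else the callee needs from the rule graph.
targets = await resolve_targets(request, **implicitly())

# Shape 2 (the one that looks odd at first): the argument goes *inside*
# implicitly(), so the engine coerces it into whatever parameter type
# resolve_targets actually declares.
targets = await resolve_targets(**implicitly(request))
```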
h
Ah that is interesting. You could be spending unnecessary time materializing indirect dep sources into each sandbox, when you only need the sources of the immediate dep. In the general case we do need the transitive deps, as we don't know what the ad-hoc process intends to use, but it would make a lot of sense to optionally allow just direct dependencies if you know those are sufficient.
Although you do still need to run the codegen for the transitive deps.
f
Turns out that the system commands need the transitive deps. So we'd have to add some kind of flag for "stop here" to deps that we don't want to recurse on
Strike that. Turns out that the Cursor-generated anonymization script goofed up the path to the shell script. After fixing that, it looks like it's running just fine with direct dependencies only.
I'll be refreshing the branches once this finishes running.
ok, switching from transitive deps to direct deps took the runtime down from 7 hours to 1 hour, 11 minutes.
h
Nice!
I am going to hack on something today that I'm hoping will bring this down much further. I will provide more details once I have something promising.
w
@fast-school-44220 what’s the memory looking like by the end?
f
At the moment:
MAX MEM: 42945 Mbytes; AVG MEM: 40012 Mbytes;
but it hasn't been running for THAT long yet. I'll report back when it finishes.
oh, actually, maybe I have a run in a different directory... let me check...
This is with the files swap, but not the direct dependencies (in my non-anonymized directory, so I can't share it as is):
CPU time : 44110.98 sec.
Max Memory : 51362.64 MB
Average Memory : 49617.44 MB
w
It might cause everything to explode, but we have some options to emit stats to disk. I'd be curious to see that file. https://www.pantsbuild.org/stable/reference/subsystems/stats
I also can't imagine how big that file would be in the end
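Something like this in pants.toml turns them on; the option that writes the stats out to a file is on that reference page too (I won't guess its exact name from memory):
```toml
# pants.toml (sketch -- see the linked stats reference for the full option list)
[stats]
log = true             # log counters and histogram summaries at the end of the run
memory_summary = true  # log a memory usage summary at the end of the run
```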
f
heh. I've got to run for family stuff, but I'll try it out tomorrow.
w
👍
f
Yesterday's run did complete successfully: 3.25 hours, max memory 43922.92 MB. This was with files() rather than shell_sources(), and with only direct dependencies traversed and populated into the sandboxes. I'll push this to GitHub after lunch and then kick off the stats run.
one thing I'm not sure about now that I've read the stats page is how the cache has been impacting things. so that will be interesting to see.
when run against a fully populated cache, with stats enabled, it took 325 seconds
w
As in, the no-op (effectively) case was 325 seconds?
f
yup. Not sure how much of that was stats overhead
w
wow
That's something... It does hit upon something I was debugging yesterday, where pulling in the local cache was taking a bit longer than I expected (though, not sure exactly what numbers I expected - it just took "longer")
Granted, we're talking about orders of magnitude differences
f
looks like there's nothing proprietary in the stats file if you'd like to see it.
w
Yeah, if you can upload it, or if concerned, you can dm it to me
f
let me just get a quick 2nd set of eyeballs on it to make sure I didn't miss anything
👍 1
w
Out of curiosity - have you ever checked out what's in a sandbox when you run this? I don't know if it's relevant, but just wondering if there is extra materializing going on - other than all the other bottlenecks, obviously 🙂
f
certainly not exhaustively, but I did spot check a few when I was initially developing it. They're pretty minimal.
they should be even smaller, now that I've switched to direct dependencies only
w
👍 Figured, just checking. That was another thing I noticed in a very specific perf workflow I'm working on: it materializes more than we need
f
the shell_to_file branch of github.com:NVIDIA/Dependency-Patterns.git is now up to date with what I've actually been building.
c
(Just as a general perf tip that may or may not be applicable here: it helps when the sandbox dir, cache, and repo are on the same file system so that hardlinking is possible.)
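The relevant knobs live in pants.toml; pointing them all at the same filesystem as the repo is what makes hardlinking possible. A sketch (paths made up, option names as I recall them):
```toml
# pants.toml (sketch)
[GLOBAL]
local_execution_root_dir = "/fastdisk/pants/sandboxes"   # where sandboxes are created
local_store_dir = "/fastdisk/pants/lmdb_store"           # process results / file content
named_caches_dir = "/fastdisk/pants/named_caches"        # per-tool caches
```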
f
heh. well, I've taken that number down from 3 to 2, but I don't think it'll ever be 1
and really, if/when this hits deployment at scale it'll be running on a BB or similar farm.
what is the granularity of the cache entries? Just external process runs or are higher level things like entire rule invocations cached?
huh. I'm running again w/o the stats file and it's running longer. But this time I didn't run it to a logfile. Apparently the dynamic terminal I/O is causing some significant perf overhead?
w
Hard to say, but sometimes changing parts of the pants.toml or the command line args will trigger some re-work. I thought it was mostly related to startup, but I can't say crisply how limited that may be
f
I ran it again without stats and it was 336 seconds this time. So, the stats overhead is negligible; that difference is in the noise of our farm hosts.
h
FWIW I wrote a custom plugin that bypasses a lot of heavyweight Pants cruft that isn't really pertinent to this use case (e.g., fully materializing a BUILD graph), and I'm currently profiling it
More tomorrow or over the weekend
w
Did you have to monkey patch it in?
h
Not so far, it's a straightforward plugin.
w
Ah, I'm working on some perf plugins, and it's monkeypatch city over here
h
I am seeing some absolutely bonkers things in the CPU profiles. Will need to do this on a Linux machine to get native code in the py-spy profiles though; that isn't supported on macOS.
f
I'm happy to take a pre-alpha release and help with cleanup. Assuming you don't mind letting a perl guy at your code. 😁
h
f
Dude, rock and roll!
@happy-kitchen-89482 reading your custom_build plugin... the cache is doing some heavy lifting there. If two different directories depend on the same one, they're both going to try to build it and you're counting on the 2nd one hitting the cache. What if it's not done yet? Is there a race condition?
h
Just to clarify terms: there is the on-disk cache of process results, and there is in-RAM memoization of "tasks" (rule executions, more or less). The latter does not survive a pantsd restart; the former does. Which is why, as long as pantsd is alive, a no-op rebuild is 300ms, whereas after killing pantsd it takes 10 seconds to re-establish that in-memory state by re-executing all the rules but reading the process results from cache.
So anyway, what we're relying on is memoization. There is no race condition. If a task is requested multiple times while still in flight it is only executed once. If it's requested again after having been memoized, you get that memoized result.
👍 1
tokio futures FTW
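Conceptually it's the same trick as this toy asyncio version (illustrative only, nothing like the real Rust code):
```python
import asyncio

class Memo:
    """Deduplicates requests for a key: the computation runs at most once."""

    def __init__(self):
        self._tasks: dict[str, asyncio.Task] = {}

    async def get(self, key, compute):
        # The first requester creates the task; everyone else awaits the same
        # task, whether it's still in flight or already finished.
        if key not in self._tasks:
            self._tasks[key] = asyncio.create_task(compute(key))
        return await self._tasks[key]

async def build(key):
    await asyncio.sleep(0.1)  # stand-in for actually running a process
    return f"result of {key}"

async def main():
    memo = Memo()
    # Three concurrent requests for the same key: `build` only runs once.
    print(await asyncio.gather(*(memo.get("lib_a", build) for _ in range(3))))

asyncio.run(main())
```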
But presumably this is what you want?
Rules are supposed to be pure, with no side effects
f
Yes, exactly what I want
h
For example this means that if you run your codegen or preprocessor in order to do dep inference, it won't run again during the build
(if you had them be separate passes for some reason, but my point in that demo is that they probably shouldn't be)
f
hmm. My run crashed:
08:52:58.41 [ERROR] 1 Exception encountered:

Engine traceback:
  in `custom-build` goal

IntrinsicError: An error occurred when attempting to materialize a working directory at "/tmp/pants-sandbox-93xFX3": Was not present in the local store: Digest { hash: Fingerprint<83e9a74e8be0d776d3d741eb3c600beb6917365cec18fb4e3e6b20c0019ee778>, size_bytes: 32 }
That kind of reads like a race condition.
Though, I suppose it could be the result of running inside an NFS directory? (My workspace and ~/.cache, not /tmp)
I deleted ~/.cache/pants and now things work correctly. Must have just been corrupted from a previous killed run?
so the total from-scratch build was just under 14 minutes
h
Hmm, we see those in CI from time to time, they are not well-understood at present
Interesting that it is so much longer than on my local machine, but I'm guessing NFS is part of it?
f
Yeah, that's my assumption. I'm tempted to check it out into local storage in /tmp, but given that our real work has to happen in NFS land for Reasons, I'm not sure it's worth the effort.
@happy-kitchen-89482 could one conceivably make a
tailor
goal to generate your
build.json
files instead of
BUILD
as part of the
custom_build
plugin?
h
One could, but would one need to? What are the inputs from which these would be generated? In this contrived scaling example we have to get the dependencies from somewhere since there is nothing to infer them from, but in your real-world use case aren't the deps supposed to be inferred from the sources?
f
Yes, deps are generally inferred. But enumerating the list of sources could use the help.
h
I guess my point is that if the information needed to generate the build/BUILD files is already in the repo, then why take the trouble to write it out into a checked-in file? You could just infer it at build time
those auxiliary build files should be for things that can't be inferred and have to be checked in
tailor
is just an affordance for the fact that "traditional Pants" relied on the existence of some sort of stub BUILD file in various ways. But in a custom plugin you're not tied to that.
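i.e. inside the plugin you'd have something shaped like this, purely as an illustration (the `source`-statement convention here is invented, not taken from your setup):
```python
import re
from pathlib import Path

# Hypothetical convention: a line like `source ../libA/build.sh` in a
# directory's script declares a dependency on directory ../libA.
SOURCE_RE = re.compile(r"^\s*source\s+(\S+)", re.MULTILINE)

def infer_directory_deps(script: Path) -> set[Path]:
    deps = set()
    for match in SOURCE_RE.finditer(script.read_text()):
        deps.add((script.parent / match.group(1)).resolve().parent)
    return deps
```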
f
Hmm. That's a really good idea. Thanks.