# general
f
I'm trying out Pants 2.13 and I'm seeing a disappointing performance regression in my fundamental "let's see what's changed" CI command:
./pants --changed-since=HEAD --changed-dependees=transitive list
It's gone from about 1 min to 2 min from 2.11.0 to 2.13.0a0. I can reproduce these times pretty consistently by flipping between branches with the different versions and then killing pantsd and nuking the lmdb_store and named_caches. More details in 🧵
h
Try 2.13.0a1, and make sure --owners-not-found-behavior is not used: https://github.com/pantsbuild/pants/pull/15931
(PS thanks for reporting - it is super helpful to get feedback like this)
💯 1
f
Okay, I'll give that a try. Running analysis on a related issue where the pantsd cache seems to miss for... no reason. Like, I just run the same command (with a | xargs something-else tacked on) one line later, and that second command has to do the whole dependee check again
h
I wonder if xargs is messing things up. I vaguely recall it causing problems in the past
f
Here's the code that can reproduce that cache miss:
time ./pants list --changed-since=HEAD --changed-dependees=transitive > artifacts/transitive.targets.txt
time ./pants --spec-files=artifacts/transitive.targets.txt filter --target-type=python_test --granularity=file '--tag-regex=^unit$' '--address-regex=-^pants-plugins'
time ./pants --spec-files=artifacts/transitive.targets.txt filter --target-type=python_test --granularity=file '--tag-regex=^local$' '--address-regex=-^pants-plugins'
time ./pants --spec-files=artifacts/transitive.targets.txt filter --target-type=python_test --granularity=file '--tag-regex=^systest$' '--address-regex=-^pants-plugins'

time ./pants --changed-since=main --changed-dependees=direct list \
      | xargs ./pants filter --target-type=test_component \
      | xargs ./pants dependees \
      | xargs ./pants filter --tag-regex='^systest-long$' \
      | xargs ./pants source-files

time ./pants --changed-since=main --changed-dependees=direct list \
      | xargs ./pants filter --target-type=test_component \
      | xargs ./pants dependees \
      | xargs ./pants filter --tag-regex='^systest-multicloud$' \
      | xargs ./pants source-files
That last command is exactly like the one before it, except it filters on a different tag (I know filtering this way is deprecated, haven't gotten around to changing it yet). But the cache misses completely on xargs ./pants dependees in that last command, and it has to recompute the entire dep graph (which takes another 2m20s 😅)
h
btw in 2.13, no need for the filter goal -- you can use --filter-target-type etc. from anywhere 🙂
f
Yeah, hadn't gotten around to that yet
Obviously I could rewrite this to use a spec file in this case; it's just odd to get such a nasty cache miss
Timings of the last two commands:
real    0m11.609s
user    0m3.131s
sys     0m0.389s

real    2m44.740s
user    0m3.125s
sys     0m0.389s
w
which was cherry-picked for a1, so it's definitely worth trying.
f
doing so now, just wanted to finish dumping the info on the behavior I observed
👍 1
w
yea, thanks: a repro case will be very helpful if there is anything left over
f
2.13.0a1 helped a bit but still 😕
real    1m27.762s
user    0m0.620s
sys     0m0.076s
And that cache miss happened again
w
mm, thanks. please open a ticket with the repro case!
f
I don't think I'm using --owners-not-found-behavior, because I don't specify anything related to it
👍 1
h
We're definitely interested in figuring that out. I could not personally find a regression between 2.12 and 2.13 last week, so agreed that whatever repro you'd be able to report would help a ton
f
It's a regression between 2.11 and 2.13 (I skipped 2.12). Do you have any tips on the repro? I can't ship our whole repository, but I will say that we abuse recursive globs for file targets at the moment. Maybe generating random files would do it. I really don't have a lot of time between now and Friday to come up with a good repro case, but I'll try
w
just the commands should be sufficient.
this type of slowdown tends to be of an algorithmic big-O variety, so it won’t reproduce as strongly in a smaller repo… but it will still reproduce.
f
It's those commands I listed above
So I can reproduce the regression just with ./pants dependees ::
I've also noticed that I get basically no caching performance gain at all without pantsd for that command, i.e., ./pants --no-pantsd dependees :: is consistently slow; I would have expected some of the work to be cached to disk, so I find this surprising
❯ for i in (seq 1 10)
      echo "Trial #$i"
      time ./pants --no-pantsd dependees ::
  end
Trial #1
⠴ 57.09s Searching for `bash` on PATH=/usr/bin:/bin:/usr/local/bin
⠴ 70.27s Map all targets to their dependees

________________________________________________________
Executed in  115.58 secs    fish           external
   usr time  138.44 secs  333.00 micros  138.44 secs
   sys time   15.15 secs   82.00 micros   15.15 secs

Trial #2

________________________________________________________
Executed in   95.70 secs    fish           external
   usr time  114.68 secs    0.00 micros  114.68 secs
   sys time   12.70 secs  533.00 micros   12.70 secs

Trial #3

________________________________________________________
Executed in   90.90 secs    fish           external
   usr time  108.60 secs    0.00 micros  108.60 secs
   sys time   11.85 secs  515.00 micros   11.85 secs
so there is a bit of a speedup... something is being cached
okay... I think I may have found the issue: it's probably my (ab)use of recursive globs for target defs. If I kill my big recursive-glob targets and run ./pants tailor, I get a major speedup
❯ time ./pants --no-pantsd dependees ::

________________________________________________________
Executed in   50.06 secs    fish           external
   usr time   60.92 secs  769.00 micros   60.92 secs
   sys time    7.21 secs  901.00 micros    7.21 secs
Anyways... this whole dependency + cache thing is frustrating. I feel like one of the reasons I liked Pants was because of dep inference, but without good caching around it, it almost feels like a solution similar to Bazel + Gazelle makes more sense. I was kinda counting on decent disk-based (or Toolchain-provided remote) caching as a solution to this eventually, but if it requires pantsd to cache this stuff then that's a serious drawback
w
there are actual processes that run to extract dependencies on a file-by-file basis: those will be cached to disk across runs, regardless of pantsd. but consuming the outputs of those processes to actually construct the graph is only memoized by pantsd
bazel/gazelle wouldn't cache the relevant portion of this either: you can think of the unmemoized portion in pantsd as approximately equivalent to "reading the BUILD files from disk"
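[editor's note] A toy model of the two layers described here: per-file extraction results live in a disk cache that survives daemon restarts, while the assembled graph is only memoized in the daemon's memory. This is an illustration of the described behavior under those assumptions, not actual Pants code; all names are made up.

```python
# Toy model of the two cache layers -- NOT Pants internals.
# "disk_cache" stands in for the lmdb_store (survives pantsd restarts);
# "memo" stands in for pantsd's in-memory memoization (lost on restart).
disk_cache = {}

class Daemon:
    def __init__(self):
        self.memo = {}          # cleared whenever the daemon restarts
        self.extractions = 0    # per-file dependency-extraction processes run
        self.graph_builds = 0   # graph constructions performed

    def extract_deps(self, path):
        # Cached to disk: only runs a process the first time ever.
        if path not in disk_cache:
            self.extractions += 1
            disk_cache[path] = f"deps-of-{path}"
        return disk_cache[path]

    def build_graph(self, paths):
        # Only memoized in memory: redone from the (cached) per-file
        # results after every restart.
        key = tuple(paths)
        if key not in self.memo:
            self.graph_builds += 1
            self.memo[key] = {p: self.extract_deps(p) for p in paths}
        return self.memo[key]

files = ["a.py", "b.py", "c.py"]

d1 = Daemon()
d1.build_graph(files)   # extracts all 3 files, builds the graph
d1.build_graph(files)   # memo hit: no extra work

d2 = Daemon()           # simulated pantsd restart
d2.build_graph(files)   # extractions hit the disk cache, but the graph is rebuilt
```

In this model, restarting the daemon costs one graph rebuild but zero re-extractions, which matches the partial speedup seen in the --no-pantsd trials above.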
f
why does that take 50 sec?
w
and “determining whether gazelle is up to date” is the portion that is cached to disk in pants: i.e., “is the BUILD file updated for the current source file”
f
(I'm sorry, I'm just really frustrated today, and I'm gonna get hammered on things like this in the coming weeks as I try to push this project forward)
w
why does that take 50 sec?
good question! gathering a trace of the use case would be useful. Toolchain has infrastructure for collecting the “workunits” via a plugin… in the absence of that, i think that Joshua Cannon had also worked on something to collect and aggregate the workunits
f
no wait... it takes 20 sec for ./pants --no-pantsd dependees :: if I get rid of all recursively defined targets and replace them with 111 build file defs
w
hm… that’s an interesting datapoint. if anything, i would think that fewer targets would take less time to match in that case. worth following up on.
f
so that's 2 min to 20 sec, and probably something that points to a superlinear big-O thing on "number of targets per generator/owner"
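[editor's note] The 2-min-to-20-sec observation is consistent with a cost model where ownership lookups scale with the number of files per owning target. The following is a guessed cost model to illustrate the superlinearity, not Pants code; `ownership_work` is a hypothetical helper:

```python
# Toy cost model (an assumption, NOT Pants internals): suppose resolving the
# owner of each file costs a scan over its owning target's file list. Then one
# giant recursive-glob target does quadratic total work, while many small
# per-directory targets do far less for the same number of files.

def ownership_work(files_per_target):
    """Total lookups if every file scans its owning target's file list once."""
    return sum(n * n for n in files_per_target)

TOTAL_FILES = 1000

# One target owning everything via a recursive glob:
one_big = ownership_work([TOTAL_FILES])      # 1000 * 1000 = 1,000,000

# 100 per-directory targets of 10 files each (same 1000 files):
many_small = ownership_work([10] * 100)      # 100 * (10 * 10) = 10,000
```

Under this model, splitting one recursive-glob owner into 100 per-directory owners cuts the work by 100x, roughly the shape of the speedup reported above.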
w
yea. all of this is stuff that we’re very well positioned to investigate in 2.13, because this code has changed quite a bit
f
I'll still file a bug on this and try to write it up
w
yes please. these will be blocking issue(s) for 2.13.
f
i would think that fewer targets would take less time to match in that case. worth following up on.
I didn't have fewer targets, I had fewer owners. I was using a bunch of sources=["dir/**/*.py"]
👀 1
h
maybe we're overevaluating globs or something
@flat-zoo-31952 we could revisit getting you set up with Toolchain's buildsense UI, because that would give us very fine-grained traces we can use to debug this
f
Sounds nice, but I don't have the time to push a third-party vendor integration through right now (security concerns, politics, etc.); would love to discuss that eventually
I might try 2.12 as well, but I originally went to the 2.13 branch because I was able to build it for ARM. I can probably cherry-pick some things to a forked 2.12 branch to make ARM buildable on 2.12. I'll talk with my team tomorrow and see what we can do
w
yea, possibly. if you encounter CLI arg performance issues on 2.12 though, it will be quite difficult to cherry-pick fixes there. 2.13 will be a much easier target for fixes.
(because these codepaths changed quite a bit between 2.12 and 2.13)
f
I'll consider that. The performance is so much better with 111 BUILD files that I may just try to either 1) ignore the performance issue or 2) ignore ARM and stick with 2.11... until a few weeks from now when I can put BUILD files everywhere in the repo
Using one giant BUILD file is bad, but I need ./pants tailor working locally on all dev machines before I can get rid of it
👍 1
w
yea, that’s something we’d be interested in looking at too. we can tackle these with some additional urgency: we definitely consider them to be broken windows.
filing the one(s) that are blocking you would be helpful.
🙏🏻 1
🙏 1
f
I will try to file these soon. I'm just slammed this week (trying to actually get Pants running on 10% of our pipelines before I leave on vacation, and trying to tie up a bunch of loose ends to do this)
🤞 1
w
good luck! let us know how we can help.
f
So I did some playing around with this, including randomly generating large source trees to simulate our layout, but I'm not able to reproduce this issue outside our actual repository. If I can't make a reproducer outside of a repo that I can't share, I'm not sure what good a bug report will do 😕
-ldebug has some relevant info... this is an excerpt; it looks to me like it's stalling on setting up a sandbox to look for Python binaries?
14:24:36.68 [DEBUG] Launching 1 roots (poll=false).
14:24:37.83 [DEBUG] Completed: Find all targets in the project
14:24:47.25 [DEBUG] Completed: acquire_command_runner_slot
14:24:47.25 [DEBUG] Running Searching for `python3.10` on PATH=/usr/bin under semaphore with concurrency id: 5, and concurrency: 1
14:24:47.25 [DEBUG] Completed: acquire_command_runner_slot
14:24:47.25 [DEBUG] Running Searching for `python3.9` on PATH=/usr/bin under semaphore with concurrency id: 6, and concurrency: 1
14:24:47.26 [DEBUG] Completed: Extracting an archive file
14:24:47.26 [DEBUG] Completed: pants.core.util_rules.external_tool.download_external_tool
14:25:48.97 [DEBUG] Completed: setup_sandbox
14:25:48.97 [DEBUG] Completed: setup_sandbox
14:25:48.97 [DEBUG] Obtaining exclusive spawn lock for process since we materialized its executable Some("./find_binary.sh").
14:25:48.97 [DEBUG] Obtaining exclusive spawn lock for process since we materialized its executable Some("./find_binary.sh").
14:25:57.63 [DEBUG] spawned local process as Some(122486) for Process { argv: ["./find_binary.sh", "python3.10"], env: {"PATH": "/usr/bin"}, working_directory: None, input_digests: InputDigests { complete: DirectoryDigest { digest: Digest { hash: Fingerprint<1824155fc3b856540105ddc768220126b3d9e72531f69c45e3976178373328f3>, size_bytes: 91 }, tree: "Some(..)" }, nailgun: DirectoryDigest { digest: Digest { hash: Fingerprint<e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855>, size_bytes: 0 }, tree: "Some(..)" }, input_files: DirectoryDigest { digest: Digest { hash: Fingerprint<1824155fc3b856540105ddc768220126b3d9e72531f69c45e3976178373328f3>, size_bytes: 91 }, tree: "Some(..)" }, immutable_inputs: {}, use_nailgun: {} }, output_files: {}, output_directories: {}, timeout: None, execution_slot_variable: None, concurrency_available: 0, description: "Searching for `python3.10` on PATH=/usr/bin", level: Debug, append_only_caches: {}, jdk_home: None, platform_constraint: None, cache_scope: PerRestartSuccessful }
14:25:57.86 [DEBUG] spawned local process as Some(122487) for Process { argv: ["./find_binary.sh", "python3.9"], env: {"PATH": "/usr/bin"}, working_directory: None, input_digests: InputDigests { complete: DirectoryDigest { digest: Digest { hash: Fingerprint<1824155fc3b856540105ddc768220126b3d9e72531f69c45e3976178373328f3>, size_bytes: 91 }, tree: "Some(..)" }, nailgun: DirectoryDigest { digest: Digest { hash: Fingerprint<e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855>, size_bytes: 0 }, tree: "Some(..)" }, input_files: DirectoryDigest { digest: Digest { hash: Fingerprint<1824155fc3b856540105ddc768220126b3d9e72531f69c45e3976178373328f3>, size_bytes: 91 }, tree: "Some(..)" }, immutable_inputs: {}, use_nailgun: {} }, output_files: {}, output_directories: {}, timeout: None, execution_slot_variable: None, concurrency_available: 0, description: "Searching for `python3.9` on PATH=/usr/bin", level: Debug, append_only_caches: {}, jdk_home: None, platform_constraint: None, cache_scope: PerRestartSuccessful }
14:25:58.23 [DEBUG] Completed: Searching for `python3.9` on PATH=/usr/bin
14:25:58.23 [DEBUG] Completed: Searching for `python3.10` on PATH=/usr/bin
14:25:58.23 [DEBUG] Completed: Scheduling: Searching for `python3.10` on PATH=/usr/bin
14:25:58.23 [DEBUG] Completed: Scheduling: Searching for `python3.9` on PATH=/usr/bin
h
oh that's an interesting angle, if the slowdown isn't from our Specs changes but instead from something else
f
There's like a gap of a minute in there where it's not doing... anything? Possible race condition? This is all before the dep calculation starts, too, and dep calc performance is actually really reasonable according to these logs
👍 1
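[editor's note] One way to pin down where that minute goes is to diff consecutive timestamps in the -ldebug output. A minimal sketch (a hypothetical helper, not a Pants tool) that finds the largest pause:

```python
import re
from datetime import datetime

# Matches the leading HH:MM:SS.ff timestamp on each -ldebug line.
TS = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{2}) ")

def largest_gap(lines):
    """Return (seconds, line_before, line_after) for the longest pause
    between consecutive timestamped log lines."""
    stamps = []
    for line in lines:
        m = TS.match(line)
        if m:
            stamps.append((datetime.strptime(m.group(1), "%H:%M:%S.%f"), line))
    best = (0.0, "", "")
    for (t0, before), (t1, after) in zip(stamps, stamps[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, before, after)
    return best

# Two lines from the excerpt above: the stall sits between them (~61.7s).
sample = [
    "14:24:47.26 [DEBUG] Completed: pants.core.util_rules.external_tool.download_external_tool",
    "14:25:48.97 [DEBUG] Completed: setup_sandbox",
]
gap, before, after = largest_gap(sample)
```

Running this over a full -ldebug capture would surface whichever workunit the run actually stalls on, without needing the workunit-plugin infrastructure mentioned earlier.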
h
Are you able to share the logs as a gist? Fine if over DM. (Although probably not to me, I'm out all week finishing a move)
f
yeah I can scrub them a bit and share them as a gist
I'm curious, is there a way to configure logging to send different levels to different streams/files?
now I'm running into what looks like a hang when I attempt to redirect stderr to collect the logs, either to a stream or to a file
w
@flat-zoo-31952: it’s possible that piping into a file that pantsd is watching is causing it to constantly restart
(or is this --no-pantsd …?)
f
it is pantsd, but I've reproduced it with a file outside the tree (I think)
w
i’ll try to repro. but maybe pipe into an ignored location like .pants.d, dist, or outside of the tree
f
can repro with 2> /var/tmp/some-file.log
w
yea, doesn’t look like an invalidation issue.
f
I can't repro outside our repository but i'm adding debug and trace level logs
Let me know if there's something else I can do
w
given that the deadlock is just in the context of trying to debug the actual performance issue, i’ll probably start by looking at the performance issue. sorry for all of the trouble!
f
I can repro the deadlock on stderr redirection in the pants repo itself
I haven't been able to repro the performance issue there, so I wouldn't assume they're related
Thanks for looking into it, sorry it took me a bit to get proper bug reports