# development
e
I'm on release duty today for 2.16.0.dev5. It looks like there are no unhandled deprecations; so I'll just proceed with a bump + changelog PR this afternoon.
It looks like 2.14.1 is probably ready to graduate - it's been a few weeks at rc0; so I'll do that unless someone speaks up.
👍 1
And 2.15.x has a crop of new picks; so an rc2 looks like it's in order soon. It's known that running directly from Python sources has a bug - seen by both Color & Toolchain: https://github.com/pantsbuild/pants/issues/17987 - and that has no fix or even root cause yet; so rc2 will not be the last. I could hold off on that one then unless folks see a need to publish rc2.
w
yea, fine with holding off on another 2.15.0rc.
thank you!
e
I get a lot of timeouts for pantsd starting (60s) on both 2.14.1 and 2.16.0.dev5: https://github.com/pantsbuild/pants/actions/runs/3919468305/jobs/6702028543#step:8:237 I think I'm doing my ~3rd restart on those shards for both. Always / only macOS 11 x86_64 - which is a GH-hosted runner. Is this a well-known experience?
h
Seeing them too, and it's not well-known to me anyway.
e
@happy-kitchen-89482, so I think the timeouts I'm seeing are always macOS 11 x86_64 Python 3.7 ... why are wheels built on macOS 11 x86_64 at all? We also have a macOS 10.15 x86_64 shard and it builds for Python 3.7 -> 3.9, which should suffice for x86_64 afaict.
Yeah, right now - pre Linux ARM - we publish 10 wheels. We should only be publishing 7 unless I misunderstand something about 10.15 forward compat.
h
You're probably right.
e
Should be 10.15 x86_64 x 3.{7,8,9}, Mac ARM, then the same 3 for Linux x86_64
Ok, well just killing the shard is punting on a real problem. I need to figure out the timeouts.
60s to start - even in a resource-constrained env - is a bad, bad sign.
h
The history, if I remember it correctly, was: GitHub Actions EOL-ed macOS 10.15 runners, so we switched those shards to macOS 11. Then we decided that we didn't want to abandon 10.15 support just because of a CI resource issue, so we set up self-hosted runners for 10.15, without then reasoning about whether we could drop the macOS 11 ones.
e
Ok, yeah - that makes sense.
h
So the macOS 11 wheel shard exists from inertia, essentially
But yeah, we should figure out the timeouts regardless, because I'm sure they will bite us another way. I'll take a look.
Weird that the test shard, running on the same runner type, does not have this issue
e
It's specifically in the testutil wheel test, I'm drilling there.
Ah no, that's wrong, it's just after that.
h
I have a CI run set up that will let me SSH into the box; I'll poke around manually and see if I can see anything suspicious
I'm looking at https://github.com/pantsbuild/pants/actions/runs/3919468305/jobs/6702028543 for reference, I think it's failing in
USE_PY39=true ./build-support/bin/release.sh build-local-pex
e
Ok, and definitely not always 3.7, the 60s pantsd timeout happens randomly in substeps depending on the run fwict.
Yeah @happy-kitchen-89482 I think which step it fails in is probably not significant now.
Yeah, confirmed. I've just checked a few of my 7 runs and the substep / Python interpreter involved varies. The crash is always a too-slow pantsd startup, though.
h
Urgh, that will make it harder to reliably reproduce
This continues to be so excellent though: https://github.com/marketplace/actions/debugging-with-ssh
e
So, the 2.14.1 case is particularly interesting since there have been no intervening CPs and that was a while ago: https://github.com/pantsbuild/pants/actions/runs/3808683113/jobs/6479467989 That run has 0 instances of "waiting for pantsd to start..." let alone repeats.
So that implies, afaict, either: 1. pantsd startup hits something remote that can stall it, which used to be snappy and now is not, or 2. the GH runner or its software or RAM / etc. has changed in some significant way.
I can't think of anything else.
Ah, that ssh does look useful. It would be awesome if it were built-in like CircleCI, but good enough. I'll await your perusal of .pants.d logs for anything interesting.
h
So far, anecdotally, rust compilation is slow as hell.
One thing that changed recently is that, for unfortunate reasons, we now need the Go toolchain to be present when building wheels, so we have a step that installs it. It's a bit of a reach. Why would that matter, and why only on this platform?
e
Yeah, seems like slow X, bigger Y, etc. should not have an effect on a pure startup of pantsd with 0 cached lmdb.
Slower filesystem accesses - by a lot - would do it.
h
It's possible that everything is slower now on this platform, but pantsd startup is the one that happens to have a short timeout.
e
I mean 60s!
The link to your rc doesn't even show a single wait
h
Yeah, it's pathological
e
So nothing interesting in .pants.d logs?
h
I'm still waiting for rust to compile...
e
Ah, ok you don't get access to the live / past job?
h
I happened to put the "stop the world and run an ssh daemon" step early, before that build. I could change that of course, but I wanted to see what it looked like to run that step interactively.
e
Gotcha
h
I wonder if the SSH daemon itself is affecting things. I'm seeing slow rust builds, and those didn't seem particularly slow in the CI that failed.
Well, 6m rust build, I guess it's not that bad
e
Yeah, that seems right for the CI boxes.
h
OK, weird, it's now running Pants, and it's sluggish as hell, really really slow
But - pantsd did start up with no waits
e
I'm going to try inserting ssh post and flip the prior step to not fail the build.
h
Where I commented out all the jobs except the one we care about
if that's helpful
OK, if I kill and restart pantsd it does take a few seconds to start up, but not 60s
There are quite a few filed issues around runner performance variance, e.g., https://github.com/actions/runner-images/issues/3885
e
Ah, ok - missed your crib. That damn generator script is not my friend. I hand-edited for the 1st go here.
I just swore when pantsd started at the 55s mark.
h
dammit
I'm just seeing everything slow on the runner
e
Still 3 chances left.
h
I ran some little CPU benchmark and it was more than twice as slow as on my m1
Which is admittedly not apples to apples
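Something along these lines, just for a sense of scale (a sketch, not the exact script I ran):

import time

def crude_cpu_benchmark(n: int = 5_000_000) -> float:
    # Burn CPU on trivial arithmetic and return elapsed wall-clock seconds.
    start = time.perf_counter()
    total = 0
    for i in range(n):
        total += (i * i) % 7
    return time.perf_counter() - start

if __name__ == "__main__":
    # Compare the printed number between the GH runner and a local machine.
    print(f"crude benchmark: {crude_cpu_benchmark():.2f}s")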
e
That's not even bad though.
As you said, this is pathological and 2x is not that.
h
Yeah, wondering about filesystem access
e
Feels like it has to be network or disk, but we'll see here. Hopefully .pants.d logs will bear fruit.
h
I am seeing pantsd restart between runs
I see nothing unusual in the log
Yeah, I think this is probably the same thing people at large are seeing (e.g., the issue I linked above) - huge variance in macOS runner performance, with GitHub folks attributing that to mysterious infrastructure issues
e
There was nothing interesting in .pants.d logs ... disk appears local, this is a VMware-controlled box.
Thanks for the ssh trick - definitely useful to have that ability.
The oft-talked-about dream back in ~95, before the Pawtuckaway bouldering guide was out, was to have the cover be a shot of a climbing rope burning at the foot of one of the boulder problems. This never happened due to concerns about burning a large quantity of plastic / Darwin Awards and the like. Similar concerns give me pause in my current desire to burn a Mac.
h
Why would one want to burn climbing rope? As in "boulderers don't need no stinkin' rope"?
e
Exactly
h
Maybe reenacting the printer scene from Office Space, but on a mac, would be safer, and possibly more satisfying
e
Yeah, better for a box computer and vacuum tube monitor, but maybe the best option.
So the ssh was super useful but totally bore no fruit.
Even with all env vars identical to CI, could not repro after several attempts - 0 5-sec wait prints even.
I'm going to leave this alone for the night and see what the morning tells me.
h
A confounding issue may be that there is a lot of variance depending on which runner you get
I just tried again on another runner and it seemed slower than before
e
2.14.1 had actually finally completed all steps requiring a pantsd start but got auto-cancelled for exceeding 90 minutes 😕 ... I'll keep mashing retries and not try to do surgery until these two releases go green.
h
This situation is not improving. Shot in the dark - I'll try and revert the Go install, just on the unlikely chance that it was the cause.
Nope - I rewound the repo to a commit that had all its wheels build uneventfully a couple of days ago, and no dice today.
What's weird is that the test shard on the same platform type seems to run fine
Oh, here is an interesting tidbit: https://github.com/actions/runner-images/issues/6798
Someone has specifically noticed macOS 11 runners failing to allocate ephemeral ports in a reasonable amount of time.
Which could be why we're seeing effects specifically on pantsd startup
e
Alright, I still don't feel too great about combining two things here, but since the GH macOS 11 x86_64 wheels are redundant right now + actively hostile to PyPI, I'll do up a PR to rip those wheel builds out.
h
Yeah, I am forced to agree
We can continue to drill into this in a branch, or not, but we have to make forward progress on other changes, and this is an absolute blocker
We might have to give up on hosted macOS and run our own exclusively
even if we have to pay for another self-hosted MacStadium runner
e
For now I've just done the minimal CI subtraction for the wheel builds.
I'm also spinning up a new 2.16.0.dev5 run with a timer around the ephemeral port bind to check your theory.
That does not appear to be it in any smoking gun way at all. Sub ms:
00:18:25.38 [INFO] waiting for pantsd to start...
00:18:30.37 [INFO] waiting for pantsd to start...
00:18:35.38 [INFO] waiting for pantsd to start...
00:18:40.38 [INFO] waiting for pantsd to start...
00:18:45.38 [INFO] waiting for pantsd to start...
00:18:50.49 [INFO] waiting for pantsd to start...
00:18:55.38 [INFO] waiting for pantsd to start...
00:19:00.37 [INFO] waiting for pantsd to start...
00:19:05.46 [INFO] waiting for pantsd to start...
00:19:10.34 [INFO] waiting for pantsd to start...
00:19:15.37 [INFO] waiting for pantsd to start...
Traceback (most recent call last):
  File "/Users/runner/work/pants/pants/src/python/pants/bin/pants_loader.py", line 127, in <module>
    main()
  File "/Users/runner/work/pants/pants/src/python/pants/bin/pants_loader.py", line 123, in main
    PantsLoader.main()
  File "/Users/runner/work/pants/pants/src/python/pants/bin/pants_loader.py", line 110, in main
    cls.run_default_entrypoint()
  File "/Users/runner/work/pants/pants/src/python/pants/bin/pants_loader.py", line 92, in run_default_entrypoint
    exit_code = runner.run(start_time)
  File "/Users/runner/work/pants/pants/src/python/pants/bin/pants_runner.py", line 89, in run
    return remote_runner.run(start_time)
  File "/Users/runner/work/pants/pants/src/python/pants/bin/remote_pants_runner.py", line 114, in run
    pantsd_handle = self._client.maybe_launch()
  File "/Users/runner/work/pants/pants/src/python/pants/pantsd/pants_daemon_client.py", line 38, in maybe_launch
    return self._launch()
  File "/Users/runner/work/pants/pants/src/python/pants/pantsd/pants_daemon_client.py", line 62, in _launch
    pantsd_pid = self.await_pid(60)
  File "/Users/runner/work/pants/pants/src/python/pants/pantsd/process_manager.py", line 309, in await_pid
    self.await_metadata_by_name(
  File "/Users/runner/work/pants/pants/src/python/pants/pantsd/process_manager.py", line 230, in await_metadata_by_name
    self._wait_for_file(file_path, ongoing_msg, completed_msg, timeout=timeout)
  File "/Users/runner/work/pants/pants/src/python/pants/pantsd/process_manager.py", line 176, in _wait_for_file
    return cls._deadline_until(file_waiter, ongoing_msg, completed_msg, timeout=timeout)
  File "/Users/runner/work/pants/pants/src/python/pants/pantsd/process_manager.py", line 148, in _deadline_until
    raise cls.Timeout(
pants.pantsd.process_manager.ProcessManager.Timeout: exceeded timeout of 60 seconds while waiting for pantsd to start
>>> Nailgun server port bound in TimeSpan {
    start: Duration {
        secs: 1673828361,
        nanos: 88973000,
    },
    duration: Duration {
        secs: 0,
        nanos: 155000,
    },
}
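For anyone following along, the 60s limit in that traceback is a plain poll-until-deadline wait on the pantsd metadata file - roughly this shape (simplified sketch, not the actual process_manager code):

import os
import time

class Timeout(Exception):
    pass

def wait_for_file(path: str, timeout: float = 60.0, poll: float = 0.1, log_every: float = 5.0) -> None:
    # Poll until `path` (e.g. the pantsd pid/metadata file) exists, logging every ~5s,
    # and raise Timeout once the deadline passes.
    deadline = time.time() + timeout
    next_log = 0.0
    while time.time() < deadline:
        if os.path.exists(path):
            return
        if time.time() >= next_log:
            print("[INFO] waiting for pantsd to start...")
            next_log = time.time() + log_every
        time.sleep(poll)
    raise Timeout(f"exceeded timeout of {timeout:.0f} seconds while waiting for pantsd to start")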
h
Yeah, I did a similar experiment with a toy Python server using bind(0) and got nothing
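Roughly this shape, for the record (toy sketch, not the exact script):

import socket
import time

def time_ephemeral_binds(iterations: int = 100) -> None:
    # Repeatedly ask the kernel for an ephemeral port and track the worst-case latency.
    worst = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind(("127.0.0.1", 0))  # port 0 => OS picks an ephemeral port
            sock.listen(1)
        worst = max(worst, time.perf_counter() - start)
    print(f"worst bind+listen over {iterations} attempts: {worst * 1000:.3f}ms")

if __name__ == "__main__":
    time_ephemeral_binds()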
e
Hrm, Linux x86_64 now: https://github.com/pantsbuild/pants/actions/runs/3926107038/jobs/6711525702#step:9:761 I am a bit worried we have our own pathology.
h
This is starting to feel like the gremlin on the wing
e
I honestly think this is all recent belt-tightening by Microsoft / oversubscription. That's my guess and I'm sticking to it. The fact that the Linux shards (which are hosted on Azure now) also often show many "waiting ..." lines now pushes me there, since we have the wonderful null case of 2.14.1.
Ok, morning dawns and I was being dumb. The 2.14.1 case is still main - I will forever find branch release notes on main confusing. So it is very likely main contains something new that makes pantsd startup slow. I'll start adding / upping debug to that process to find out where the time is being taken.
>>> Took 38.039s to setup_graph
So all the time is slow rule graph setup. Doing a few more checks to confirm that is consistent.
Yeah, it's consistent. From 20s to 50s across machines: both our self-hosted Mac runners, Linux ARM, and the GH Linux runners.
I'm pretty sure this past week's releases won't happen. Things are already bleeding into today with no path forward visible yet; so if you're waiting on stuff I'd be prepared to wait until this Friday's releases.
h
On my laptop setup_graph takes 3.5 seconds
This appears to only manifest in CI
e
Yeah, I'm digging further. It's still pretty unacceptable and didn't use to happen. Afaict that is true too of our Mac runners, and those have not changed I think, so that seems to point back at Pants.
h
Yeah, for sure unacceptable, just trying to figure out what is different in CI.
I'm running
PANTS_CONFIG_FILES=pants.ci.toml USE_PY39=true ./build-support/bin/release.sh build-wheels
locally and it seems fine
Same exact command and same environment
setup_graph takes 6.3 seconds when I run it on the linux ARM machine directly via SSH. Odd that it's ~double the time on my M1, but still a far cry from 20-50 seconds
@enough-analyst-54434 I notice that in https://github.com/pantsbuild/pants/pull/17999 you're timing the creation of the bootstrap graph, but not the second call to
setup_graph()
in pants_daemon_core.py? In my measurements that first one is ~0.5-1 second and the other one is ~3-6 seconds. So timing that second one might be even worse than the 20-50 seconds you're seeing?
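Something like this wrapped around both call sites would make the comparison easy (hypothetical helper that just mimics the ">>> Took Ns to ..." output format; not claiming it's what the PR actually does):

import time
from contextlib import contextmanager

@contextmanager
def log_timing(label: str):
    # Print a ">>> Took Ns to <label>" line around a block, matching the debug output format above.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f">>> Took {time.perf_counter() - start:.3f}s to {label}")

# Hypothetical usage at both call sites:
# with log_timing("setup_graph (bootstrap)"):
#     bootstrap_graph = setup_graph(...)
# with log_timing("setup_graph (full)"):
#     full_graph = setup_graph(...)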
e
Seems not possible, since the cases I see are within 10% of the setup graph. The setup graph is 38s, the overall startup is ~45s, etc.
h
huh
So maybe that second one is taking just the expected 3 seconds. Which would be interesting information, since it should be building a bigger graph.
e
There are 2 graph setups. The 2nd is after ng starts, so does not affect the 60s startup timeout.
We have the bootstrap scheduler, that's blocking ng, then the full scheduler.
You can see the post startup 2nd scheduler log lines in CI too. Slightly longer setup_graph times, but comparable.
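So, roughly this ordering (toy stand-ins only, to make the sequencing concrete - not the actual daemon code):

import time

def setup_graph(label: str) -> str:
    # Stand-in for the expensive rule-graph construction.
    time.sleep(0.1)
    return f"{label}-scheduler"

def start_nailgun_server(scheduler: str) -> None:
    # Stand-in for the point at which pantsd writes its metadata and the client stops waiting.
    print(f"nailgun up, backed by {scheduler}")

def pantsd_startup() -> str:
    bootstrap = setup_graph("bootstrap")  # blocks nailgun startup, so it counts toward the 60s client timeout
    start_nailgun_server(bootstrap)
    full = setup_graph("full")            # the 2nd setup_graph, after startup; logged separately in CI
    return full

if __name__ == "__main__":
    pantsd_startup()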
h
So here's what's driving me nuts: this is from the test shard -
>>> Took 3.786s to setup_graph
>>> Took 5.115s to create PantsDaemonCore
What is special about the wheel build shards?
If I ssh into the same machine and run the wheel build steps manually - no slowdown
OOOOH there is one difference I didn't account for on those shards: we set
MODE=debug
That's it
with that I get 50 seconds
Ironically, according to the comments, we use MODE=debug on PR builds "to speed things up"...
OK, we have a smoking gun now
Now I can bisect
e
Ok, that makes sense. Thanks. I'm kinda done with Pants atm and losing steam.
h
I'll follow up on this, should be easy to bisect
So far I'm not sure there's a single commit that caused this, it might be more of a boiling frog
e
Ok, I'm going to send up another CI change PR then: eliminate debug mode. The boiling frog won't be solved quickly, I assume. I've long thought about serializing the rule graph, since there is 0 need to be doing this work on every startup with the same backends, but that's work, it's a hack masking the real problem, and who knows how much the payoff would be, if any.
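The sort of thing I have in mind, very hand-wavy (hypothetical sketch; cached_rule_graph, build_rule_graph, and the cache key are all made-up stand-ins for whatever the real machinery would need):

import hashlib
import pickle
from pathlib import Path

def cached_rule_graph(backends: list[str], cache_dir: Path, build_rule_graph):
    # Reuse a previously-serialized rule graph keyed on the backend list; otherwise build and store it.
    key = hashlib.sha256("\n".join(sorted(backends)).encode()).hexdigest()
    cache_path = cache_dir / f"rule-graph-{key}.pickle"
    if cache_path.exists():
        return pickle.loads(cache_path.read_bytes())
    graph = build_rule_graph(backends)  # the expensive setup_graph-style work
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(pickle.dumps(graph))
    return graph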
h
Turns out, not a boiling frog exactly. This single commit took the startup time from ~20s to ~50s: https://github.com/pantsbuild/pants/pull/17741
20s is still somewhat ridiculous, but that commit turned the heat up on the frog dramatically
Unclear to me why that change (which just set
cache_content_behavior = "fetch"
instead of
"validate"
) would have this effect
e
And it's unclear why any cache fetches would be occurring before pantsd starts. That sounds like a bug to me.
I mean, not a bug I guess, but what cache content is there to fetch?! There is not even a rule graph yet.
h
Yeah, no idea
This makes so little sense that I'm double-checking
e
Ok, thanks. 1st result is back and it's 38 minutes for Linux x86_64 vs a consistent 90-minute timeout before. So whatever advantage MODE=debug held in mid October has rapidly vanished.
h
OK I think that commit was a red herring
MacBook power management is confounding the timings.
I'll set up something to measure this timing over a spread of SHAs, on a linux box, and see if a pattern emerges.
But right now "boiling frog due to increasing complexity of rule graph" seems like a decent hypothesis. The reason I care is that presumably there is a corresponding slowdown in non-debug mode, but we're not noticing it as much.