# development
g
@wide-midnight-78598 Breaking out this convo from the original thread to avoid spamming it. Do we have any JVM aficionados here who can opine on Nailgun? It seems quite abandoned, and e.g. Buck2 dropped it while keeping their Java support.
w
@witty-crayon-22786 and @fast-nail-55400 would probably be my first two go-to sources on Nailgun specifically and then JVM more generally
f
Yeah, it's dead. I created a PR to disable it by default at first here https://github.com/pantsbuild/pants/pull/22072
Buck2 and others don't use pooling for jvm anymore, they pretty much just rely on caching afaik
w
Two things: 1) Pants uses an implementation of the nailgun protocol in Rust, so the health of the JVM/C implementation isn't that important. 2) we use it for two things: a) the connection to pantsd, b) JVM pooling
I would be very surprised if buck or bazel had stopped pooling JVMs, since JIT warmup time is still critical. Would be interested to see what they are doing instead.
On the pantsd side, both the client and server are Rust. On the JVM pooling side, the client is Rust and the server is the JVM
But: long story short: I expect that if you disable nailgun for the JVM without a replacement pooling strategy, it would have very unfortunate performance impacts
f
I would be very surprised if buck or bazel had stopped pooling JVMs, since JIT warmup time is still critical. Would be interested to see what they are doing instead.
In practice they have JVM-warm machines doing JVM-work. But I'm fairly sure the build system does not pool that work itself
expect that if you disable nailgun for the JVM without a replacement pooling strategy, it would have very unfortunate performance impacts
I'm not sure if it is actually that much of an issue in practice, we already disable it in CI because it does not produce consistent output either. Locally yes there's a case for the startup time, but I don't think it's that excessive anymore and also there are few-to-no decent alternatives
There I think options like GraalVM could be more interesting than a pooling solution
w
Last I checked, AOT compiling with GraalVM wouldn't work with compiler plugins or macros. But it's possible that it has gotten better at dynamically loading jars.
In any case: I would just recommend polling other JVM users before changing it too much. My strong suspicion is that it is still necessary.
w
Practically, I would see this as a 2-stepper anyways: 1. Remove if there are no JVM backends (which may be more work than I assume it is) 2. Deal with the JVM-enabled side in some fashion (remove, tweak, alternatives, etc)
g
Could we just reverse it to default-off? Non-breaking, JVM users can enable...
w
default-off is breaking though
Maybe deprecate it or something? After Stu's suggestion of a poll
w
Yea, it should definitely be disabled if no JVM backends. But I mean... I'm not sure how it could possibly be enabled if there are no JVM backends... there wouldn't be any JVMs to warm
w
I wondered the same - when I saw this problem in the pants repo, I had thought I had removed all the JVM code to speed up startup, but was still getting nailgun'd - but that might have been a repo-specific thing, or one of the backends transitively enabled it 🤷
w
The pants repo has JVM backends enabled
g
We have it at work, as well, and we've never had any java code anywhere.
f
It runs outside of the JVM backend
So it always runs
w
The pants repo has JVM backends enabled
This was after a diff to remove everything JVM related to cut startup from 25+ seconds to 6
w
So... I think that you might be referring to just the log message which was disabled here...? https://github.com/pantsbuild/pants/commit/0e661a10024a3ca21714e882ecf8bdd766e90eb2
The rust client is always installed, but it will never actually do anything if there are no JVM-flavored process requests
...i.e. if the JVM backends are not installed
So maybe that needs a backport somewhere.
g
It is definitely a noticeable cost. When we disabled it at work I estimated a total savings of ~a minute across all ci steps.
w
Disabled what?
f
Alright, fair enough 🙂 A poll sounds fair. I think it should be phased out since it is archived, unreliable, and already does not support the latest JVM versions. But maybe there are more concerns. I think a fair share of JVM users may have it enabled without necessarily knowing so, or maybe even without knowing what it does
g
Disabled what?
Nailgun
w
@gorgeous-winter-99296: and you have no JVM backends installed?
Just to be crystal clear: which flag are you talking about? process_execution_local_enable_nailgun?
The effect of that flag should be to disable creating the nailgun process runner, which is wrapped in a SwitchedCommandRunner: https://github.com/pantsbuild/pants/blob/main/src/rust/engine/src/context.rs#L247-L277
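To make the "switched" runner idea concrete, here is a minimal Python sketch. It is not the Pants implementation (that lives in Rust at the link above); every class, field, and function name below is hypothetical and only illustrates the routing described: with the flag off no nailgun runner is created, and with it on only JVM-flavored requests are routed to it.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class ProcessRequest:
    argv: tuple[str, ...]
    # Hypothetical marker for a "JVM-flavored" request.
    use_nailgun: bool = False


class Runner(Protocol):
    def run(self, req: ProcessRequest) -> str: ...


class LocalRunner:
    def run(self, req: ProcessRequest) -> str:
        return "local"


class NailgunRunner:
    def run(self, req: ProcessRequest) -> str:
        return "nailgun"


class SwitchedRunner:
    """Send a request to `on_true` when `predicate` matches, else to `on_false`."""

    def __init__(
        self,
        on_true: Runner,
        on_false: Runner,
        predicate: Callable[[ProcessRequest], bool],
    ) -> None:
        self.on_true = on_true
        self.on_false = on_false
        self.predicate = predicate

    def run(self, req: ProcessRequest) -> str:
        runner = self.on_true if self.predicate(req) else self.on_false
        return runner.run(req)


def build_runner(enable_nailgun: bool) -> Runner:
    local = LocalRunner()
    if not enable_nailgun:
        # Flag off: the nailgun runner is never constructed.
        return local
    # Flag on: only requests that ask for nailgun are routed to it.
    return SwitchedRunner(NailgunRunner(), local, lambda r: r.use_nailgun)
```

Under these assumptions, a repo with no JVM backends never produces a request with the marker set, so the nailgun branch is never taken even when the flag is on.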
g
I will double-check after dinner, but yeah. We should only have Python backends plus the odd ones like shell, taplo, etc. The flag sounds like the one we use yeah. Can redo the measurements too, see if it repros.
w
Huon's comment above on the info->debug change should be accurate: creating the pool has no cost, since it starts empty
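A rough sketch of why an empty pool costs nothing to create, assuming a pool that only spawns workers on demand. The class and its methods are hypothetical illustrations, not the Pants pool.

```python
from typing import Callable, Generic, List, TypeVar

W = TypeVar("W")


class LazyPool(Generic[W]):
    """A pool that starts empty and spawns a worker only when one is requested."""

    def __init__(self, capacity: int, spawn: Callable[[], W]) -> None:
        self.capacity = capacity
        self._spawn = spawn
        self._idle: List[W] = []  # nothing is started at construction time
        self.spawned = 0

    def acquire(self) -> W:
        # Reuse a warm idle worker if there is one, otherwise start a new one.
        if self._idle:
            return self._idle.pop()
        self.spawned += 1
        return self._spawn()

    def release(self, worker: W) -> None:
        # Keep the worker warm for later reuse, up to the pool capacity.
        if len(self._idle) < self.capacity:
            self._idle.append(worker)
```

Constructing `LazyPool(24, ...)` only records the capacity; the first spawn happens on the first `acquire()`.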
f
would be very surprised if buck or bazel had stopped pooling JVM
Double checked to make sure I remembered correctly, and yes, neither bazel nor buck2 pools jvm processes. Bazel does now have an experimental feature of multiplex workers which can help with what I mentioned about them using warm workers https://bazel.build/remote/multiplex
w
buck2 doesn't have working JVM rules, afaict?
And yea, bazel moved to a different pooling strategy.
But pooling is critical for local latency: just try using pants --watch check without it, right?
It sounds like the non-JVM folks in this thread want it to be quiet/get-out-of-the-way, which is reasonable. But you'd definitely want to check with other JVM users before changing whether it is enabled: the fundamentals of JVM JIT warmup haven't changed in ... three decades
f
Yeah for sure I agree with having a poll. Just thinking about further options. Maybe an option instead of a pooling alternative could be to route tasks to workers that can be warm for a workload, for example in the case of jvm?
w
Let's move a discussion of alternative pooling strategies to another thread maybe?
👍 1
f
buck2 doesn't have working JVM rules, afaict?
Sorry missed that, yes but maybe not OSS yet
f
Interesting, they have docs for Java: https://buck2.build/docs/prelude/globals/#java_library
w
f
wow
Releasing a non-working feature but not documenting that fact ... good job Meta!
😅 1
g
Ok; I am wrong, sorry. Hyperfine cannot reproduce. Confirmation bias playing us... we started with this snippet posted in a Slack channel, asking what it was:
2024-11-19 10:42:13 UTC	10:42:13.71 [INFO] Initializing Nailgun pool for 24 processes...
2024-11-19 10:42:19 UTC	10:42:19.92 [INFO] Scheduler initialized.
So I set up a PR disabling it, and when compared to the previous commit multiple steps did indeed get faster, mostly in the range of 3-6 seconds. Reviewing more commits in that time range, it's just noise...
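One way to guard against this kind of before/after comparison is to look at several runs on each side and ask whether the shift in the mean is larger than the run-to-run spread. The helper below is a hypothetical sketch with an arbitrary threshold `k`, not a substitute for a proper benchmark tool like Hyperfine.

```python
from statistics import mean, stdev
from typing import Sequence


def looks_like_noise(
    before: Sequence[float], after: Sequence[float], k: float = 2.0
) -> bool:
    """True when the shift in means is within k times the larger sample stdev.

    Needs at least two timings on each side; `k` is an arbitrary threshold.
    """
    spread = max(stdev(before), stdev(after))
    return abs(mean(before) - mean(after)) <= k * spread
```

With step timings collected from a handful of commits before and after the change, a 3-6 second difference on steps that already vary by several seconds would be flagged as noise.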
w
There is an annoying (and afaik unnamed?) phenomenon, where when a log message says that something is starting, but there is no corresponding message saying that it is finishing, it causes no end of confusion.
g
I found the original log snippet in Buildkite while reviewing this now, and in this case the question just omitted (or didn't notice) the "initializing scheduler" message.
f
Did you want to do a poll on the question about the default? (I don't know how or if I can)
Releasing a non-working feature but not documenting that fact ... good job Meta!
Yeah buck2 is pretty much the essence of open sourcing things that are not ready to open source yet
But on topic, outside of the scheduler init latency, the other points still hold so I still consider it worthwhile to deprecate in a fair, non-disruptive way
But pooling is critical for local latency: just try using pants --watch check without it, right?
Yeah I don't oppose that there's something needed to deal with that, just that Nailgun isn't it anymore
There is an annoying (and afaik unnamed?) phenomenon, where when a log message says that something is starting, but there is no corresponding message saying that it is finishing, it causes no end of confusion.
I like "log fog" for that, and it's everywhere and sometimes the other way around, only logging when finishing can spur equal levels of confusion
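A small sketch of one way to avoid "log fog": wrap the step in a context manager that always emits both a start and a finish message, with the elapsed time. The function name and message format are hypothetical, not anything Pants does today.

```python
import logging
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def logged_step(logger: logging.Logger, name: str) -> Iterator[None]:
    """Log a start message and a matching finish message with elapsed time."""
    logger.info("%s: starting", name)
    start = time.monotonic()
    try:
        yield
    finally:
        # Runs even if the step raises, so a start line is never left unpaired.
        logger.info("%s: finished in %.2fs", name, time.monotonic() - start)
```

Used as `with logged_step(log, "nailgun pool"): ...`, a reader of the log can see that the pool step itself was short and that the following seconds belong to something else.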