# development
@witty-crayon-22786 I suspect that the CI timeouts from the immut PR had to do with contention of some kind. Just running a test alone has it completing in 200s, instead of the 1000s we saw when running all tests
I was never good at concurrent code, src/rust/engine/fs/store/src/immutable_inputs.rs seems like a likely candidate
So, at a very high level, all requests for an immutable digest block all others; the lock wraps the complete map: https://github.com/pantsbuild/pants/blob/main/src/rust/engine/fs/store/src/immutable_inputs.rs#L65 That just means that if one digest takes 10s to materialize, it will head-of-line block all other digest requests. Then one of the waiters gets through, its digest is different and takes 25s, and so on. So folks in the back of the line are waiting 10 + 25 + ...
There may be more at a finer level, but that high level should explain / account for a lot.
Ah, no. The lock is dropped in the middle of that chained call after grabbing an entry. So the only head-of-line lock is for all waiters on that single digest. If there is a single digest that 100 integration tests wait on, though, they all have to wait to start.
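That corrected model can be sketched like this (in Python rather than the engine's actual Rust, with invented names like `ImmutableInputs.path` standing in for the real API): the global lock is held only long enough to grab or create a per-digest entry, and materialization happens under the per-digest lock, so only callers asking for the same digest queue behind each other.

```python
import threading

class ImmutableInputs:
    """Hypothetical Python sketch of the locking shape described above;
    the real implementation is Rust in immutable_inputs.rs."""

    def __init__(self, materialize):
        self._global_lock = threading.Lock()
        self._entries = {}  # digest -> (per-digest lock, one-shot cell)
        self._materialize = materialize

    def path(self, digest):
        # Hold the global lock only long enough to look up / insert the
        # entry, never across a (potentially 10s+) materialization.
        with self._global_lock:
            if digest not in self._entries:
                self._entries[digest] = (threading.Lock(), {})
            per_digest_lock, cell = self._entries[digest]
        # Materialize outside the global lock: only callers waiting on
        # this same digest head-of-line block each other here.
        with per_digest_lock:
            if "path" not in cell:
                cell["path"] = self._materialize(digest)
            return cell["path"]
```

Under this shape a slow digest only stalls its own waiters, which matches the observation: 100 integration tests sharing one digest still all wait on that single materialization.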
Yeah I had that same chain of thought, although admittedly wasn't confident about it
Really easy option: add some prints, and you'll see who's waiting.
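A minimal sketch of that "just add prints" tactic (a hypothetical helper, not anything in the Pants codebase): wrap the lock so every waiter announces itself and how long it queued.

```python
import threading
import time

class NoisyLock:
    """Hypothetical lock wrapper that prints who waits, and for how long."""

    def __init__(self, name):
        self._name = name
        self._lock = threading.Lock()

    def __enter__(self):
        me = threading.current_thread().name
        print(f"{me} waiting on {self._name}")
        start = time.monotonic()
        self._lock.acquire()
        print(f"{me} acquired {self._name} after {time.monotonic() - start:.3f}s")
        return self

    def __exit__(self, *exc):
        self._lock.release()
        print(f"{threading.current_thread().name} released {self._name}")
```

Swapping a suspect lock for something like this makes head-of-line blocking visible in the log without perturbing scheduling much.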
I'd like to blame the digest lock, but only certain tests are taking a long time, and I don't feel confident that those tests, and those tests alone, share "big" digests
You probably won't like this, but I will say my use of debuggers evaporated once I started having to deal with concurrent code and distributed systems. Prints are always there even if you can't be and they, for the most part, do not rock the thread scheduling boat too much.
Yeah understandable
The logging implies we're perhaps barking up the wrong tree. And actually, this kind of makes sense, since what I'm seeing is a few tests absolutely tentpoling the hell outta CI. If we were fighting over resources, I'd expect several tests to have this affliction.
In the worst offender (PyOxidizer-related) I see
======================== 2 failed in 1717.01s (0:28:37) ========================
which tells me that pytest is observing the slowdown from end-to-end. So we can rule out anything that happens in Pants before or after it
But don't we have ITs that call back into Pants via subprocess? This may not be one of those, but that class of test does not prima facie rule out anything in Pants internals.
This is, in fact, one of those kinds of tests.
(I did print stderr in the test, and had it fail, so we'd see the log lines from the daemon inside the test)
On a whim I disabled coverage, and my test time went from >1000s to 500s 🤔
And down to 170s with no contention+maximum caching
Ahh a clue! When coverage got turned off, the log from building PyOxy binaries looks normal. When I turn it back on...
000052.97 [INFO] Canceled: Building tmpyyq4pz8q/hellotest:bin with PyOxidizer
000054.02 [INFO] Filesystem changed during run: retrying in 500ms...
There it is:
154305.00 [INFO] notify invalidation: cleared 2 and dirtied 36 nodes for: {"src/python/pants/engine/internals", "src/python/pants/engine/internals/native_engine.so"}
Does adding another hardlink dirty the node?!
One last change on my PR to log what the notify event is that's being reported
I heavily suspect that it is being emitted for the hardlink count changing
```
IN_ATTRIB (*)
       Metadata changed--for example, permissions (e.g., chmod(2)), timestamps (e.g., utimensat(2)), extended attributes (setxattr(2)), link count (since Linux 2.6.25; e.g., for the target of link(2) and for unlink(2)), and user/group ID (e.g., chown(2)).
```
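The link-count case is easy to demonstrate with a small Python sketch (the `native_engine.so` name is purely illustrative): creating a hardlink leaves the content alone but bumps the target's `st_nlink`, which is exactly the metadata change IN_ATTRIB reports for the link target.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "native_engine.so")  # name is illustrative only
    with open(target, "w") as f:
        f.write("contents")
    assert os.stat(target).st_nlink == 1
    # Roughly what materializing an immutable input as a hardlink does:
    os.link(target, os.path.join(d, "hardlinked-copy"))
    # Content untouched, but the target's metadata (link count) changed --
    # that is what the watcher sees as IN_ATTRIB.
    assert os.stat(target).st_nlink == 2
```

So every hardlink materialization looks like a metadata change on the original file, even though its name and content are unchanged.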
So to wrap this up, I'm wondering whether Pants' file watcher really needs to care about metadata changes. We shouldn't care if the access time or write time changes, since we really only care about the name/content, right? https://docs.rs/notify/latest/notify/event/enum.MetadataKind.html
🙌 1
👀 1
🔥 1
We definitely care about non-exe becoming exe.
That's the only metadata change I can think of, though.
That is true. I wonder if we can capture the ones we DO care about in order to ignore the rest (namely
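If the answer is "only permissions", the filter could look something like this pure-Python sketch. The string kinds loosely mirror notify's `EventKind` / `MetadataKind` variants, but none of the names here are the real notify or Pants API:

```python
# Metadata kinds we'd still invalidate on: the exe bit lives in permissions.
# (A sketch only; names loosely mirror notify's MetadataKind variants.)
RELEVANT_METADATA = {"permissions"}

def should_invalidate(event_kind, metadata_kind=None):
    """Hypothetical predicate: should a watch event dirty graph nodes?"""
    if event_kind != "metadata":
        # Creations, removals, renames, and content writes always matter.
        return True
    # Hardlinking only bumps the link count -- a metadata-only change that
    # affects neither name nor content -- so that kind can be ignored.
    return metadata_kind in RELEVANT_METADATA
```

Under this sketch, access/write-time and link-count churn would stop dirtying nodes, while a non-exe file becoming exe would still invalidate.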
I'll let this PR stew (pun intended) https://github.com/pantsbuild/pants/pull/17875
Nice work!
Since this only impacts pants integration tests, I don't think we need a super rigorous solution.
No, but I also wonder if some of those mysterious invalidations I saw way back in the day were related to this.
Do you use hard links in your repo?
Psssssssshhhh who the hell knows what goes on in our mountain of scripts 😂
🚢 : sorry for the delay.
You're on 🌴, all good