# general
h
I've now had repeated instances of this happening in our CI system
```
23:06:17.95 [INFO] Long running tasks:
  3567.56s	Scheduling: Determine Python dependencies for astranis-python/astranis/gnc/odet/gmat_common.py:sources
  3567.68s	Scheduling: Determine Python dependencies for astranis-python/astranis/gnc/gmat/queried_data_files.py
  3568.93s	Scheduling: Parse Dockerfile.
```
Basically ran for an hour and timed out the build. I wasn't able to reproduce it locally, but it did fail again when we attempted a rebuild. We're just running a `./pants package ::` command with some tag filtering. Any ideas where to start looking for what could be holding up this scheduling? The logs show it makes it through building one of our artifacts and then hangs on this for the remainder of the time.
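(A minimal sketch of what that CI invocation might look like, assuming the standard `--tag` global option; the tag name itself is hypothetical:)
```sh
# hypothetical CI invocation: package only targets carrying a "release" tag
# (the tag name is an assumption, not taken from the thread)
./pants --tag=release package ::
```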
w
are you using a remote cache?
h
We do have one. There is a warning higher up that was issued prior to the first artifact being built successfully
```
19:34:41.41 [WARN] Failed to read from remote cache (1 occurrences so far): Unavailable: "Pants client timeout"
```
w
are you using the `cache_content_behavior` flag?
h
We have `cache_content_behavior = "validate"` in `pants.toml`
I'm realizing now that the build step doesn't point to the correct remote cache server
w
i would remove that setting, and allow `cache_content_behavior` to default to `fetch`… the `validate` behavior can experience this kind of issue.
more generally, `cache_content_behavior` is going to need some more work to stabilize, unfortunately
OR, if your remote cache is completely infallible, then you might be fine… but… network services are ~not
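(Before editing `pants.toml`, the setting could also be overridden for a single run to confirm the hang goes away; a sketch assuming the usual option-to-flag mapping for `cache_content_behavior`:)
```sh
# sketch: force the default "fetch" behavior for one run before removing
# cache_content_behavior from pants.toml
./pants --cache-content-behavior=fetch package ::
```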
h
Thanks for the advice! I'm actually going to disable the remote caching for this step entirely while also changing to fetch. The caching has been great for unit tests, but I'm not as concerned about it for these artifact generation steps. They generally happen fast enough that the caching isn't important.
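(Disabling the remote cache for just that packaging step can be done per-invocation; a sketch assuming the boolean `remote_cache_read` / `remote_cache_write` options and their `--no-` negations:)
```sh
# sketch: skip remote cache reads and writes for this step only, leaving the
# pants.toml defaults (and the test jobs) with remote caching enabled
./pants --no-remote-cache-read --no-remote-cache-write package ::
```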
w
hm. that shouldn’t be necessary, but up to you.
caching in general is very trusted… it’s just that setting that needs more work.
h
Right. Still going to leave it enabled for the tests. But it turns out we weren't actually using it for these steps anyway.
Because of unrecognized user error, specifically.
We have one remote cache on premises and one in AWS, and we were pointing at the wrong one this whole time. It will be a slight can of worms to rectify, so I'll let our (hopefully soon) new devops hires sort it out.
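(Once the right endpoint is known, the fix could be as small as a per-environment override; a sketch assuming the cache endpoint is configured via `remote_store_address`, with a hypothetical address:)
```sh
# sketch: point this CI environment at the intended cache server
# (hypothetical address; assumes the standard PANTS_* env-var override)
export PANTS_REMOTE_STORE_ADDRESS="grpc://onprem-cache.internal:9092"
./pants package ::
```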
w
i would view that as an opportunity to make things faster for free by fixing it then 😉
but yea, sounds good. sorry for the trouble.
h
Hmm, yeah, you're right. I bet I can fix it all the way down properly.
I just have a case of the Fridays I think
❤️ 1