# general
h
I've now had repeated instances of this happening in our CI system
```
23:06:17.95 [INFO] Long running tasks:
  3567.56s	Scheduling: Determine Python dependencies for astranis-python/astranis/gnc/odet/gmat_common.py:sources
  3567.68s	Scheduling: Determine Python dependencies for astranis-python/astranis/gnc/gmat/queried_data_files.py
  3568.93s	Scheduling: Parse Dockerfile.
```
Basically ran for an hour and timed out the build. I wasn't able to reproduce it locally, but it did fail again when we attempted a rebuild. We're just running a `./pants package ::` command with some tag filtering. Any ideas where to start looking for what could be holding up this scheduling? The logs show it makes it through building one of our artifacts and then hangs on this for the remainder of the time.
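(A minimal sketch of what that CI invocation might look like, assuming the standard `--tag` global option; the tag name itself is hypothetical:)
```sh
# hypothetical CI invocation: package only targets carrying a "release" tag
# (the tag name is an assumption, not taken from the thread)
./pants --tag=release package ::
```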
w
are you using a remote cache?
h
We do have one. There is a warning higher up that was issued prior to the first artifact being built successfully
```
19:34:41.41 [WARN] Failed to read from remote cache (1 occurrences so far): Unavailable: "Pants client timeout"
```
w
are you using the `cache_content_behavior` flag?
h
We have `cache_content_behavior = "validate"` in `pants.toml`
I'm realizing now that the build step doesn't point to the correct remote cache server
w
i would remove that setting, and allow `cache_content_behavior` to default to `fetch`… the `validate` behavior can experience this kind of issue.
more generally, `cache_content_behavior` is going to need some more work to stabilize, unfortunately
OR, if your remote cache is completely infallible, then you might be fine… but… network services are ~not
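(Before editing `pants.toml`, the setting could also be overridden for a single run to confirm the hang goes away; a sketch assuming the usual option-to-flag mapping for `cache_content_behavior`:)
```sh
# sketch: force the default "fetch" behavior for one run before removing
# cache_content_behavior from pants.toml
./pants --cache-content-behavior=fetch package ::
```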
h
Thanks for the advice! I'm actually going to disable the remote caching for this step entirely while also changing to fetch. The caching has been great for unit tests, but I'm not as concerned about it for these artifact generation steps. They generally happen fast enough that the caching isn't important.
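(Disabling the remote cache for just that packaging step can be done per-invocation; a sketch assuming the boolean `remote_cache_read` / `remote_cache_write` options and their `--no-` negations:)
```sh
# sketch: skip remote cache reads and writes for this step only, leaving the
# pants.toml defaults (and the test jobs) with remote caching enabled
./pants --no-remote-cache-read --no-remote-cache-write package ::
```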
w
hm. that shouldn’t be necessary, but up to you.
caching in general is very trusted… it’s just that setting that needs more work.
h
Right. Still going to leave it enabled for the tests. But it turns out we weren't actually using it for these steps anyway.
Because of unrecognized user error, specifically.
We have one remote cache on premises and one in AWS, and we were pointing at the wrong one this whole time. It will be a slight can of worms to rectify, so I'll let our (hopefully soon) new devops hires sort it out.
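(Once the right endpoint is known, the fix could be as small as a per-environment override; a sketch assuming the cache endpoint is configured via `remote_store_address`, with a hypothetical address:)
```sh
# sketch: point this CI environment at the intended cache server
# (hypothetical address; assumes the standard PANTS_* env-var override)
export PANTS_REMOTE_STORE_ADDRESS="grpc://onprem-cache.internal:9092"
./pants package ::
```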
w
i would view that as an opportunity to make things faster for free by fixing it then 😉
but yea, sounds good. sorry for the trouble.
h
Hmm, yeah, you're right. I bet I can fix it all the way down properly.
I just have a case of the Fridays I think
❤️ 1