# plugins
g
How do I debug why one rule isn't being cached? It looks like it should be cached, but it always runs. It's an EngineAwareReturnType, but cacheable is True in every situation I can see. This is the rule in question, and this is the return type.
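Roughly, the shape being described is something like this (a hedged sketch with made-up names, not the actual plugin code):
```python
# Minimal sketch of an EngineAwareReturnType whose cacheable() always reports
# True. The class name and field are illustrative, not the real plugin code.
from dataclasses import dataclass

from pants.engine.engine_aware import EngineAwareReturnType
from pants.engine.fs import Digest


@dataclass(frozen=True)
class ImageBundle(EngineAwareReturnType):
    digest: Digest

    def cacheable(self) -> bool:
        # Always True here, so the rule returning this should stay memoizable.
        return True
```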
c
that rule won't be cacheable unless everything in it is as well.. so I would look at the caching option on https://github.com/tgolsson/pants-backends/blob/ts/large-image/pants-plugins/oci/pants_backend_oci/util_rules/pull_image_bundle.py#L70-L75
g
cache_scope=<ProcessCacheScope.SUCCESSFUL: 'successful'>, so that should be fine?
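For context, the kind of Process request under discussion would look roughly like this; the command, paths, and surrounding types are made up for illustration:
```python
# Sketch only: with ProcessCacheScope.SUCCESSFUL the result is written to the
# local on-disk cache only when the process exits 0.
from dataclasses import dataclass

from pants.engine.fs import Digest
from pants.engine.process import Process, ProcessCacheScope, ProcessResult
from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class PullImageRequest:  # hypothetical request type
    ref: str


@dataclass(frozen=True)
class PulledImage:  # hypothetical result type
    digest: Digest


@rule
async def pull_image(request: PullImageRequest) -> PulledImage:
    result = await Get(
        ProcessResult,
        Process(
            argv=("skopeo", "copy", f"docker://{request.ref}", "oci:out"),  # illustrative command
            description=f"Download OCI image {request.ref}",
            output_directories=("out",),
            cache_scope=ProcessCacheScope.SUCCESSFUL,
        ),
    )
    return PulledImage(digest=result.output_digest)
```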
c
yea, hmmm
g
As it's a ProcessResult, it'd not continue building either way.
c
so that was cached then.. interesting.
g
Oh, do we perhaps not write cache but do read it when not using pantsd?
c
if you log the hash value of your return value, is it the same every time?
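One way to check that from inside the rule (illustrative helper, not part of the plugin):
```python
# Log the hash of a rule's return value just before returning it, so runs can
# be compared. Purely a debugging aid.
import logging

logger = logging.getLogger(__name__)


def log_return_hash(value: object) -> None:
    logger.info("return value hash: %s (type=%s)", hash(value), type(value).__name__)
```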
g
Will double-check! But it shouldn't matter, no? The inputs are the same.
c
without pantsd there is no caching, unless you have remote cache perhaps..
only memoization for each run individually, I think
g
So what I'm observing is that it does the Download, and then everything else, which should take 10 minutes, goes instantly.
b
without pantsd there is no caching
That's not true. You get the process and digest caches (the "on disk" caches). What you don't get is memoization
c
oh, so the other way around
g
Even with pantsd it does a full download.
c
thanks
b
--no-local-cache
is what turns off local disk caching
g
Hmm, could the time taken be just reading the cache? 😕 I'm noticing the output digest is suspiciously small..
16:00:07.04 [INFO] stdout: "\n"
16:00:44.47 [INFO] Completed: Download OCI image docker.io/nvidia/cuda@sha256:a0e758e31317b132713cba8b76d3cb66f2317755685b87b55e9c092ded772042
16:00:44.47 [INFO] stdout: "Digest('da9082270949af9bf7c5bee5c743d4b6f0145a1519feb29af33e0ff9cb4fafbb', 80)"
37 seconds, 80 bytes...
b
FWIW I think the `EngineAwareReturnType` logs the output even if its rule was cached. See the lint rules. I don't remember completely though, so warrants confirmation
g
Not sure I follow. The time is definitely elapsing, and the printout from before is from inside the rule. So the rule itself is running.
b
OK then ignore me 🙂 I'll let the more brainy folks chew on this
g
(Also the output digest is constant, so it's at least getting the same data each time.)
b
I'm noticing the output digest is suspiciously small..
I'm guessing the output is the directory digest. The bytes of a directory digest are essentially just the tree structure of pointers to the other directory/file digests
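i.e. something along these lines, assuming the value printed above is a directory digest:
```python
# A directory Digest is just (fingerprint, serialized_bytes_length) of the
# serialized tree node, not the total size of the file contents beneath it,
# which is why 80 bytes is plausible for a single root directory.
from pants.engine.fs import Digest

d = Digest(
    "da9082270949af9bf7c5bee5c743d4b6f0145a1519feb29af33e0ff9cb4fafbb",
    80,  # bytes of the serialized directory structure, not of the image layers
)
```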
g
Ah, right. So it's not actually the full length of all files. Makes sense, as this is a single root directory of an OCI dir.
👍 1
b
Dumb question time. Is the exit code 0? Has the exit code ever not been 0?
g
metadata=ProcessResultMetadata(total_elapsed_ms=35642, _source='ran_locally', source_run_id=2))
exit_code=0,
I believe it has been non-0 at some point in the past, yes. 😛
b
Yeah I got nothing but straws so 🙊
g
OK; let's flip the question. Why does it run any rules? The way the OCI build works is that it recursively builds a target from the ground up: each image starts by invoking the build for its base, then its build steps, and then its config. I.e., if we build image `foo` on top of `bar`, and `bar` on top of `qux`, we're seeing that when building `foo` we start by downloading `qux`, and then all other work is done automatically. But wouldn't `foo = await Get(...)` also be cached, and thus no subrules execute? So maybe the issue occurs much further up the chain, but then when this stage resolves correctly the rest falls into place.
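The recursion described is roughly this shape (all names made up for the sketch):
```python
# If the Get for the base image is served from cache/memoization, none of the
# base image's sub-rules should need to run at all.
from dataclasses import dataclass

from pants.engine.rules import Get, rule


@dataclass(frozen=True)
class BuildImageRequest:  # hypothetical
    target: str
    base: "BuildImageRequest | None" = None


@dataclass(frozen=True)
class BuiltImage:  # hypothetical
    ref: str


@rule
async def build_image(request: BuildImageRequest) -> BuiltImage:
    if request.base is not None:
        # Recurse into the base first: building foo reaches bar, then qux.
        base = await Get(BuiltImage, BuildImageRequest, request.base)
    # ... apply build steps and config on top of the base ...
    return BuiltImage(ref=request.target)
```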
h
I can't take a closer look until later in the day, but I vaguely remember that there is a key rule in the Docker backend that is purposely not cacheable (I made this change a few months ago).
I can('t type apparently)
g
This is completely separate from the docker stuff, just to be clear! It builds containers, but doesn't share a line of code outside of that.
h
Ah, OK, NM
so it doesn't rely on side-effecting into any external state repository like a docker daemon?
g
Nope, nothing leaves pants.
h
that is where things get hairy and confusing with the docker backend
"nothing leaves pants" - words to live by
g
^ why I won't use anything using docker 😄
h
Is this a plugin you're working on? Sorry, haven't been following along. It sounds excellent.
g
Yeah; I've been hacking on it for quite a while. We use it in production at Embark to publish our cloud training containers. 🙂 Very dogfoody still, though...
FWIW, this just started working randomly and now works 100%. I am... very confused. I've reverted all debug changes and it still works fine. Something something committing the cache?
h
You probably already know this, but to clarify caching: There are two layers. Rules are cached in memory on pantsd, Processes are cached on disk in the lmdb_store, so that cache survives a pantsd restart.
maybe something something pantsd restart, and the issue is with memoization?
Or are you expecting a Process to be cached and isn't?
g
I've been running with `PANTS_PANTSD=false` all this time.
h
Ah...
that will pretty much kill you
g
Yeah; but I'd expect it to be consistent. I found the issue before adding it. 😛 And now I have working cache with it, still.
So now I can see it immediately short-circuiting the process from cache, so that works as expected. I'll see if I can debug it more later... My latest few changes were things like `def __hash__` on my subsystem, and modifying various strings on it. I'm wondering if that somehow affected it... But that would be the rule cache if I understand it correctly, not the disk cache.
w
the only thing that will affect the Process cache is the Process struct itself. if you run with `-ldebug` you can look for `spawned local process as ...` lines, which include the entire process struct. if you hit the local cache, you won't see those lines at all
g
Yeah, it did spawn local process, until it stopped doing it. :(
w
k. well, when you see one of those lines, examine whether it contains any unstable data… usually environment variables, but possibly also argument order if you are iterating over `set`s, for example
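For example (illustrative):
```python
# Building argv by iterating a set can defeat Process caching, because the
# Process struct (including argv order) is the cache key.
tags = {"latest", "stable", "v1"}

argv_unstable = ("tool", "--tag", *tags)        # order may differ between runs
argv_stable = ("tool", "--tag", *sorted(tags))  # deterministic, cache-friendly
```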
g
Grep'd through the logs, and all of the runs that had `-ldebug` show the same process spawn line. I didn't have `-ldebug` on for most of them though, because the output is a bit too dense... 😞
I'll see if I can repro with a clean cache when I get time, otherwise I assume PEBKAC with PANTS_PANTSD and some weird local state.