# development
a
also, it looks like we don't set the `digest_hint` in `PathGlobsAndRoot` anywhere. were we doing that manually in the parallel `RscCompile` task before? i'm looking to see whether we could support an optimization to avoid snapshotting process execution output files/directories that we already know about, as per this comment on the mutable caches doc (which i didn't know was already implemented -- this is super awesome): https://docs.google.com/document/d/1n_MVVGjrkTKTPKHqRPlyfFzQyx2QioclMG_Q3DMUgYk/edit?disco=AAAAIva7gMw
❤️ 1
h
We don’t use PathGlobsAndRoot anymore in production. We realized we do need to expose it through the rules API though. I think we would keep the digest hint
a
yes! i am about to comment on stu's issue right now: https://github.com/pantsbuild/pants/issues/10842
stu's idea was better
the digest hint is a good idea for append-only caches, i think polling or `notify` works for caches like mypy's
and also i really like how there are possibly now two workstreams, one for parenting the mypy daemon locally, another that could make it remotable if we poll the cache dir
💯 1
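A minimal sketch of the polling idea, assuming a change in a file's size or mtime is enough to flag it. The names `snapshot_dir` and `changed_paths` are hypothetical and are not part of Pants or mypy.

```python
import os
from pathlib import Path
from typing import Dict, Set, Tuple

# Relative path -> (mtime in nanoseconds, size in bytes).
Snapshot = Dict[str, Tuple[int, int]]


def snapshot_dir(root: Path) -> Snapshot:
    """Record a cheap signature for every file under `root`."""
    snap: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            st = full.stat()
            snap[full.relative_to(root).as_posix()] = (st.st_mtime_ns, st.st_size)
    return snap


def changed_paths(before: Snapshot, after: Snapshot) -> Set[str]:
    """Return paths that were added, modified, or removed between two polls."""
    added_or_modified = {p for p, sig in after.items() if before.get(p) != sig}
    removed = set(before) - set(after)
    return added_or_modified | removed
```

Taking one snapshot before a process runs and one after would give the set of cache files that need to be captured again, without hashing files that did not change.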
and i think parenting the mypy daemon is likely to be more important for performance (and therefore higher priority), but i think that the `digest_hint` files can make pex invocations remotable while retaining the cache dir, so i think both have immediate use cases
h
I’m not sure how the daemon works, if it requires the cache to exist and be used. There is no way to turn off writing to the cache in MyPy. (I looked to save unnecessary work, since we don’t preserve it anyways)
a
i think that we could have a global cache dir for mypy the way we do for other process executions right now. i think that the daemon part would need to be a whole thing to implement (but would likely share code or ideas from the nailgun process execution code, which is currently unused), but if we could fit it into a normal `Process` execution (maybe a wrapper struct, or a separate field that says how to create the daemon), we could make it use the same kind of cache with a symlink
since we created a doc for the mutable caches, i think it might make sense to create another doc describing how we might implement a daemonized process invocation? i might do that -- @witty-crayon-22786 let me know if there is already such a doc
made a doc with an idea about a possible API for persistent workers in pants (right now, the mypy daemon seems most immediately useful). it doesn't try to specify how we would communicate with the daemon -- instead, the `client` process is expected to handle that part. it slightly modifies the API for nailgunnable processes -- all of that is freely bikeshedable. lmk if this was already done: https://docs.google.com/document/d/1hSspjRLGO05-tB16NevvUW87rqIKHjUZJ4xv2YfCL1c/edit?usp=sharing
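One way the "wrapper struct" shape could look, as a rough sketch only. Every class here is a hypothetical stand-in written for illustration: `Process` is not the real Pants type, and `DaemonSpec` and `DaemonizedProcess` do not come from the linked doc.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Process:
    """Stand-in for a one-shot process request (NOT the real Pants class)."""
    argv: Tuple[str, ...]
    description: str


@dataclass(frozen=True)
class DaemonSpec:
    """Hypothetical: how to start the long-lived server and where its cache lives."""
    argv: Tuple[str, ...]
    cache_dir_name: str


@dataclass(frozen=True)
class DaemonizedProcess:
    """Wraps a normal client Process with instructions for creating its daemon.

    The client is responsible for talking to the daemon; this type only says
    how the daemon is created.
    """
    client: Process
    daemon: DaemonSpec

    def daemon_key(self) -> Tuple[Tuple[str, ...], str]:
        """Requests with equal keys could share one running daemon."""
        return (self.daemon.argv, self.daemon.cache_dir_name)
```

Keeping the daemon description separate from the client request means two different client invocations can be matched to the same running daemon by comparing `daemon_key()`.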
w
good morning
a
good morning!
w
reading things
a
there were lots of things! i tried to edit all the github issues down to make them clearer. i recognize there's a lot, sorry about that
you'll see a github ping from me in a sec, it's just writing down your response on the mutable caches doc into the issue i created
thanks a ton stu that multiplied my efforts tenfold
w
Sure thing. I think the daemonized process API you've suggested looks good, and it's worth removing the parsing magic if we're going to use the API for more languages (or we could move the parsing to the Python side to construct the more specific type).
🙌 1
a
ok, that makes me more comfortable
w
Would just caution that the "where is the working directory" and "where are the files" and "what is stable across runs" bits are what tripped us up before
👍 1
a
yes, that is what caused me to say "oopsie" and backtrack
thanks
w
Zinc and mypy both assume(d) stable paths, and I don't know if the Bazel worker API requires a form of sandboxing that would leave files "in the working copy"
✍️ 1
a
i'm interested in the FUSE part even separate from any daemons because i think it could just make everything ridiculously fast and i think i've looked at the `brfs` crate literally once so that seems like it might be a generally useful workstream
i mentioned on the ticket i think path rewriting might also be something we could make ~generic across tools (just by regex stuff mostly), so i don't think that's too unreasonable either and would be less upfront work (i really liked the idea of making mypy recursive)
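A minimal sketch of the generic regex rewriting idea, assuming the tool's output is text and the sandbox root appears in it verbatim. The placeholder token and the function names are hypothetical.

```python
import re

# Hypothetical stable token that stands in for the per-run sandbox directory.
PLACEHOLDER = "__SANDBOX_ROOT__"


def scrub_paths(text: str, sandbox_root: str) -> str:
    """Replace the sandbox root with a placeholder so output is stable across runs."""
    root = sandbox_root.rstrip("/")
    # Only match the root as a whole path prefix: it must be followed by a
    # separator, whitespace, a quote, or the end of the text.
    pattern = re.compile(re.escape(root) + r"""(?=[/\s"']|$)""")
    return pattern.sub(PLACEHOLDER, text)


def restore_paths(text: str, sandbox_root: str) -> str:
    """Substitute the placeholder with the root of the current run."""
    return text.replace(PLACEHOLDER, sandbox_root.rstrip("/"))
```

Scrubbing after a run and restoring before the next one would let a cache written in one sandbox be reused in another, as long as the tool does not also checksum the rewritten files.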
i think those both probably touch the same general concepts too so will consider both
w
Yea. I wish that it were easier to experiment to see how fast/slow recursive mypy would be. But I don't know if there is a way to run the experiment much more easily than "actually doing all the path rewriting". Maybe by doing it only for hardcoded paths? Unknown.
a
i'll take a look at the files in the mypy cache and see if it seems reasonable. my hope is that if they're within the working dir we can just scan for the string of the cwd relatively unambiguously. my hope
hmmmmmmmmmmm there appears to be exactly one absolute path (or path at all) in the mypy cache output, and it's located in the single top-level key `path` in every single json file (i believe there are only json files). this could be rewritten with some json parser, although i suspect more quickly with regex
e.g.
```
<.mypy_cache/3.5/uuid.meta.json jq '.' | g '/Users'
(standard input):62:  "path": "/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/2and3/uuid.pyi",
```
i would need to think more about the pipeline that's needed here but this seems like a layup if anything. will delve more
and there are only json files (and one gitignore):
```
> find .mypy_cache -type f | sed -re 's#.*(\.[^\.]+)#\1#g' | sort -u
.gitignore
.json
```
it actually appears that all of the paths are relative except the ones from the stdlib
```
> find .mypy_cache -type f | parallel "echo -n '{}:' && jq -r '.path' <{}" | head -n3
.mypy_cache/3.5/test_resolver.meta.json:tests/test_resolver.py
.mypy_cache/3.5/atexit.meta.json:/Users/dmcclanahan/tools/pex/.tox/typecheck/lib/python3.8/site-packages/mypy/typeshed/stdlib/3/atexit.pyi
.mypy_cache/3.5/pex/testing.data.json:pex/testing.py
```
i'll dump this in the ticket
i think the absolute paths can be turned into relative ones if we materialize typeshed types into the process execution dir. there's another symlink optimization we might consider for that (noted in the ticket)
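The rewrite of the top-level `path` key could be sketched with a JSON parser as below. It assumes each cache file is a JSON object and that re-serializing it is acceptable to mypy, which has not been verified here. The function name is hypothetical.

```python
import json
import os
from pathlib import Path
from typing import List


def relativize_cache_paths(cache_dir: Path, root: Path) -> List[str]:
    """Make absolute `path` values under `root` relative, in place.

    Returns the absolute paths that were left alone because they point
    outside `root` (for example, typeshed stubs in a virtualenv).
    """
    leftover: List[str] = []
    for json_file in sorted(cache_dir.rglob("*.json")):
        data = json.loads(json_file.read_text())
        path = data.get("path") if isinstance(data, dict) else None
        if not isinstance(path, str) or not os.path.isabs(path):
            continue  # already relative, or no `path` key at all
        try:
            rel = Path(path).relative_to(root)
        except ValueError:
            leftover.append(path)  # outside the root; needs another strategy
            continue
        data["path"] = rel.as_posix()
        json_file.write_text(json.dumps(data))
    return leftover
```

The returned list is the set of paths that would still need handling, such as by materializing typeshed into the process execution dir.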
w
Interesting. Yea, dumping some info about that would be good. I thought I had put more info in the doc, but apparently not
There are timestamps in there as well, iirc
a
ah!!!!!
thank you
yes, that's correct. the `mtime` key in the `*.meta.json`. will add that
there appear to be some duplicated fields as well (e.g. `data_mtime`), oh bother
and there's a `platform`, which is still fine i think
posted a gist and a rundown of the unstable fields: https://github.com/pantsbuild/pants/issues/10864#issuecomment-699698439
i think someone should be able to pick it up with that info
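A small sketch of comparing two `*.meta.json` payloads while ignoring the timestamp fields named above. It assumes `mtime` and `data_mtime` are top-level keys. The full list of unstable fields is in the linked issue comment, so `UNSTABLE_KEYS` here is only a partial, illustrative set, and the helper names are hypothetical.

```python
import json
from typing import Any, Dict

# Only the timestamp keys mentioned in this thread; the real list may be longer.
UNSTABLE_KEYS = ("mtime", "data_mtime")


def stable_view(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Drop the keys that are expected to differ between otherwise equal runs."""
    return {k: v for k, v in meta.items() if k not in UNSTABLE_KEYS}


def same_modulo_unstable(a_json: str, b_json: str) -> bool:
    """True if two meta files agree on everything except the unstable keys."""
    return stable_view(json.loads(a_json)) == stable_view(json.loads(b_json))
```

`platform` is deliberately left in the comparison, since it was described as fine to keep.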
wrote a comment, i think the recursive method will work with the rewriting scheme after each run and there's enough info to just do that
going to step off for a bit before trying to implement it so @hundreds-father-404 can take a look at it
❤️ 1
and will look into `brfs`