# general
r
curious if anyone has any thoughts, ideas or prior art on pip-orchestrated composition of pypi distributions + monorepo-materialized distributions in off-the-shelf virtualenv environments? we seem to be butting our heads against this in the ML/DS space and I have a few ideas that I wanted to discuss w/ likeminded folks.
👋 1
w
hey!
you might have seen that @enough-analyst-54434 is polishing a `--venv` execution mode for pex… the initial use case is for lower-latency execution rather than for export, but export is something for the medium term to better support IDEs/notebooks
r
yeah, our primary use case in the DS/ML world is notebook-centric development. our users would love to be able to e.g.
```
!pip install some.pypi.dep==1.0 some.monorepo.target==2.0
```
and then mutate at will with seamless transitive dep compatibility (i.e. export).
right now we have the ability to rapidly bootstrap our monorepo into a notebook + some jupyter magics for build-and-load in a pex-aligned way w/ hermetic env scrubbing.
w
yea
r
but we have a faction of users that are pure pip-mode folks and it’d be nice to have a bridge layer from that mode to materialization of monorepo targets as exportable native python dists.
w
i.e., something where you can edit the sources without regenerating the export?
r
I was sort of thinking about a dynamic index server that would materialize non-transitive python dists that would map 1:1 from namespace<>monorepo address space. e.g. `!pip install monorepo.src.python.coolthing` -> with materialization unrolling inner source and 3rdparty deps.
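For concreteness, a minimal sketch of what such an index server could look like, assuming the PEP 503 “simple” API; the sha, port, and /dists/ download route are placeholders, and the name mapping assumes monorepo path components contain no dashes:

```python
# Minimal sketch of a dynamic PEP 503 "simple" index (illustrative only).
# pip normalizes "monorepo.src.python.coolthing" to
# "monorepo-src-python-coolthing", so we map dashes back to a path.
from http.server import BaseHTTPRequestHandler, HTTPServer


def address_for(normalized_name: str) -> str:
    # "monorepo-src-python-coolthing" -> "src/python/coolthing"
    assert normalized_name.startswith("monorepo-")
    return normalized_name[len("monorepo-"):].replace("-", "/")


class SimpleIndex(BaseHTTPRequestHandler):
    def do_GET(self):
        # PEP 503: GET /simple/<project>/ returns an HTML page of <a>
        # links to distribution files; here we'd trigger materialization
        # of the monorepo address and link the resulting sdist.
        parts = [p for p in self.path.split("/") if p]
        if len(parts) == 2 and parts[0] == "simple":
            project = parts[1]
            addr = address_for(project)  # the address to materialize
            sha = "abc1234"  # placeholder: resolved from git at request time
            filename = f"{project}-0+git.{sha}.tar.gz"
            body = f'<a href="/dists/{filename}">{filename}</a>'.encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)  # /dists/ file serving elided
            self.end_headers()


if __name__ == "__main__":
    HTTPServer(("", 8080), SimpleIndex).serve_forever()
```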
w
via symlinks, or via copying?
(a different way to ask the “do you need in place editing” question, i suppose)
r
copying. and emphasis on non-transitive, i.e. rather than bundling transitive src deps those would just express their src deps under the namespace mapping.
and for this mode, no in-place editing. there wouldn’t even be a copy of source around.
the idea would be sha-consistent materialized builds (e.g. local version identifiers w/ git sha)
and perhaps branch support
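A tiny illustration of that scheme, using the `packaging` library (the sha is made up):

```python
# Encode the git sha as a PEP 440 local version identifier so every
# materialized build is traceable to the commit it was built from.
from packaging.version import Version

sha = "abc1234"  # illustrative short git sha
v = Version(f"2.0+git.{sha}")
print(v.public, v.local)  # -> 2.0 git.abc1234
```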
w
oooh.
got it. so for the deployment side?
r
sort of. we’d term this e.g. the “experimentation environment”: you’d have e.g. a jupyter notebook you’re hacking away on, and be able to seamlessly pip install whatever.
w
so a client is editing in a workspace, and a server is materializing in a write-only copy?
r
you could imagine an off-the-shelf GCP AI Platform Notebook instance that you could just config a secondary pip package index against and then be able to seamlessly install monorepo-based packages from.
and the thing the user is editing is their .ipynb file (which might contain e.g. a TFX Pipeline that creates an ML Model) vs the source code of the installed libraries themselves.
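For illustration, the notebook-side setup might amount to nothing more than the following (the index URL is hypothetical):

```
!pip config set global.extra-index-url https://monorepo-index.internal/simple/
!pip install monorepo.src.python.coolthing
```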
w
iii see. so creating source distributions for everything and hosting them in a pip compatible server …?
r
yeah, except ideally without the ongoing storage costs of eager build-per-sha.
w
the source packages you created would need to look pretty content addressed and strange to make that work i think… but it could?
r
and/or without the per-project overhead of build-and-publish to binary index in CI + version bumps.
yeah. I haven’t quite worked out the version<>sha mapping layer.
in particular, representing “master” seems tricky due to immutable version guarantees the index has to provide.
but I do think it’d be quite nice to be able to deterministically pin down a bug to a git sha from a pure python distribution versioning scheme (e.g. `print(module.__file__)` -> indicates src sha).
w
if `pip` is part of the requirements, that’s one thing… if it’s not, i’d still maybe say that leaning in on using the remote execution CAS api as a glorified, persistent rsync would work fairly well (or just rsync)
r
it’s not clear if there’s something workable in an e.g. `0.0.0.devN+<sha>` versioning scheme where N monotonically increments + inexact version matching on the pip side (iirc, local version identifiers can be glob matched as e.g. `.*`).
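A quick check of the PEP 440 behavior this leans on, via the `packaging` library: a public `==` specifier ignores local version labels, and devN releases order monotonically:

```python
from packaging.specifiers import SpecifierSet
from packaging.version import Version

# a sha-suffixed build still satisfies the public pin...
assert Version("0.0.0.dev5+git.abc1234") in SpecifierSet("==0.0.0.dev5")
# ...and devN provides a monotonic ordering for "newer" builds
assert Version("0.0.0.dev5") < Version("0.0.0.dev6")
```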
yeah, `pip` here is fundamental.
`pip` needs to manage the remainder of the virtual env (composed of both pypi packages + monorepo sourced packages) in a consistent way.
which means e.g. subsequent rounds of `pip` invocations may change the env (mutable) and/or want to seek “newer” versions of the monorepo src packages to resolve conflicts (e.g. if a 3p version is bumped in the monorepo and wants to be realized in the venv).
that’s sort of the crux.
w
yea.
nailing down the mapping should be … relatively straightforward? for the synthetic source packages, you could likely lie and say that they have no dependencies
and then the workspace could have a root package containing all of the transitive deps, flattened and content addressed.
how you maintain that (or lazily construct it when someone hits the server) would be pretty implementation specific, but.
alternatively, give them actual intermediate deps… but i’m not sure it buys you much. would mostly increase resolve time i think.
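A sketch of that flattening, with a hypothetical in-memory dep graph and sha table standing in for real build-graph queries:

```python
# Walk the monorepo dep graph and emit one content-addressed pin per
# transitive dep; the synthetic packages themselves claim no deps.
graph = {
    "src/python/a": ["src/python/b"],
    "src/python/b": ["src/python/c"],
    "src/python/c": [],
}
shas = {"src/python/a": "aaa1111", "src/python/b": "bbb2222", "src/python/c": "ccc3333"}


def flattened_requires(root: str) -> list[str]:
    seen, stack, reqs = set(), [root], []
    while stack:
        addr = stack.pop()
        if addr in seen:
            continue
        seen.add(addr)
        name = "monorepo." + addr.replace("/", ".")
        reqs.append(f"{name}==0+git.{shas[addr]}")
        stack.extend(graph[addr])
    return sorted(reqs)


print(flattened_requires("src/python/a"))
# ['monorepo.src.python.a==0+git.aaa1111', ...]
```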
r
mapping, yes for sure. tho I think you would want explicit deps for the synthetic source packages (incl at the source deps level, recursively via the mapping)? basically, via the mapping they’re no longer synthetic - they’re realized modules.
you’d definitely want to express 3p deps this way too and just let pip handle those.
w
is there an advantage to the realized synthetic module having pip-level requirements…?
agreed re: 3p
r
much smaller package builds + payloads I think? w/ natural transitivity.
and no-op’ing
w
the counterpoint with transitive is that it magnifies your storage quite a bit (if you’re storing)
in `a->b->c`, if `c` changes, need to publish new copies of `b` and `c`, and pip needs to re-resolve all the intermediate deps. but you have “already executed” the resolve, so can make its job easier
r
true
I don’t imagine long term storage of the materialized dists. 30d caching at best - expect them to get invalidated frequently.
w
(it also means that `b` and `c` need to be refetched into the workspace, even if they only changed transitively)
r
yeah, our notebooks run in the DC tho, w/ lan-speed fetches.
but in practice for a widely sharded target I could imagine slower resolves
cold perf may suffer
prob a lever to experiment with.
w
for even tiny resolves, pip is … not fast
r
yeah, esp when you consider transitive inline pants v1 build costs 😉
w
yea, could be a lever… batch size or something.
r
maybe you could tag a python_library w/ a control flag `tags = {'pip.transitive'}` and then point to e.g. “environment” target addresses for single-dist synthesis and lifecycle
tho that might not compose well w/ adhoc uses due to shadowing
w
depends whether push or pull in the server, maybe
r
definitely curious on others’ perspectives here, esp @enough-analyst-54434 re: the topic of pip driven modes of pex interop.
e
There are way too many words and acronyms above for me to grok / comment coherently .... but can the monorepo stuff be handled via `pip install -e` where the monorepo supplies an appropriate setup.py?
r
I think that solution is perceived as non-scalable/too much boilerplate. have to keep setup.py declared deps in lockstep w/ monorepo etc.
w
the setup.py would be synthetic, i think? but @enough-analyst-54434: i think he’s essentially suggesting lots of synthetic whls with/without setup.py
e
That sounds like presuming the details of the solution. In v2 at least the setup.py could be autogenerated with ~0 boilerplate.
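Presumably something like the following, with every field derived from build-graph metadata; all names and versions here are made up:

```python
# Autogenerated, zero-boilerplate setup.py for one monorepo target.
from setuptools import setup

setup(
    name="monorepo.src.python.coolthing",  # from the target address
    version="0+git.abc1234",               # from the git sha
    packages=["coolthing"],                # from the target's sources
    install_requires=[
        "monorepo.src.python.util==0+git.abc1234",  # mapped source deps
        "requests>=2.0",                            # passthrough 3rdparty deps
    ],
)
```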
r
ah, so `pip install -e ./dist/xxx`?
the mode we’re trying to cater to is e.g. “vanilla jupyter notebook on GCP” having trivial reach to the monorepo
(without a copy of the monorepo, or access to pants, etc)
but at that point pants could just produce .zip files (sdists) that could be pip installed?
I think there’s a service component + pants interface component here, so that would definitely work for the latter sure - tho going the extra step of converting to wheel seems easy too.
a generalized service component could shim the build system layer and make that pluggable - this could do pants v1, v2, bazel, gradle, whatever
e
"“vanilla jupyter notebook on GCP” having trivial reach to the monorepo"
You'll have to excuse me, but this is where you lose me. Maybe you could write down some command lines a rando like me could run that give the current crappy solution and then a description of what you'd like to happen? I have no clue what a GC or any of the other stuff is.
r
https://colab.research.google.com/#create=true ->
`!pip install pex` <cmd-enter>
`import pex; print(pex.__version__)` <cmd-enter>
then the goal state is:
`!pip install monorepo.src.python.a.b`
with similar results for a mono-repo sourced `python_library` vs pypi package.
e
OK - ty - much better.
r
(and then being able to `!pip install` additional 3p/monorepo libraries iteratively and have them stay consistent)
e
Yeah - so an index server that could translate `monorepo.src.python...` or else a daemon that continually emitted / removed synthetic setup.py in a monorepo src tree seen by the notebook allowing `pip install -e monorepo/src/python/...` seem like the two obvious ways.
The index server begs the question of what versions mean for monorepo.src... - particularly mixed ones. The daemon that emits setup.py in a src tree doesn’t allow that.
r
yep!
e
K - so what's the question then?
This all just seems like fact so far
w
content addressed is the conclusion that we came to above.
e
I don't understand at all. That says nothing about the choice of allowing mixed versions vs disallowing, etc.
I still don't know what the question is. Is the question - do we allow mixed monorepo versions? Or is the question about technical feasibility or speed or what?
w
my understanding is that we’re discussing feasibility. i think the assumption is that versions can be a solved problem.
e
Ah, ok.
w
(but i’ll let kris answer, sorry.)
r
mixed version control + the mapping layer from python package version to git sha (i.e. how do you describe “master/latest” for unspecified versions or upgrades) is probably the foremost unknown.
e
Yeah - I don't know. Is there actually a reason to allow mixed monorepo versions? That goes against monorepo source and just makes everything harder. Without it this all gets a lot easier conceptually. You just need a mechanism to set the version of the monorepo as a whole.
r
also granularity - does each address mapping provide a transitive dist or an intransitive dist w/ transitivity specified as requires? (e.g. `python_library(…, deps=['src/python/x/y']…)` exports as `requires=['monorepo.src.python.x.y==X']`)
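In the intransitive case the mapping is mechanical; a hypothetical sketch:

```python
def requirement_for(address: str, sha: str) -> str:
    # 'src/python/x/y' -> 'monorepo.src.python.x.y==0+git.<sha>'
    return "monorepo." + address.replace("/", ".") + f"==0+git.{sha}"


assert (requirement_for("src/python/x/y", "abc1234")
        == "monorepo.src.python.x.y==0+git.abc1234")
```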
e
So .. Is there actually a reason to allow mixed monorepo versions?
r
for correctness, no reason to mix versions; for speed, maybe
I guess its a matter of target fingerprint over git sha at that point
e
You guys are way too down in the details. The git sha in my mind just means for whatever dists you want to install from the monorepo, you get dists that come from the same sha, not that you get the whole monorepo at that sha.
Is that the spirit?
w
yea, that sounds like it.
r
yep
w
depending on whether the server is populated eagerly/push or lazily/pull, a mapping from “git branch or sha” to “content address” might be 1) optional, 2) only used to get the “root” of the thing you want to install.
☝️ 1
e
OK. I'm still lost on what the question is. Clearly this is feasible. It sounds like you want to start getting into perf details of how to actually implement.
Is that right?
w
if it’s eagerly/push, then you could imagine a user running a command to push to the server, and the tool reporting a content address to use to fetch
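A sketch of the content-address piece of that push flow, assuming it is simply a digest of the built sdist:

```python
import hashlib


def content_address(path: str) -> str:
    # sha256 of the distribution file, streamed in chunks
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. content_address("dist/monorepo.src.python.coolthing-0+git.abc1234.tar.gz")
```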
r
mainly a sanity check, I suppose
good idea/bad idea vs alternative approaches etc.
e
Ah - ok. Sounds like a great idea to me! A tool to expose dists from a monorepo in a lightweight way so you can interact with tools like pip makes a ton of sense.
👍 1
Since now an IDE can use it too, etc. For example Jetbrains knows how to install missing dists which it does via pip / venv. That stuff would just work, etc.
r
hmmmm yes! totally.
I think this extrapolates to language servers too
e
Ok. These conversations are always consternating. The details were a wall of text that seemed to have little to do with this conclusion. Thanks for bearing with me.
coke 1
r
and yeah, now I get the “eagerly/push or lazily/pull” thought much more clearly re: index building.
k, thanks for all the input. I think I’m going to motivate a proposal for this and attempt to fund through one of my teams - do folks see merit in (co-)building something like this in the open?
w
that question might depend a bit on the implementation details i think… how generic it is, the push/pull aspect, etc?
e
Are you trying to do this generically or with a specific build tool powering it? IE: bazel -> bazel-dist-server -> pip or the more ambitious bazel -> dist-server-protocol -> pip ?
r
we’re unfortunately stuck on pants v1 (and then bazel) but think we can likely generalize build system interfaces there accordingly - the q is whether generalizing is worthwhile to other applications of this system.
e
I think I owe Stu a Coke.
coke 1
I personally would not generalize until its running.
That said - I'd love Pants v2 to support this at some point.
r
yeah, I think this would be a forward thinking interstitial build system capability for any ML+python shops - hence it may provide value to Toolchain to collab from day 1 for demoability - or we can run with it and collab later.
it remains quite unfortunate to me that we’re wedged into pants pre-v2 but that’s life
I’ll propose an internal prototype/PoC and we can go from there.
👍 1