r

rough-minister-58256

01/05/2021, 9:53 PM
curious if anyone has any thoughts, ideas or prior art on pip-orchestrated composition of pypi distributions + monorepo-materialized distributions in off-the-shelf virtualenv environments? we seem to be butting our heads against this in the ML/DS space and I have a few ideas that I wanted to discuss w/ likeminded folks.
👋 1
w

witty-crayon-22786

01/05/2021, 9:59 PM
hey!
you might have seen that @enough-analyst-54434 is polishing a `--venv` execution mode for pex… the initial usecase is for lower latency execution rather than for export, but export is something for the medium term to better support IDEs/notebooks
r

rough-minister-58256

01/05/2021, 10:04 PM
yeah, our primary use case in the DS/ML world is notebook-centric development. our users would love to be able to e.g.
!pip install some.pypi.dep==1.0 some.monorepo.target==2.0
and then mutate at will with seamless transitive dep compatibility (i.e. export).
right now we have the ability to rapidly bootstrap our monorepo into a notebook + some jupyter magics for build-and-load in a pex-aligned way w/ hermetic env scrubbing.
w

witty-crayon-22786

01/05/2021, 10:05 PM
yea
r

rough-minister-58256

01/05/2021, 10:06 PM
but we have a faction of users that are pure pip-mode folks and it’d be nice to have a bridge layer between that mode -> materialization of monorepo targets as exportable native python dists.
w

witty-crayon-22786

01/05/2021, 10:07 PM
i.e., something where you can edit the sources without regenerating the export?
r

rough-minister-58256

01/05/2021, 10:07 PM
I was sort of thinking about a dynamic index server that would materialize non-transitive python dists that would map 1:1 from namespace<>monorepo address space. e.g.
!pip install monorepo.src.python.coolthing
-> with materialization unrolling inner source and 3rdparty deps.
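A minimal sketch of the namespace<>address mapping such an index server would need. One detail worth noting: PEP 503 specifies that project names are normalized (runs of `.`, `-`, `_` collapse to `-`, lowercased) in index URLs, so the server has to map the normalized name back to a monorepo address and guard against collisions. The `monorepo` prefix, the address list, and the function names `project_name` and `build_lookup` are hypothetical.

```python
import re


def normalize(name: str) -> str:
    """PEP 503 project-name normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_name(address: str, prefix: str = "monorepo") -> str:
    """Map an address like 'src/python/coolthing' to a dotted project name.

    The 'monorepo' prefix is an assumption made for illustration.
    """
    return ".".join([prefix, *address.strip("/").split("/")])


def build_lookup(addresses):
    """Index normalized project names back to addresses, rejecting collisions."""
    lookup = {}
    for address in addresses:
        key = normalize(project_name(address))
        if key in lookup and lookup[key] != address:
            raise ValueError(
                f"{address!r} and {lookup[key]!r} both normalize to {key!r}"
            )
        lookup[key] = address
    return lookup


# Hypothetical addresses; a real server would ask the build tool for these.
lookup = build_lookup(["src/python/coolthing", "src/python/x/y"])
print(lookup[normalize("monorepo.src.python.coolthing")])
```

The collision check matters because addresses such as `src/python/a_b` and `src/python/a/b` would normalize to the same project name.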
w

witty-crayon-22786

01/05/2021, 10:08 PM
via symlinks, or via copying?
(a different way to ask the “do you need in place editing” question, i suppose)
r

rough-minister-58256

01/05/2021, 10:09 PM
copying. and emphasis on non-transitive, i.e. rather than bundling transitive src deps those would just express their src deps under the namespace mapping.
and for this mode, no in-place editing. there wouldn’t even be a copy of source around.
the idea would be sha-consistent materialized builds (e.g. local version identifiers w/ git sha)
and perhaps branch support
w

witty-crayon-22786

01/05/2021, 10:11 PM
oooh.
got it. so for the deployment side?
r

rough-minister-58256

01/05/2021, 10:12 PM
sort of. we’d term this e.g. the “experimentation environment” where you’d have e.g. a jupyter notebook you’re hacking away on and then being able to seamlessly pip install whatever.
w

witty-crayon-22786

01/05/2021, 10:13 PM
so a client is editing in a workspace, and a server is materializing in a write-only copy?
r

rough-minister-58256

01/05/2021, 10:13 PM
you could imagine an off-the-shelf GCP AI Platform Notebook instance that you could just config a secondary pip package index against and then be able to seamlessly install monorepo-based packages from.
and the thing the user is editing is their .ipynb file (which might contain e.g. a TFX Pipeline that creates an ML Model) vs the source code of the installed libraries themselves.
w

witty-crayon-22786

01/05/2021, 10:15 PM
iii see. so creating source distributions for everything and hosting them in a pip compatible server …?
r

rough-minister-58256

01/05/2021, 10:15 PM
yeah, except ideally without the ongoing storage costs of eager build-per-sha.
w

witty-crayon-22786

01/05/2021, 10:16 PM
the source packages you created would need to look pretty content addressed and strange to make that work i think… but it could?
r

rough-minister-58256

01/05/2021, 10:17 PM
and/or without the per-project overhead of build-and-publish to binary index in CI + version bumps.
yeah. I haven’t quite worked out the version<>sha mapping layer.
in particular, representing “master” seems tricky due to immutable version guarantees the index has to provide.
but I do think it’d be quite nice to be able to deterministically pin down a bug to a git sha from a pure python distribution versioning scheme (e.g. `print(module.__file__)` -> indicates src sha).
w

witty-crayon-22786

01/05/2021, 10:22 PM
if `pip` is part of the requirements, that’s one thing… if it’s not, i’d still maybe say that leaning in on using the remote execution CAS api as a glorified, persistent rsync would work fairly well (or just rsync)
r

rough-minister-58256

01/05/2021, 10:22 PM
it's not clear if there’s something workable in an e.g. `0.0.0.devN#<sha>` versioning scheme where N monotonically increments + inexact version matching on the pip side (iirc, local version identifiers can be glob matched as e.g. `.*`).
yeah, `pip` here is fundamental.
`pip` needs to manage the remainder of the virtual env (composed of both pypi packages + monorepo sourced packages) in a consistent way.
which means e.g. subsequent rounds of `pip` invocations may change the env (mutable) and/or want to seek “newer” versions of the monorepo src packages to resolve conflicts (e.g. if a 3p version is bumped in the monorepo and wants to be realized in the venv).
that's sort of the crux.
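A rough sketch of the `0.0.0.devN#<sha>` idea, with two caveats. PEP 440 separates the local version label with `+` rather than `#`, and a specifier that carries no local label (e.g. `==0.0.0.dev1042`) ignores the local label of candidate versions when matching. The helper names and the choice of N (a commit count on the main branch, e.g. from `git rev-list --count HEAD`) are assumptions, not something settled in this thread.

```python
import re

# PEP 440 local version labels: ASCII letters and digits, separated by periods.
_LOCAL_LABEL = re.compile(r"^[a-z0-9]+(\.[a-z0-9]+)*$")


def dev_version(commit_count: int, sha: str, base: str = "0.0.0") -> str:
    """Build a PEP 440 dev version that carries the git sha as a local label.

    commit_count is assumed to increase monotonically on the main branch.
    """
    local = f"g{sha[:12].lower()}"
    if commit_count < 0 or not _LOCAL_LABEL.match(local):
        raise ValueError(f"cannot build a version from {commit_count!r}, {sha!r}")
    return f"{base}.dev{commit_count}+{local}"


def public_part(version: str) -> str:
    """Strip the local label, leaving the part a plain `==` pin would match."""
    return version.split("+", 1)[0]


# Hypothetical commit count and sha.
v = dev_version(1042, "3f9c2ab77d10e5b4c6a1")
print(v, public_part(v))
```

Because the sha lives only in the local label, two builds from different shas must not share the same N, otherwise a plain `==` pin becomes ambiguous.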
w

witty-crayon-22786

01/05/2021, 10:31 PM
yea.
nailing down the mapping should be … relatively straightforward? for the synthetic source packages, you could likely lie and say that they have no dependencies
and then the workspace could have a root package containing all of the transitive deps, flattened and content addressed.
how you maintain that (or lazily construct it when someone hits the server) would be pretty implementation specific, but.
alternatively, give them actual intermediate deps… but i’m not sure it buys you much. would mostly increase resolve time i think.
r

rough-minister-58256

01/05/2021, 10:36 PM
mapping, yes for sure. tho I think you would want explicit deps for the synthetic source packages (incl at the source deps level, recursively via the mapping)? basically, via the mapping they’re no longer synthetic - they’re realized modules.
you’d definitely want to express 3p deps this way too and just let pip handle those.
w

witty-crayon-22786

01/05/2021, 10:37 PM
is there an advantage to the realized synthetic module having pip-level requirements…?
agreed re: 3p
r

rough-minister-58256

01/05/2021, 10:38 PM
much smaller package builds + payloads I think? w/ natural transitivity.
and no-op’ing
w

witty-crayon-22786

01/05/2021, 10:39 PM
the counterpoint with transitive is that it magnifies your storage quite a bit (if you’re storing)
in `a->b->c`, if `c` changes, need to publish new copies of `b` and `c` and pip needs to re-solve/resolve all the intermediate deps. but you have “already executed” the resolve, so can make its job easier
r

rough-minister-58256

01/05/2021, 10:40 PM
true
I don’t imagine long term storage of the materialized dists. 30d caching at best - expect them to get invalidated frequently.
w

witty-crayon-22786

01/05/2021, 10:41 PM
(it also means that `b` and `c` need to be refetched into the workspace, even if they only changed transitively)
r

rough-minister-58256

01/05/2021, 10:41 PM
yeah, our notebooks run in the DC tho at lan speed fetches.
but in practice for a widely sharded target I could imagine slower resolves
cold perf may suffer
prob a lever to experiment with.
w

witty-crayon-22786

01/05/2021, 10:45 PM
for even tiny resolves, pip is … not fast
r

rough-minister-58256

01/05/2021, 10:46 PM
yeah, esp when you consider transitive inline pants v1 build costs 😉
w

witty-crayon-22786

01/05/2021, 10:46 PM
yea, could be a lever… batch size or something.
r

rough-minister-58256

01/05/2021, 10:47 PM
maybe you could tag a python_library w/ a control flag `tags = {'pip.transitive'}` and then point to e.g. “environment” target addresses for single-dist synthesis and lifecycle
tho that might not compose well w/ adhoc uses due to shadowing
w

witty-crayon-22786

01/05/2021, 10:49 PM
depends whether push or pull in the server, maybe
r

rough-minister-58256

01/05/2021, 11:26 PM
definitely curious on others' perspectives here, esp @enough-analyst-54434 re: the topic of pip driven modes of pex interop.
e

enough-analyst-54434

01/05/2021, 11:27 PM
There are way too many words and acronyms above for me to grok / comment coherently .... but can the monorepo stuff be handled via pip install -e where the monorepo supplies an appropriate setup.py ?
r

rough-minister-58256

01/05/2021, 11:28 PM
I think that solution is perceived as non-scalable/too much boilerplate. have to keep setup.py declared deps in lockstep w/ monorepo etc.
w

witty-crayon-22786

01/05/2021, 11:29 PM
the setup.py would be synthetic, i think? but @enough-analyst-54434: i think he’s essentially suggesting lots of synthetic whls with/without setup.py
e

enough-analyst-54434

01/05/2021, 11:29 PM
That sounds like presuming the details of the solution. In v2 at least the setup.py could be autogenerated with ~0 boilerplate.
r

rough-minister-58256

01/05/2021, 11:30 PM
ah, so `pip install -e ./dist/xxx`?
the mode we’re trying to cater to is e.g. “vanilla jupyter notebook on GCP” having trivial reach to the monorepo
(without a copy of the monorepo, or access to pants, etc)
but at that point pants could just produce .zip files (sdists) that could be pip installed?
I think there’s a service component + pants interface component here, so that would definitely work for the latter sure - tho going the extra step of converting to wheel seems easy too.
a generalized service component could shim the build system layer and make that pluggable - this could do pants v1, v2, bazel, gradle, whatever
e

enough-analyst-54434

01/05/2021, 11:37 PM
"“vanilla jupyter notebook on GCP” having trivial reach to the monorepo"
You'll have to excuse me, but this is where you lose me. Maybe you could write down some command lines a rando like me could run that give the current crappy solution and then a description of what you'd like to happen? I have no clue what a GC or any of the other stuff is.
r

rough-minister-58256

01/05/2021, 11:40 PM
https://colab.research.google.com/#create=true ->
!pip install pex
<cmd-enter>
import pex; print(pex.__version__)
<cmd-enter> then the goal state is:
!pip install monorepo.src.python.a.b
with similar results for a mono-repo sourced `python_library` vs pypi package.
e

enough-analyst-54434

01/05/2021, 11:41 PM
OK - ty - much better.
r

rough-minister-58256

01/05/2021, 11:46 PM
(and then being able to `!pip install` additional 3p/monorepo libraries iteratively and have them stay consistent)
e

enough-analyst-54434

01/05/2021, 11:47 PM
Yeah - so an index server that could translate `monorepo.src.python...` or else a daemon that continually emitted / removed synthetic setup.py in a monorepo src tree seen by the notebook allowing `pip install -e monorepo/src/python/...` seem like the two obvious ways.
The index server begs the question of what versions mean for monorepo.src... - particularly mixed ones. The daemon that emits setup.py in a src tree doesn't allow that.
r

rough-minister-58256

01/05/2021, 11:48 PM
yep!
e

enough-analyst-54434

01/05/2021, 11:48 PM
K - so what's the question then?
This all just seems like fact so far
w

witty-crayon-22786

01/05/2021, 11:49 PM
content addressed is the conclusion that we came to above.
e

enough-analyst-54434

01/05/2021, 11:49 PM
I don't understand at all. That says nothing about the choice of allowing mixed versions vs disallowing, etc.
I still don't know what the question is. Is the question - do we allow mixed monorepo versions? Or is the question about technical feasibility or speed or what?
w

witty-crayon-22786

01/05/2021, 11:51 PM
my understanding is that we’re discussing feasibility. i think the assumption is that versions can be a solved problem.
e

enough-analyst-54434

01/05/2021, 11:51 PM
Ah, ok.
w

witty-crayon-22786

01/05/2021, 11:51 PM
(but i’ll let kris answer, sorry.)
r

rough-minister-58256

01/05/2021, 11:51 PM
mixed version control + the mapping layer from python package version to git sha (i.e. how do you describe “master/latest” for unspecified versions or upgrades) is probably the foremost unknown.
e

enough-analyst-54434

01/05/2021, 11:53 PM
Yeah - I don't know. Is there actually a reason to allow mixed monorepo versions? That goes against monorepo source and just makes everything harder. Without it this all gets a lot easier conceptually. You just need a mechanism to set the version of the monorepo as a whole.
r

rough-minister-58256

01/05/2021, 11:54 PM
also granularity - does each address mapping provide a transitive dist or an intransitive dist w/ transitivity specified as requires (e.g. `python_library(…, deps=['src/python/x/y']…)` export as `requires=['monorepo.src.python.x.y==X']`)
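A small sketch of the intransitive option: each target's direct source deps are rewritten into pinned requirement strings under the namespace mapping, and 3rdparty requirements pass through for pip to resolve. The function names, the `monorepo` prefix, and the single shared version are assumptions made for illustration.

```python
def to_requirement(address: str, version: str, prefix: str = "monorepo") -> str:
    """Turn a source dep like 'src/python/x/y' into a pinned requirement."""
    name = ".".join([prefix, *address.strip("/").split("/")])
    return f"{name}=={version}"


def install_requires(source_deps, third_party, version):
    """Direct deps only: source deps pinned to one version, 3rdparty untouched."""
    pinned = [to_requirement(address, version) for address in sorted(source_deps)]
    return pinned + sorted(third_party)


# Hypothetical target with one source dep and one 3rdparty requirement.
print(install_requires(["src/python/x/y"], ["requests==2.25.1"], "0.0.0.dev7"))
```

Pinning every source dep to the same version is one way to keep a single install from mixing monorepo versions.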
e

enough-analyst-54434

01/05/2021, 11:55 PM
So .. Is there actually a reason to allow mixed monorepo versions?
r

rough-minister-58256

01/05/2021, 11:55 PM
for correctness, no, no reason to mix versions; for speed, maybe
I guess it's a matter of target fingerprint over git sha at that point
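To make "target fingerprint over git sha" concrete, here is a hedged sketch: a target's fingerprint hashes its own sources plus the fingerprints of its deps, so it changes only when something it actually depends on changes, unlike a repo-wide sha. The graph, the source bytes, and the function name are hypothetical, and the dep graph is assumed to be acyclic.

```python
import hashlib


def fingerprint(target, sources, deps, _cache=None):
    """Hash a target's name and sources together with its deps' fingerprints."""
    cache = {} if _cache is None else _cache
    if target in cache:
        return cache[target]
    digest = hashlib.sha256()
    digest.update(target.encode() + b"\0")
    digest.update(sources[target] + b"\0")
    for dep in sorted(deps.get(target, ())):
        digest.update(fingerprint(dep, sources, deps, cache).encode())
    cache[target] = digest.hexdigest()
    return cache[target]


# Hypothetical graph: a depends on b, b depends on c, d is unrelated.
deps = {"a": ["b"], "b": ["c"], "c": [], "d": []}
before = {"a": b"A", "b": b"B", "c": b"C", "d": b"D"}
after = dict(before, c=b"C2")  # only c's sources change

changed = sorted(
    t for t in deps if fingerprint(t, before, deps) != fingerprint(t, after, deps)
)
print(changed)
```

With this scheme an edit to `c` invalidates `a`, `b` and `c` but leaves `d` reusable, which is the storage and refetch trade-off discussed earlier in the thread.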
e

enough-analyst-54434

01/05/2021, 11:56 PM
You guys are way too down in the details. The git sha in my mind just means for whatever dists you want to install from the monorepo, you get dists that come from the same sha, not that you get the whole monorepo at that sha.
Is that the spirit?
w

witty-crayon-22786

01/05/2021, 11:57 PM
yea, that sounds like it.
r

rough-minister-58256

01/05/2021, 11:57 PM
yep
w

witty-crayon-22786

01/05/2021, 11:57 PM
depending on whether the server is populated eagerly/push or lazily/pull, a mapping from “git branch or sha” to “content address” might be 1) optional, 2) only used to get the “root” of the thing you want to install.
☝️ 1
e

enough-analyst-54434

01/05/2021, 11:57 PM
OK. I'm still lost on what the question is. Clearly this is feasible. It sounds like you want to start getting into perf details of how to actually implement.
Is that right?
w

witty-crayon-22786

01/05/2021, 11:58 PM
if it’s eagerly/push, then you could imagine a user running a command to push to the server, and the tool reporting a content address to use to fetch
r

rough-minister-58256

01/05/2021, 11:58 PM
mainly a sanity check, I suppose
good idea/bad idea vs alternative approaches etc.
e

enough-analyst-54434

01/05/2021, 11:59 PM
Ah - ok. Sounds like a great idea to me! A tool to expose dists from a monorepo in a lightweight way so you can interact with tools like pip makes a ton of sense.
👍 1
Since now an IDE can use it too, etc. For example Jetbrains knows how to install missing dists which it does via pip / venv. That stuff would just work, etc.
r

rough-minister-58256

01/06/2021, 12:00 AM
hmmmm yes! totally.
I think this extrapolates to language servers too
e

enough-analyst-54434

01/06/2021, 12:01 AM
Ok. These conversations always are consternating. The details were a wall of text that seem to have little to do with this conclusion. Thanks for bearing with me.
coke 1
r

rough-minister-58256

01/06/2021, 12:02 AM
and yeah, now I get the “eagerly/push or lazily/pull” thought much more clearly re: index building.
k, thanks for all the input. I think I’m going to motivate a proposal for this and attempt to fund through one of my teams - do folks see merit in (co-)building something like this in the open?
w

witty-crayon-22786

01/06/2021, 12:05 AM
that question might depend a bit on the implementation details i think… how generic it is, the push/pull aspect, etc?
e

enough-analyst-54434

01/06/2021, 12:06 AM
Are you trying to do this generically or with a specific build tool powering it? IE: bazel -> bazel-dist-server -> pip or the more ambitious bazel -> dist-server-protocol -> pip ?
r

rough-minister-58256

01/06/2021, 12:07 AM
we’re unfortunately stuck on pants v1 (and then bazel) but think we can likely generalize build system interfaces there accordingly - the q is whether generalizing is worthwhile to other applications of this system.
e

enough-analyst-54434

01/06/2021, 12:08 AM
I think I owe Stu a Coke.
coke 1
I personally would not generalize until it's running.
That said - I'd love Pants v2 to support this at some point.
r

rough-minister-58256

01/06/2021, 12:10 AM
yeah, I think this would be a forward thinking interstitial build system capability for any ML+python shops - hence it may provide value to Toolchain to collab from day 1 for demoability - or we can run with it and collab later.
it remains quite unfortunate to me that we’re wedged into pants pre-v2 but that’s life
I’ll propose an internal prototype/PoC and we can go from there.
👍 1