I’ve been testing out the upgrade from pants 2.2.3...
# general
h
I’ve been testing out the upgrade from pants 2.2.3 to 2.3.2, which brings in venv mode test execution, and I’m running into some failures. Specifically, it appears that some code using spacy.cli.link functionality causes spacy to write some symlinks inside its spacy/data directory. For non-venv stuff, this means it writes directly to a fingerprinted dir under `.cache/pex_root/installed_wheels/`. However, with venv stuff, it appears that it pulls that cached wheel to a venv dir, but the resulting symlinks are converted to empty directories in the spacy/data dir, and then the subsequent spacy.cli.link fails because it can’t overwrite a directory.
1. Is there a way to prevent a non-venv execution from writing those symlinks to the installed_wheels dir? E.g. the spacy package is pulled into a temp location from installed_wheels?
2. Is there a way to selectively disable venv test execution for specific targets? (It’s a less ideal situation, but I was thinking about excluding those specific targets from venv execution mode to work around this immediate problem.)
3. Any other ideas?
e
This is a bit hard to wrap my head around without an example. I'm trying to gin one up now, but if you've got more details on a repro, that would be a great shortcut too.
h
let me try to whip something up in my fork of the pants repo
e
I tried this - eliminating Pants from the picture to simplify a bit:
$ python -mpex "spacy[transformers,lookups]" pip setuptools wheel -o spacy.pex --venv -cspacy
Then:
$ ./spacy.pex download en_core_web_sm
...
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Then:
$ PEX_INTERPRETER=1 ./spacy.pex
Python 3.9.5 (default, May 24 2021, 12:50:35) 
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import spacy
>>> spacy.load('en_core_web_sm')
<spacy.lang.en.English object at 0x7f80a971ad00>
>>>
Clearly I'm not exercising the spacy code that creates the symlinks.
I'll try Pex 2.1.35 which is the version corresponding to Pants 2.3.2...
Ok, yeah - same result under Pex 2.1.35. @helpful-lunch-92084 an example we can share is great, but if that's too hard, then just some directory listings showing the structure with symlinks that works in non-venv mode and results in empty directories in venv mode would be good too.
h
ok
let me dig those up one sec
e
Ok, great. The exact version of spacy / relevant requirement strings you're using would be good too to make sure we're debugging the same thing.
h
yah i’m currently using pants 2.3.2 with the interpreter_constraint set on the target as 3.6.9 and spacy 2.1.8
here’s an example directory structure of the spacy wheel after running spacy link:
$ ls -l /Users/nate/.cache/pex_root/installed_wheels/8931db6716ebfa82fc6520460521223fd2366c25/spacy-2.1.8-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl/spacy/data/
total 0
-rw-r--r--  1 nate  staff    0 Jun 10 16:54 __init__.py
lrwxr-xr-x  1 nate  staff  154 Jun 10 17:04 de -> /Users/nate/.cache/pex_root/installed_wheels/7268368e25395cf6c42013c335a73814ed98aa6d/de_core_news_sm-2.1.0-py3-none-any.whl/de_core_news_sm
lrwxr-xr-x  1 nate  staff  168 Jun 10 17:04 en -> /Users/nate/.cache/pex_root/installed_wheels/83896cf122478301e33e164bba3da275221928b9/en_core_web_sm_textcat-1.0.0-py3-none-any.whl/en_core_web_sm_textcat
lrwxr-xr-x  1 nate  staff  154 Jun 10 17:04 es -> /Users/nate/.cache/pex_root/installed_wheels/334e0324356daf2d8a2723972d443dba75438fa5/es_core_news_sm-2.1.0-py3-none-any.whl/es_core_news_sm
lrwxr-xr-x  1 nate  staff  154 Jun 10 17:04 fr -> /Users/nate/.cache/pex_root/installed_wheels/6be69bb4dd9dc9f15f408a93eeec71305c30992e/fr_core_news_sm-2.1.0-py3-none-any.whl/fr_core_news_sm
lrwxr-xr-x  1 nate  staff  154 Jun 10 17:04 it -> /Users/nate/.cache/pex_root/installed_wheels/22a7ffc29f77c01df0e180b684656ff053cc348a/it_core_news_sm-2.1.0-py3-none-any.whl/it_core_news_sm
lrwxr-xr-x  1 nate  staff  154 Jun 10 17:04 nl -> /Users/nate/.cache/pex_root/installed_wheels/ae15833bb235f37d29cfa58b4def63a25aa41e38/nl_core_news_sm-2.1.0-py3-none-any.whl/nl_core_news_sm
lrwxr-xr-x  1 nate  staff  154 Jun 10 17:04 pt -> /Users/nate/.cache/pex_root/installed_wheels/87b055365ff31407a833da909478cfe8c3d584cd/pt_core_news_sm-2.1.0-py3-none-any.whl/pt_core_news_sm
lrwxr-xr-x  1 nate  staff  152 Jun 10 17:04 xx -> /Users/nate/.cache/pex_root/installed_wheels/0c6b6d94cf903727069b8dfcbc23f76d1a89a772/xx_ent_wiki_sm-2.1.0-py3-none-any.whl/xx_ent_wiki_sm
and here’s the venv dir after changing the target’s execution_mode to venv:
$ ls -l /Users/nate/.cache/pex_root/venvs/d8e45253502f755be2885b7881abc87f86046363/93d314ee448ec5f1466875cd37c0ab55bda4655a/lib/python3.6/site-packages/spacy/data/
total 0
-rw-r--r--  2 nate  staff   0 Jun 10 16:54 __init__.py
drwxr-xr-x  2 nate  staff  64 Jun 10 17:06 de
drwxr-xr-x  2 nate  staff  64 Jun 10 17:06 en
drwxr-xr-x  2 nate  staff  64 Jun 10 17:06 es
drwxr-xr-x  2 nate  staff  64 Jun 10 17:06 fr
drwxr-xr-x  2 nate  staff  64 Jun 10 17:06 it
drwxr-xr-x  2 nate  staff  64 Jun 10 17:06 nl
drwxr-xr-x  2 nate  staff  64 Jun 10 17:06 pt
drwxr-xr-x  2 nate  staff  64 Jun 10 17:06 xx
e
... after running spacy link:
Ok, the spacy 3.0.6 CLI has no `spacy link`. I'm guessing spacy 2.1.8 does? About to find out.
h
we’re executing it in code via spacy.cli.link
e
Can you give me something super explicit? What line of code or what command line?
h
yah one sec
spacy.cli.link("es_core_news_sm", "es", force=True)
e
ty
h
now `es_core_news_sm` is a package on our sys.path, not sure if spacy download does the same thing (we happened to bundle that into a wheel a while back when pex was zipapp only)
e
It does look like `download` just uses pip to download a dist and place it in the venv site-packages.
h
ah cool
e
Hrm. I do not repro. Here's my setup:
1. Prepare the data (Italian was the closest I could find for 2.1.0):
$ curl -sSL https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-2.1.0/it_core_news_sm-2.1.0.tar.gz -O
$ pip wheel --no-deps it_core_news_sm-2.1.0.tar.gz
Processing ./it_core_news_sm-2.1.0.tar.gz
  File was already downloaded /home/jsirois/Downloads/it_core_news_sm-2.1.0.tar.gz
Building wheels for collected packages: it-core-news-sm
  Building wheel for it-core-news-sm (setup.py) ... done
  Created wheel for it-core-news-sm: filename=it_core_news_sm-2.1.0-py3-none-any.whl size=11123295 sha256=4b6007baa80a7ba020c1a09a169192e05bc97383d0922e080d13140bf9f4029f
  Stored in directory: /home/jsirois/.cache/pip/wheels/1b/b5/49/5970302bebb331f699e409a6eb0c3fb8aa8ceebbc30be5cb4e
Successfully built it-core-news-sm
2. Build the --venv PEX:
jsirois@gill ~/dev/pantsbuild/pex ((v2.1.35)) $ python3.6 -mpex "spacy==2.1.8" setuptools ~/Downloads/it_core_news_sm-2.1.0-py3-none-any.whl -o spacy-2.1.8.pex-2.1.35.venv --venv
3. Run it:
$ PEX_ROOT=/tmp/spacy-2.1.8.pex-2.1.35.venv ./spacy-2.1.8.pex-2.1.35.venv
Python 3.6.13 (default, Feb 16 2021, 20:57:41) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import spacy.cli.link
>>> spacy.cli.link("it_core_news_sm", "it", force=True)
✔ Linking successful
/tmp/spacy-2.1.8.pex-2.1.35.venv/venvs/short/86d8d0fa/lib/python3.6/site-packages/it_core_news_sm
-->
/tmp/spacy-2.1.8.pex-2.1.35.venv/venvs/short/86d8d0fa/lib/python3.6/site-packages/spacy/data/it
You can now load the model via spacy.load('it')
>>> spacy.load('it')
<spacy.lang.it.Italian object at 0x7f0bb92beac8>
>>> 
now exiting InteractiveConsole...
4. Check it:
jsirois@gill ~/dev/pantsbuild/pex ((v2.1.35)) $ ls -l /tmp/spacy-2.1.8.pex-2.1.35.venv/venvs/short/86d8d0fa/lib/python3.6/site-packages/spacy/data/
total 0
-rw-r--r-- 2 jsirois jsirois  0 Jun 10 14:51 __init__.py
lrwxrwxrwx 1 jsirois jsirois 97 Jun 10 14:51 it -> /tmp/spacy-2.1.8.pex-2.1.35.venv/venvs/short/86d8d0fa/lib/python3.6/site-packages/it_core_news_sm
Can you think of anything I'm doing there @helpful-lunch-92084 that's not faithful to your situation?
h
yah so the one thing we do before this is we run it as a zipapp, do the spacy linking, and then re-run it as a venv and retry the linking. This would simulate doing a pants repl on a python_library target and then running a pants test that uses that target
e
Aha - ok. That does sound more likely to be weird since the zipapp wheel is mutating itself which is definitely unexpected.
h
yah
e
Yeah, so I get: ZIPAPP:
$ PEX_ROOT=/tmp/spacy-2.1.8.pex-2.1.35 ./spacy-2.1.8.pex-2.1.35
Python 3.6.13 (default, Feb 16 2021, 20:57:41) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import spacy.cli.link
>>> spacy.cli.link("it_core_news_sm", "it", force=True)
✔ Linking successful
/tmp/spacy-2.1.8.pex-2.1.35/installed_wheels/106e0fcb67b8a740d3fa416bbc6cdc09b36e5c09/it_core_news_sm-2.1.0-py3-none-any.whl/it_core_news_sm
-->
/tmp/spacy-2.1.8.pex-2.1.35/installed_wheels/2aeeab03e5348116d101c8d9a67a30e2671dfe59/spacy-2.1.8-cp36-cp36m-manylinux1_x86_64.whl/spacy/data/it
You can now load the model via spacy.load('it')
>>> spacy.load('it')
<spacy.lang.it.Italian object at 0x7f86efa65518>
>>> 
now exiting InteractiveConsole...
Then VENV:
$ PEX_ROOT=/tmp/spacy-2.1.8.pex-2.1.35 ./spacy-2.1.8.pex-2.1.35.venv
Python 3.6.13 (default, Feb 16 2021, 20:57:41) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import spacy.cli.link
>>> spacy.cli.link("it_core_news_sm", "it", force=True)

✘ Can't overwrite symlink 'it'
This can happen if your data directory contains a directory or file of the same
name.
h
yep, that’s what i get
e
At a loss for the moment. Pex could be verifying hashes; ie. check that the wheel directory tree spacy-2.1.8-cp36-cp36m-manylinux1_x86_64.whl hashes to 2aeeab03e5348116d101c8d9a67a30e2671dfe59 before re-using it every time. That would slow all PEX operations down though by X amount... thinking on this.
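A rough Python sketch of what such a check could look like, assuming a simple walk-and-hash over the unpacked wheel directory. This is not Pex's actual fingerprinting scheme; `fingerprint_tree` and `mutation_is_detected` are made-up names for illustration. It records symlinks by their target, so a link added under spacy/data changes the result.

```python
import hashlib
import os
import tempfile


def fingerprint_tree(root: str) -> str:
    """Hash relative paths, entry kinds, link targets and file contents under root."""
    digest = hashlib.sha1()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(dirnames + filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            if os.path.islink(path):
                # Record the link target rather than following it.
                digest.update(b"L" + os.readlink(path).encode("utf-8"))
            elif os.path.isdir(path):
                digest.update(b"D")
            else:
                digest.update(b"F")
                with open(path, "rb") as fp:
                    for chunk in iter(lambda: fp.read(65536), b""):
                        digest.update(chunk)
    return digest.hexdigest()


def mutation_is_detected() -> bool:
    """Simulate a wheel dir gaining a symlink after it was fingerprinted."""
    with tempfile.TemporaryDirectory() as tmp:
        data = os.path.join(tmp, "spacy", "data")
        os.makedirs(data)
        open(os.path.join(data, "__init__.py"), "w").close()
        before = fingerprint_tree(tmp)
        # Hypothetical link target; it does not need to exist for this check.
        os.symlink("/somewhere/it_core_news_sm", os.path.join(data, "it"))
        after = fingerprint_tree(tmp)
        return before != after
```

Every file gets read on every check, which is where the "slow all PEX operations down" cost would come from.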
Stepping aside from a direct attack - the words `force=True` seem to imply that spacy should not complain like it does. I wonder if I'm reading that wrong or it's a bug fixed in later versions of spacy?
h
yah, i could also be convinced to just update our code to stop doing the linking inside the wheel like we are. i guess i flagged it because it seems like a more general problem
i think the force=True applies only to symlinks
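To make that concrete, here is a small Python model of the behavior as described in this thread. It is not spaCy's source; `link_model` and `demo_force_link` are hypothetical names. The assumption is that `force=True` only permits replacing an existing symlink, while a plain directory of the same name is always refused.

```python
import os
import tempfile


def link_model(target: str, link_path: str, force: bool = False) -> None:
    """Create link_path -> target; force only allows replacing an existing symlink."""
    if os.path.islink(link_path):
        if not force:
            raise FileExistsError(f"link {link_path!r} already exists")
        os.unlink(link_path)
    elif os.path.exists(link_path):
        # A real directory or file is never overwritten, even with force=True.
        raise FileExistsError(f"can't overwrite {link_path!r}: not a symlink")
    os.symlink(target, link_path)


def demo_force_link() -> list:
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        model = os.path.join(tmp, "it_core_news_sm")
        os.makedirs(model)
        link = os.path.join(tmp, "it")
        link_model(model, link)
        link_model(model, link, force=True)  # replacing a symlink works
        results.append(os.path.islink(link))
        # Now mimic the venv case: the symlink has become an empty directory.
        os.unlink(link)
        os.makedirs(link)
        try:
            link_model(model, link, force=True)
            results.append("linked")
        except FileExistsError:
            results.append("refused")
    return results
```

Under that assumption the second call in the venv is refused because `it` is now a directory, which matches the "Can't overwrite symlink 'it'" message above.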
e
Wheel mutation is a general problem - it's just that we've had 0 reports of it.
I think it is rare for wheels to mutate themselves.
h
in fact spacy 3.0 complains to me about symlinking saying it’s no longer supported. so we’d probably have to move the symlinking outside the wheel anyway
>>> spacy.cli.link("en_core_web_sm", "en")
⚠ As of spaCy v3.0, model symlinks are not supported anymore. You can
load trained pipeline packages using their full names or from a directory
path.
e
Ah, ok. Are you satisfied with this direction then? I'd be happy to stop thinking about Pex guarding against self-mutating wheels 🙂
h
lol, yah, i can move in that direction
e
Ok, great. Thanks for walking me through all that. I was slow to see the mechanism and the concrete example really helped.
h
sure thing, thanks for taking the time
e
I guess the simple path for Pex to support this edge case would be to treat it as one and force you to specify a PEX build-time flag marking the PEX file with --mutable-sys-path or somesuch. For those PEX files, and only those, the PEX would always re-extract its dependencies instead of using pre-extracted ones in ~/.pex/installed_wheels. Pants would then have to expose that knob on `pex_binary`.
And that would not help you at all, since the issue is in tests. The test binary .... yeah - sort of ugly. We'd need to support this knob on `python_tests` too.
h
yah, the one thing that kind of caught me up is that the venv mode seemed to copy that wheel but instead of preserving the symlinks it created bare directories
e
Yeah - the root of that is the fact it's based on a "Pex tool". When you install `pex` now you get `pex-tools` in addition to `pex`. You can also build a PEX with tools via `--include-tools`. Then run the resulting PEX file with `PEX_TOOLS=1 ./my.pex`.
One of those tools is `venv`, which creates a venv from your PEX file at the directory you specify. In that case, you really never want symlinks, since someone could then nuke the Pex cache and kill your venv out over here elsewhere. It's only for `--venv` execution mode, which runs the pex venv tool implicitly, placing the venv inside the Pex cache (~/.pex/venvs/...), that it would actually be OK to preserve symlinks.
Since then, nuking the Pex cache also nukes the implicitly created Pex `--venv`s.
As it stands, Pex always attempts hard links for files - and thus makes dirs - only falling back to copying when the location is cross-device.
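A minimal Python sketch of that mechanism, under the assumption of a plain `os.walk` traversal that recreates directories and hard-links files. This is not Pex's actual code; `link_tree` and `demo_link_tree` are illustrative names. It shows how a symlink to a directory in the source tree ends up as an empty real directory in the copy.

```python
import errno
import os
import shutil
import tempfile


def link_tree(src: str, dst: str) -> None:
    """Mirror src into dst: real directories for dirs, hard links for files."""
    for root, dirs, files in os.walk(src):  # followlinks=False is the default
        target_root = os.path.normpath(os.path.join(dst, os.path.relpath(root, src)))
        os.makedirs(target_root, exist_ok=True)
        # Symlinks to directories are listed in `dirs` but never descended into,
        # so they are recreated here as empty directories.
        for name in dirs:
            os.makedirs(os.path.join(target_root, name), exist_ok=True)
        for name in files:
            src_file = os.path.join(root, name)
            dst_file = os.path.join(target_root, name)
            try:
                os.link(src_file, dst_file)
            except OSError as e:
                if e.errno != errno.EXDEV:
                    raise
                shutil.copy2(src_file, dst_file)  # cross-device fallback


def demo_link_tree() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        model = os.path.join(tmp, "it_core_news_sm")
        os.makedirs(model)
        open(os.path.join(model, "meta.json"), "w").close()
        data = os.path.join(tmp, "wheel", "spacy", "data")
        os.makedirs(data)
        open(os.path.join(data, "__init__.py"), "w").close()
        os.symlink(model, os.path.join(data, "it"))  # the self-mutation
        venv = os.path.join(tmp, "venv")
        link_tree(os.path.join(tmp, "wheel"), venv)
        copied = os.path.join(venv, "spacy", "data", "it")
        return [os.path.islink(copied), os.path.isdir(copied), os.listdir(copied)]
```

In this sketch `demo_link_tree()` reports that `it` in the copy is a directory, not a symlink, and has no contents, which is the same shape as the venv listing earlier in the thread.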
I guess adding a `--symlink` mode for venv creation could solve this case ... but only if we made it the default / Pants always used it or we did the same ugly plumbing mentioned above out to `pex_binary` and `python_tests` targets.
h
ah gotcha that makes sense now. yah, in our case we’ll fix on our end. hopefully this helps someone else who might run into this same issue with spacy in the future