Hello! I want to download and save nltk stopwords...
# general
a
Hello! I want to download and save nltk stopwords but I'm not sure how to achieve this using pants. For more information, I have this BUILD file and I would like to make sure stopwords are also available for this particular project. I would like to do something like this or similar.
# src/python/projectA/BUILD

python_sources(
    dependencies=[
        "//:nltk",
    ]
)

python_distribution(
    name="wheel",
    dependencies=[":projectA"],
    provides=setup_py(
        name="projectA",
    ),
    wheel=True,
)
Can someone help point me to the right direction? Thanks so much.
h
Hello! Is that not working?
a
There is nothing wrong with the BUILD file itself, but I was wondering about the installation of stopwords. The stopwords need to be downloaded somewhere, like this:
import nltk
nltk.download("stopwords")
but I've noticed that with this sometimes the stopwords are not found, maybe due to incomplete download or another issue. Example of the error:
raise LookupError(resource_not_found)
E       LookupError: 
E       **********************************************************************
E         Resource stopwords not found.
E         Please use the NLTK Downloader to obtain the resource:
E       
E         >>> import nltk
E         >>> nltk.download('stopwords')
E         
E         For more information see: <https://www.nltk.org/data.html>
E       
E         Attempted to load corpora/stopwords
E       
E         Searched in:
E           - '/root/nltk_data'
E           - '/root/.cache/pants/named_caches/pex_root/venvs/s/7eec7f58/venv/nltk_data'
E           - '/root/.cache/pants/named_caches/pex_root/venvs/s/7eec7f58/venv/share/nltk_data'
E           - '/root/.cache/pants/named_caches/pex_root/venvs/s/7eec7f58/venv/lib/nltk_data'
E           - '/usr/share/nltk_data'
E           - '/usr/local/share/nltk_data'
E           - '/usr/lib/nltk_data'
E           - '/usr/local/lib/nltk_data'
I updated my original post with the type of solution I was wondering about, and whether it would be possible through Pants (or if you have any other suggestions, I could very well be overcomplicating this haha).
h
Where are you calling
import nltk
nltk.download("stopwords")
in your code?
Is that the thing that is sometimes failing with that error?
e
It looks like both on the download end (https://www.nltk.org/_modules/nltk/downloader.html#Downloader) and on the use end (https://www.nltk.org/api/nltk.data.html#nltk.data.find as well as nltk.data.retrieve and nltk.data.load) you can exercise more control than the Stack Overflow results allude to. I'd start by experimenting with taking complete control over download locations first.
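A minimal sketch of what taking control of the download location could look like. The helper names `has_stopwords` and `ensure_stopwords` and the choice of directory are hypothetical; `nltk.download(..., download_dir=...)` and the `nltk.data.path` search list are the NLTK pieces being relied on. This is one possible approach under those assumptions, not necessarily the one used in the project above.

```python
from pathlib import Path


def has_stopwords(data_dir: Path) -> bool:
    """Return True if the stopwords corpus appears to be present under data_dir."""
    corpora = Path(data_dir) / "corpora"
    # The downloader normally leaves both an unpacked directory and the zip archive.
    return (corpora / "stopwords").is_dir() or (corpora / "stopwords.zip").is_file()


def ensure_stopwords(data_dir: Path) -> bool:
    """Download stopwords into data_dir if missing and make NLTK search there."""
    import nltk  # imported lazily so has_stopwords() works without nltk installed

    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    if not has_stopwords(data_dir):
        # download_dir overrides NLTK's default download location.
        if not nltk.download("stopwords", download_dir=str(data_dir), quiet=True):
            return False
    if str(data_dir) not in nltk.data.path:
        # Search our directory first so it wins over any other copy.
        nltk.data.path.insert(0, str(data_dir))
    return has_stopwords(data_dir)
```

Calling `ensure_stopwords(Path("some/known/dir"))` once, before any `nltk.corpus.stopwords` access, keeps the lookup from depending on whichever of the default directories in the error message happens to exist.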
Is the end goal @ambitious-petabyte-59095 to include the downloaded nltk data in your python distribution so that when someone `pip install`s it later, they don't need to download that data themselves or else the code in the python distribution doesn't need to do it for them just-in-time before proceeding to use it?
👍 1
And - if so - is it a further goal to not download the data manually and check it in to your repo and then use a `files` or `resources` target to depend on it?
👍 1
a
Yes to both @enough-analyst-54434, that is exactly what I am looking for. I want to avoid having any install / download logic in other parts of the code, even though I do have the option of controlling download locations.
e
Yeah, so that Stack Overflow suggestion uses setuptools and setup_requires. In order to follow that example, you need a custom setup.py, and your existing `python_distribution` target uses an auto-generated one (by Pants). You'd need to use `generate_setup=False` (https://www.pantsbuild.org/docs/reference-python_distribution#codegenerate_setupcode) to start with and write up your own `setup.py`. If you go that route, there will be a fair bit of tinkering needed. I'd suggest eliminating Pants complicating factors by first writing a minimal `setup.py` that pretty much just creates a distribution that successfully contains the data files as well as one console script entrypoint that can prove the data included in the distribution is loadable at runtime. Only after you get that worked out would I return to integrating it with Pants.
👍 2
🙏 1
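The console script entrypoint suggested above could be sketched roughly like this. It assumes the distribution ships its data in an `nltk_data` directory next to the package's `__init__.py`; that layout and the names `bundled_data_dir`, `check_bundled_stopwords`, and `cli` are hypothetical, chosen only for illustration.

```python
import sys
from pathlib import Path


def bundled_data_dir(package_file: str) -> Path:
    """Directory where the distribution is assumed to ship its NLTK data."""
    return Path(package_file).resolve().parent / "nltk_data"


def check_bundled_stopwords(package_file: str) -> int:
    """Return 0 only if stopwords load from the bundled directory alone."""
    import nltk  # imported lazily; only needed when the check actually runs
    from nltk.corpus import stopwords

    data_dir = bundled_data_dir(package_file)
    # Restrict the search path to the bundled copy, so success cannot come
    # from ~/nltk_data or a system-wide install.
    nltk.data.path[:] = [str(data_dir)]
    try:
        words = stopwords.words("english")
    except LookupError as exc:
        print(f"stopwords not loadable from {data_dir}: {exc}", file=sys.stderr)
        return 1
    print(f"loaded {len(words)} English stopwords from {data_dir}")
    return 0


def cli() -> None:
    """Entry point body; __file__ is only looked up when this is called."""
    raise SystemExit(check_bundled_stopwords(__file__))
```

Installing the wheel into a fresh virtualenv and running the script registered for `cli` then gives a yes/no answer on whether the data files actually made it into the distribution.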
h
This would be how you'd integrate it with Pants: https://www.pantsbuild.org/docs/python-distributions
You'd have a custom `pyproject.toml` and `setup.py`
And the `pyproject.toml` would have
[build-system]
requires = ["setuptools==X", "nltk==Y"] # Fill in versions
Which is how you'd bring the `nltk` requirement in at setup.py runtime