# general
f
Hi all. We have added Pants to our GitLab CI/CD pipeline with three jobs that concurrently run `tailor`, `lint`, and `test` on the same runner. We’ve found that this intermittently causes one or more of the jobs to fail with the error message `ModuleNotFoundError: No module named 'pants'`. Re-running the failed job typically lets it complete successfully, so we suspect the failures are caused by running the jobs concurrently. We run them concurrently to shorten the pipeline, but perhaps it makes more sense to run these steps serially (i.e., to catch missing BUILD files before running the lint and test steps). Is there a better way we should be running Pants in CI? Thanks in advance!
e
What version of Pants are you using and can you provide ~full output from one of these failing jobs?
Pants should handle full parallelism or else it's a bug.
I'm guessing / hoping this error is during the pants bootstrap phase, which is the least likely to be hardened to parallelism since Pants is typically installed once serially.
On the higher-level question of how to set up CI: this does not answer your question, but I hope you've found https://www.pantsbuild.org/docs/using-pants-in-ci useful more generally.
f
We’re using 2.12. The full output is pretty brief:
$ ./pants --version
Traceback (most recent call last):
  File "/home/gitlab-runner/builds/s3DRJHKh/0/directory/subdirectory/.cache/pants/setup/bootstrap-Linux-x86_64/2.12.0_py38/bin/pants", line 7, in <module>
    from pants.bin.pants_loader import main
ModuleNotFoundError: No module named 'pants'
This is the output in the `tailor` job and you can see it didn’t even get to the `tailor` command. The `test` job also failed with a different error:
$ ./pants check test ::
Traceback (most recent call last):
  File "/home/gitlab-runner/builds/s3DRJHKh/1/directory/subdirectory/.cache/pants/setup/bootstrap-Linux-x86_64/2.12.0_py38/bin/pants", line 10, in <module>
    sys.exit(main())
  File "/home/gitlab-runner/builds/s3DRJHKh/1/directory/subdirectory/.cache/pants/setup/bootstrap-Linux-x86_64/2.12.0_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 115, in main
    PantsLoader.main()
  File "/home/gitlab-runner/builds/s3DRJHKh/1/directory/subdirectory/.cache/pants/setup/bootstrap-Linux-x86_64/2.12.0_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 111, in main
    cls.run_default_entrypoint()
  File "/home/gitlab-runner/builds/s3DRJHKh/1/directory/subdirectory/.cache/pants/setup/bootstrap-Linux-x86_64/2.12.0_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 93, in run_default_entrypoint
    exit_code = runner.run(start_time)
  File "/home/gitlab-runner/builds/s3DRJHKh/1/directory/subdirectory/.cache/pants/setup/bootstrap-Linux-x86_64/2.12.0_py38/lib/python3.8/site-packages/pants/bin/pants_runner.py", line 89, in run
    return remote_runner.run(start_time)
  File "/home/gitlab-runner/builds/s3DRJHKh/1/directory/subdirectory/.cache/pants/setup/bootstrap-Linux-x86_64/2.12.0_py38/lib/python3.8/site-packages/pants/bin/remote_pants_runner.py", line 117, in run
    return self._connect_and_execute(pantsd_handle, start_time)
  File "/home/gitlab-runner/builds/s3DRJHKh/1/directory/subdirectory/.cache/pants/setup/bootstrap-Linux-x86_64/2.12.0_py38/lib/python3.8/site-packages/pants/bin/remote_pants_runner.py", line 151, in _connect_and_execute
    return PyNailgunClient(port, executor).execute(command, args, modified_env)
native_engine.PantsdClientException: The pantsd process was killed during the run.
If this was not intentionally done by you, Pants may have been killed by the operating system due to memory overconsumption (i.e. OOM-killed). You can set the global option `--pantsd-max-memory-usage` to reduce Pantsd's memory consumption by retaining less in its in-memory cache (run `./pants help-advanced global`). You can also disable pantsd with the global option `--no-pantsd` to avoid persisting memory across Pants runs, although you will miss out on additional caching.
If neither of those help, please consider filing a GitHub issue or reaching out on Slack so that we can investigate the possible memory overconsumption (<https://www.pantsbuild.org/docs/getting-help>).
The third job (`lint`) completed successfully.
e
Thanks @flaky-artist-57016 - yeah, this is the Pants bootstrap venv: `.../.cache/pants/setup/...`; so, if multiple jobs use that same directory, there is likely a bootstrap race we're vulnerable to. Is it in fact the case that all 3 jobs see `/home/gitlab-runner/builds/s3DRJHKh/0/directory/subdirectory/.cache/pants/setup`? (Excuse my GitLab CI ignorance.)
f
Yes, it looks like all three see that directory. In `pants.ci.toml` we have set `local_store_dir` and `named_caches_dir` to `.cache/pants/lmdb_store` and `.cache/pants/named_caches` so they are within the git repository rather than in the default `$HOME/.cache` location. Could this be a problem?
e
No. I think the only problem is the bootstrap (`.cache/pants/setup`) - the lmdb_store and named_caches have been battle tested. Those get hammered concurrently even in a single Pants run, let alone by concurrent Pants runs. It is just the Pants install bootstrap that can't handle concurrency here. We implicitly assume the install is serial, and for you it is not: 3 jobs try to install Pants to the same directory in parallel and that - and that alone - is what sometimes fails.
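To make the race concrete, here is a minimal sketch of one way to force the bootstrap to run serially when several jobs really do share one filesystem. It is not a Pants feature: `with_lock` and the lock path are hypothetical names, and the approach assumes every job can see the same lock directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Pants): run a command while holding a
# directory-based lock, so only one caller at a time runs the command.
with_lock() {
  local lock_dir="$1"
  shift
  # Ensure the parent exists, otherwise the mkdir below could never succeed.
  mkdir -p "$(dirname "$lock_dir")"
  # mkdir is atomic: exactly one caller creates the directory, the rest wait.
  until mkdir "$lock_dir" 2>/dev/null; do
    sleep 1
  done
  local status=0
  "$@" || status=$?
  # Release the lock even when the command failed.
  rmdir "$lock_dir"
  return "$status"
}
```

A job could then start with something like `with_lock .cache/pants/setup.lock ./pants --version` before running its real goal. One caveat of this sketch: if a job is killed while holding the lock, the lock directory is left behind and has to be removed by hand.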
f
Ah I see.
e
Can you add a 4th job the other 3 depend on that just runs `./pants -V`?
f
Sure.
e
That would bootstrap Pants serially then run against the installed Pants in parallel downstream.
Great. To reiterate - this can only happen if all jobs share the exact same `.../.cache/pants/setup` directory.
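For debugging, the downstream jobs could check that a bootstrapped install is already present before they run. This is only a sketch: `pants_is_bootstrapped` is a hypothetical name, and the directory layout is inferred from the paths in the tracebacks above rather than from any documented Pants contract.

```shell
#!/usr/bin/env bash
# Hypothetical check (not a Pants feature): succeed only if the setup
# directory holds both the launcher and the installed pants package.
pants_is_bootstrapped() {
  local setup_dir="$1"
  local root loader
  # Layout inferred from the tracebacks:
  #   <setup_dir>/bootstrap-<platform>/<version>/bin/pants
  #   <setup_dir>/bootstrap-<platform>/<version>/lib/python*/site-packages/pants/...
  for root in "$setup_dir"/bootstrap-*/*; do
    [ -f "$root/bin/pants" ] || continue
    for loader in "$root"/lib/python*/site-packages/pants/bin/pants_loader.py; do
      if [ -f "$loader" ]; then
        echo "found: $root"
        return 0
      fi
    done
  done
  echo "no complete pants bootstrap under $setup_dir" >&2
  return 1
}
```

Calling `pants_is_bootstrapped .cache/pants/setup` at the top of each parallel job would show whether that job sees the install the bootstrap job produced. Checking for `pants_loader.py` as well as the launcher matters because the first traceback shows the launcher running while the `pants` module was missing.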
f
Understood. We are setting `PANTS_SETUP_CACHE: "$CI_PROJECT_DIR/.cache/pants/setup"` as a global variable for the pants jobs in our `.gitlab-ci.yml`.
e
Gotcha.
f
Our pipeline ran without issues after adding the bootstrap job. I’ll keep an eye on it, but hopefully that does the trick. Thanks for your help @enough-analyst-54434!
I spoke too soon. We are still seeing the pants CI jobs fail after adding a bootstrap step before the tailor/test/lint jobs that run in parallel. It seems like the issue I reported above may be related to GitLab’s caching (i.e., the pants cache is saved and removed after a short-lived job completes while another job is still running and attempting to access the same cache), but it’s hard to determine the cause. For now we’ve reverted to running the pants checks sequentially in a single job as it doesn’t take very long.
BTW, here is the pants section of the `.gitlab-ci.yml` file that was giving us trouble:
.pants_base:
  stage: pants
  # Global variables for pants goals
  variables:
    PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
    PANTS_SETUP_CACHE: "$CI_PROJECT_DIR/.cache/pants/setup"
    PANTS_CONFIG_FILES: pants.ci.toml
  cache:
    paths:
      - .cache/pip
      - .cache/pants/setup
      - .cache/pants/named_caches/pex_root/pip.pex
      - .cache/pants/named_caches/pex_root/http
      - .cache/pants/named_caches/pex_root/built_wheels
      - .cache/pants/lmdb_store
  tags:
    - bash

pants_bootstrap:
  extends: .pants_base
  script:
    - './pants -V'

# the following jobs run in parallel after the bootstrap job finishes successfully
pants_tailor:
  extends: .pants_base
  needs: ["pants_bootstrap"]
  script:
    - './pants --version'
    - './pants tailor --check update-build-files --check'

pants_test:
  extends: .pants_base
  needs: ["pants_bootstrap"]
  script:
    - './pants --version'
    - './pants check test ::'
  coverage: '/(?i)total.*? (100(?:\.0+)?\%|[1-9]?\d(?:\.\d+)?\%)$/'
  artifacts:
    paths:
      - coverage.xml
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage.xml

pants_lint:
  needs: ["pants_bootstrap"]
  extends: .pants_base
  script:
    - './pants --version'
    - './pants lint ::'
e
What happens if you remove the cache entry for `.cache/pants/setup`?
I.e., just let the `pants_bootstrap` job do the whole bootstrap each time, uncached. Clearly slower, but it would be a good debug step to see if that makes the issues go away.
f
With this setup, the `.pants_base` section defines the caches used by the subsequent jobs (via `extends: .pants_base`), so I believe that removing the `.cache/pants/setup` entry from that section would result in the tailor/test/lint jobs performing the bootstrap process as well, since each of them restores the same cache. Testing now to confirm.
The suggested change does result in each job running the bootstrap step with our CI configuration. While all jobs have completed successfully a few times, I’m hesitant to say this “fixes” the issue given the intermittent nature of the errors I reported. We are just going to run the pants checks sequentially until we have a real need to run them in parallel. Thanks again for your help.
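For reference, the sequential fallback can be a single script step along these lines. It is a minimal sketch: `run_pants_checks` is a hypothetical wrapper name, and the three commands are the ones from the `.gitlab-ci.yml` above, ordered so that missing BUILD files are caught first.

```shell
#!/usr/bin/env bash
# Run the three checks one after another and stop at the first failure.
# The optional first argument is the launcher to call (default: ./pants).
run_pants_checks() {
  local pants="${1:-./pants}"
  # tailor first, so missing BUILD files are reported before lint and test.
  "$pants" tailor --check update-build-files --check || return $?
  "$pants" lint :: || return $?
  "$pants" check test :: || return $?
}
```

Because only one Pants process runs at a time in one job, the bootstrap happens once and nothing else touches `.cache/pants/setup` while it is in progress.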
e
OK. It sounds like, to be of more help, a maintainer will probably need to dig in and set up a GitLab CI job to reproduce the intermittent failures and learn more about what's going on.