# general
a
Hey dear people 👋, another question. First, a bit of context: my team primarily uses bazel (it's C++, yeah), and for the sake of UX for most of the devs, we decided to put pants build-all and pants test-all as bazel targets. For that, we basically wrote a genrule to build everything and a simple starlark test task which runs as an exclusive test (meaning it will not be invoked in parallel with other build targets). So basically I tried my best to delegate stuff exclusively to pants so that no two invocations run at the same time. It works like a charm, yet from time to time (really, really rarely, so I can't track it down) I get this error in CI:
16:27:43.55 [INFO] Completed: Building build_backend.pex from setuptools_default_lockfile.txt
16:27:43.57 [ERROR] 1 Exception encountered:

  Exception: Failed to execute: Process {
    argv: [
        "../build_backend.pex_pex_shim.sh",
        "backend_shim.py",
    ],
    env: {},
    working_directory: Some(
        RelativePath(
            "chroot",
        ),
    ),
    input_digests: InputDigests {
        complete: Digest {
            hash: Fingerprint<23502f8ccd2f9a66600309139bc0bcf8fbb26194c30736d49fd8574d916fb665>,
            size_bytes: 392,
        },
        nailgun: Digest {
            hash: Fingerprint<e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855>,
            size_bytes: 0,
        },
        input_files: Digest {
            hash: Fingerprint<23502f8ccd2f9a66600309139bc0bcf8fbb26194c30736d49fd8574d916fb665>,
            size_bytes: 392,
        },
        immutable_inputs: {},
        use_nailgun: [],
    },
    output_files: {},
    output_directories: {
        RelativePath(
            "dist",
        ),
    },
    timeout: None,
    execution_slot_variable: None,
    concurrency_available: 0,
    description: "Run setuptools.build_meta:__legacy__ for //:devtools",
    level: Debug,
    append_only_caches: {
        CacheName(
            "pex_root",
        ): RelativePath(
            ".cache/pex_root",
        ),
    },
    jdk_home: None,
    platform_constraint: None,
    cache_scope: Successful,
}

Error launching process: Os { code: 26, kind: ExecutableFileBusy, message: "Text file busy" }
I saw e.g. https://github.com/pantsbuild/pants/issues/10507, but it seems to be closed. I know that what I'm doing there (putting one build system as a step of another) is really a hack, but that's how we currently live, and I know there are too many unknowns that I didn't provide in this post, so I'm just trying my luck here: are there any potential explanations for this? Thank you, pants is amazing, I wish C++ were supported one day, too 🙂 And then we'll drop bazel altogether
e
I missed what you use Pants for (build-all and test-all what?). If Go or JVM, then this is still relevant: https://github.com/pantsbuild/pants/issues/13424
Ah, ../build_backend.pex_pex_shim.sh: that comes from python_distribution building.
a
Sorry yes I forgot the main part – it’s Python…
e
Yeah - this is a general issue that we have only solved for a very narrow case, unfortunately. https://github.com/pantsbuild/pants/issues/13424 outlines some ideas, but this will continue to be a problem until that issue is closed.
This is nothing you're doing wrong at all. It's a generally tough problem when you mix forking with threads and you write out your own binaries. Pants does all 3!
This was the best run-down on this general Posix problem when I fixed the narrow case: https://github.com/golang/go/issues/22315 Good reading.
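For readers who want to see the underlying OS behavior, here is a minimal sketch, assuming Linux. It is not Pants code, and the file name shim.sh is made up for illustration. It writes a small script, keeps the write handle open, and tries to launch the script. In the real race the open write handle is one that a concurrently forked child inherited by accident; here it is simply held open on purpose.

```python
import errno
import os
import subprocess
import tempfile


def launch_while_open_for_write():
    """Launch a script that is still open for writing.

    Returns the errno of the launch failure, or None if the launch worked.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "shim.sh")  # hypothetical name
        writer = open(script, "w")
        try:
            writer.write("#!/bin/sh\nexit 0\n")
            writer.flush()
            os.chmod(script, 0o755)
            # The write handle is deliberately still open at this point.
            subprocess.run([script], check=True)
            return None
        except OSError as e:
            return e.errno
        finally:
            writer.close()


if __name__ == "__main__":
    result = launch_while_open_for_write()
    print("errno:", result, "ETXTBSY is", errno.ETXTBSY)
```

On Linux the launch is expected to fail with errno.ETXTBSY, the same "Text file busy" as in the report above. Other platforms may behave differently.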
a
Thanks @enough-analyst-54434! So generally, to reduce flakiness, is it a good idea to introduce manual retries?
h
It feels icky, but yeah, if it gets things working, and the issue is rare enough, then retries might be the way to go...
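A minimal sketch of such a manual retry, assuming the CI step can wrap the Pants invocation in a small Python script. The function name, the retry count, and the delay are all assumptions, not anything Pants provides. It retries only when the output contains the "Text file busy" message, so genuine build or test failures still fail on the first attempt.

```python
import subprocess
import time


def run_with_retries(argv, attempts=3, delay_seconds=1.0,
                     marker="Text file busy", run=subprocess.run):
    """Run argv, retrying only when the output contains `marker`.

    `run` is injectable so the retry logic can be exercised without
    launching a real process.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(argv, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        output = (result.stdout or "") + (result.stderr or "")
        # Give up on unrelated failures, or once attempts are exhausted.
        if marker not in output or attempt == attempts:
            return result
        time.sleep(delay_seconds)
    return result


# Example call (the goal and target spec are placeholders):
#   result = run_with_retries(["./pants", "test", "::"])
#   raise SystemExit(result.returncode)
```

Because the output is captured, a real wrapper would also want to print result.stdout and result.stderr so the CI log still shows them.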
e
It's definitely egg on our face to not have this fixed yet. We were a small team, but we're growing with help from more and more OSS contributors daily; so hopefully we'll manage the time to fix this soon enough.
f
about the only real way to solve this issue in a way that works on both Linux and macOS is to redesign the Pants local executor to spawn a separate process to handle writes into the execution sandbox so that there is no sharing with the process executing the build action. and that is probably a non-trivial amount of work versus just adding retries in places.
w
regarding the error at the top of the thread: that is actually supposed to be the narrow case which we have already addressed, I think? it’s an OS error directly from the “root” process (whereas https://github.com/pantsbuild/pants/issues/13424 deals with indirection where the root process spawns other processes, and those fail)
so perhaps something about our detection of the need for locking for that case just isn’t working… maybe because it’s relative?
e
Nope. There is a second script in that chroot we emit, the backend shim.
w
sure. but afaik,
Error launching process: Os { code: 26, kind: ExecutableFileBusy, message: "Text file busy" }
means the first argument is the thing experiencing the issue… it’s not able to start.
so anything else in the sandbox wouldn’t be relevant
e
I'm not confident in the reasoning, but it's all moot IIUC. We know how to solve this correctly but keep trying to punt because we don't have time.
Didn't follow my own rule and wasted some time looking. This is definitely not from the root process, because we have a retry loop that says: "Error launching process after {} {} for ETXTBSY. Final error was: {:?}"
So hand-wave-and-wave-some-more general brokenness that would be good to fix the right way here.
w
well, my hypothesis was that we didn’t decide to enter the loop at all, because we failed to match it as a root process (perhaps due to being relative)… i.e. that the workaround wasn’t triggered
e
Ah, ok. bad me. Sucked in.
w
i can look at it.
e
I think you may be right.
w
mm, yea: it’s probably failing to convert to a RelativePath because:
Relative paths that escape the root are not allowed
i’ll get a patch out. @acceptable-football-32760: thanks for the report!
e
There is an attempt to calculate this based on working_directory, though (chroot in this case). So the bug will be interesting, since ../foo + chroot working_dir should == "foo".
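The path arithmetic being described can be sketched as follows. This is an illustration in Python with a hypothetical helper name, not the actual Pants implementation (which is in Rust). The point is that the executable path has to be joined onto the working directory before checking whether it escapes the sandbox root, because "../build_backend.pex_pex_shim.sh" escapes on its own but not once it is resolved against "chroot".

```python
import posixpath


def resolve_in_sandbox(working_directory, executable):
    """Resolve `executable` relative to `working_directory` inside a sandbox.

    Both arguments are paths relative to the sandbox root. Raises
    ValueError if the resolved path would escape the root.
    """
    joined = posixpath.normpath(posixpath.join(working_directory, executable))
    if posixpath.isabs(joined) or joined == ".." or joined.startswith("../"):
        raise ValueError(
            "Relative paths that escape the root are not allowed: " + joined
        )
    return joined


if __name__ == "__main__":
    # The values from the failing Process in the report above.
    print(resolve_in_sandbox("chroot", "../build_backend.pex_pex_shim.sh"))
```

With the values from the report, the executable resolves to build_backend.pex_pex_shim.sh at the sandbox root, while the same path checked without the working directory would be rejected as escaping.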
w
https://github.com/pantsbuild/pants/pull/14812 @acceptable-football-32760: which version of Pants are you using?
a
Wow guys you’re good! @witty-crayon-22786 it’s 10rc0 still
👍 1
❤️ 1
h
Thanks! It's so helpful having bug reports like this. Cool, this change will be cherry-picked into 2.10 and I plan to do another release this evening
🤩 1
a
Thanks, we've pulled the new version! Once again, this is an amazingly fast response!
👍 1
❤️ 2