# development
c
FYI, I'm debugging a locking issue with @aloof-angle-91616 in #general; I just got a repro (https://github.com/ns-cweber/pants-repro) and it looks like it's coming from the engine (out of lmdb specifically). Seems relevant for this channel as well.
👍 1
e
This seems most naturally to be an artifact of containers. I think we're normally protected by the following code hierarchy:
https://github.com/pantsbuild/pants/blob/33a6d51db91ea1fe69c39117c8878abcb824cdfb/src/python/pants/process/lock.py#L22
 -> https://fasteners.readthedocs.io/en/latest/api/process_lock.html
  -> fcntl.lockf(self.lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
The last is an advisory interprocess lock that I imagine is not valid across separate container namespaces.
🔥 1
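A minimal sketch of that locking pattern, for reference: the fcntl.lockf call is the one from lock.py above, while the lockfile path and prints are hypothetical, not pants' actual implementation.
import fcntl

# Open (or create) the lockfile, then take an exclusive, non-blocking advisory
# lock on it; a second process doing the same against the same file fails fast.
lockfile = open("/tmp/example.file_lock", "w")  # hypothetical path
try:
    fcntl.lockf(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("lock acquired")
except OSError:
    print("another process already holds the lock")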
a
do we have any other implementations of file locks anywhere that might already do the trick? figuring this out now
no, it looks like we've centralized around the canonical version, cool
c
I figured it had something to do with locks; I'm still not sure why bash changes the behavior. In any case, is there a lock file that I'm meant to be mounting across containers? The whole cache directory is shared...
a
oh. .pants.workdir.file_lock might be that
i believe the other file locks are stored in .pants.d, i will run a find to see
c
Pretty sure .pants.workdir.file_lock is being shared by the $PWD:/workdir volume mount in the docker-compose.yml file.
👍 1
a
ok
c
They have the same timestamp, so I think they're shared.
👍 1
a
that is a good heuristic in this case, i think
c
"Path": "/usr/bin/scl",
        "Args": [
            "enable",
            "devtoolset-7",
            "--",
            "./pants",
            "run",
            "package:main"
        ],
a
that's what we do in our CI
in our `Dockerfile`s directly, they're in build-support/ somewhere (the scl command line)
(also, the (backtrace omitted) part is a bad error message -- it means pants wasn't able to get it, not that it decided not to)
(or at least, turning on PANTS_PRINT_EXCEPTION_STACKTRACE=True did not change the result)
e
For sanity's sake - does everyone here know that fcntl.lockf(self.lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB) should definitely work in separate containers? Sure, you share the relevant fs, but do you know what's going on in fcntl? I do not; I'm fairly unix / linux dumb
c
Yeah, I figured out that the error is coming out of the lmdb library (the C library, not the rust wrapper).
I don't know how fcntl works 😞
e
OK - until someone does, we'll all be blowing smoke
c
TIL it exists
e
IFF this is reliable info: https://gavv.github.io/articles/file-locks/#differing-features then this won't work. We use 'POSIX record locks', which lock an (inode, pid) pair. If that is correct, the two containers have different pid namespaces, so locking is broken
I do know the two containers definitely do have different pid namespaces
I don't know about the veracity of the rest
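A short sketch of that (inode, pid) keying under the article's claim, using a hypothetical scratch path: the same pid can re-take the lock through a second fd without conflict, while a forked child (a different pid contending on the same inode) is refused.
import fcntl
import os

path = "/tmp/lock-demo"  # hypothetical scratch file

# The lock owner is the (inode, pid) pair, so the same pid locking the same
# file again through a second fd does not conflict with itself.
a = open(path, "w")
fcntl.lockf(a, fcntl.LOCK_EX | fcntl.LOCK_NB)
b = open(path, "w")
fcntl.lockf(b, fcntl.LOCK_EX | fcntl.LOCK_NB)
print("same pid re-locked without conflict", flush=True)

# A different pid contending on the same inode is refused (POSIX record
# locks are not inherited across fork).
if os.fork() == 0:
    try:
        fcntl.lockf(open(path, "w"), fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("child acquired the lock (unexpected)", flush=True)
    except OSError:
        print("child refused: the lock is owned by another pid", flush=True)
    os._exit(0)
os.wait()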
c
I'm way out of my depth, but does it matter that the file descriptor is mounted into both containers?
e
So, assuming this is all true though, the next step @cuddly-window-48195 is to get the two containers using the same pid namespace.
fd maps to inode; that's only 1/2 of the key.
c
Any tips on how to do that?
e
So there are two problems to contend with
So that solves pid. The inode bit I'm not sure about. Your current compose shared volume is the best I know how to do off the top. So start with configuring the one container to be in the other's pid namespace, then report back.
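One quick way to verify the namespace change took, assuming Linux's /proc: a pid namespace is identified by the inode behind /proc/self/ns/pid, so this should print the same value in both containers.
import os

# Prints something like 'pid:[4026531836]'; equal values across the two
# containers mean they share a pid namespace.
print(os.readlink("/proc/self/ns/pid"))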
c
Ok, I set the pid namespace for both and it didn't change the behavior
e
So now stat the same lockfile from both containers and compare inode.
c
do we know which lockfile?
e
Doesn't matter. Any file shared between the two containers will do for this experiment.
Now that you know how we do our locking though, you should have everything you need to know to debug this. I need to run!
c
[root@89d05d1595d4 workdir]# stat ~/.cache/pants/lmdb_store/
  File: '/root/.cache/pants/lmdb_store/'
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: 801h/2049d      Inode: 1097323     Links: 5
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-10-16 21:46:55.290751000 +0000
Modify: 2019-10-16 21:46:55.300751000 +0000
Change: 2019-10-16 21:46:55.300751000 +0000
 Birth: -
e
Any file will do; don't introduce a dir as a variable, no matter how sensible it seems.
We definitely lock a file, not a dir
And you must do this from both containers and compare results. If the inodes are ==, then the linked article (or my reading of it) is likely wrong; if !=, we have a likely explanation for the lock failure.
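The same check in Python, if that's handier than stat; the lock.mdb path matches the one stat'ed just below.
import os

# Print the (device, inode) pair for the shared lockfile; run this in both
# containers and compare the output.
st = os.stat(os.path.expanduser("~/.cache/pants/lmdb_store/files/0/lock.mdb"))
print("device", hex(st.st_dev), "inode", st.st_ino)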
c
container 0:
[root@89d05d1595d4 workdir]# stat ~/.cache/pants/lmdb_store/files/0/lock.mdb
  File: '/root/.cache/pants/lmdb_store/files/0/lock.mdb'
  Size: 8192            Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d      Inode: 1097326     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-10-16 23:19:12.115466000 +0000
Modify: 2019-10-16 23:19:12.115466000 +0000
Change: 2019-10-16 23:19:12.115466000 +0000
 Birth: -
container 1:
File: '/root/.cache/pants/lmdb_store/files/0/lock.mdb'
  Size: 8192            Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d      Inode: 1097326     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-10-16 23:19:12.115466000 +0000
Modify: 2019-10-16 23:19:12.115466000 +0000
Change: 2019-10-16 23:19:12.115466000 +0000
 Birth: -
Same inode
e
OK - I leave it to you to dig further and file an issue when you've found the answer, or to cry uncle and just dump a summary of the current state of knowledge.
👍 2
c
Ok