# development
c
FYI, I'm debugging a locking issue with @aloof-angle-91616 in #general; I just got a repro (https://github.com/ns-cweber/pants-repro) and it looks like it's coming from the engine (out of lmdb specifically). Seems relevant for this channel as well.
👍 1
e
This seems most naturally to be an artifact of containers. I think we're normally protected by the following code hierarchy:
https://github.com/pantsbuild/pants/blob/33a6d51db91ea1fe69c39117c8878abcb824cdfb/src/python/pants/process/lock.py#L22
 -> https://fasteners.readthedocs.io/en/latest/api/process_lock.html
  -> fcntl.lockf(self.lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
The last is an advisory interprocess lock that I imagine is not valid across separate container namespaces.
🔥 1
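A minimal sketch of that locking pattern, for reference: the fcntl.lockf call is the one from lock.py above, while the lockfile path and prints are hypothetical, not pants' actual implementation.
import fcntl

# Open (or create) the lockfile, then take an exclusive, non-blocking advisory
# lock on it; a second process doing the same against the same file fails fast.
lockfile = open("/tmp/example.file_lock", "w")  # hypothetical path
try:
    fcntl.lockf(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("lock acquired")
except OSError:
    print("another process already holds the lock")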
a
do we have any other implementations of file locks anywhere that might already do the trick? figuring this out now
no, it looks like we've centralized around the canonical version, cool
c
I figured it had something to do with locks; I'm still not sure why bash changes the behavior. In any case, is there a lock file that I'm meant to be mounting across containers? The whole cache directory is shared...
a
oh. .pants.workdir.file_lock might be that
i believe the other file locks are stored in .pants.d, i will run a find to see
c
Pretty sure .pants.workdir.file_lock is being shared by the $PWD:/workdir volume mount in the docker-compose.yml file.
👍 1
a
ok
c
They have the same timestamp, so I think they're shared.
👍 1
a
that is a good heuristic in this case, i think
c
"Path": "/usr/bin/scl",
        "Args": [
            "enable",
            "devtoolset-7",
            "--",
            "./pants",
            "run",
            "package:main"
        ],
a
that's what we do in our CI
in our `Dockerfile`s directly, they're in build-support/ somewhere (the scl command line)
(also, the (backtrace omitted) part is a bad error message -- it means pants wasn't able to get it, not that it decided not to)
(or at least, turning on PANTS_PRINT_EXCEPTION_STACKTRACE=True did not change the result)
e
For sanity's sake - does everyone here know that fcntl.lockf(self.lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB) should definitely work in separate containers? Sure, you share the relevant fs, but do you know what's going on in fcntl? I do not; I'm fairly unix / linux dumb
c
Yeah, I figured out that the error is coming out of the lmdb library (the C library, not the rust wrapper).
I don't know how fcntl works 😞
e
OK - until someone does, we'll all be blowing smoke
c
TIL it exists
e
IFF this is reliable info: https://gavv.github.io/articles/file-locks/#differing-features then this won't work. We use 'POSIX record locks', which lock an (inode, pid) pair. If that is correct, the two containers have different pid namespaces, so locking is broken
I do know the two containers definitely do have different pid namespaces
I don't know about the veracity of the rest
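A short sketch of that (inode, pid) keying under the article's claim, using a hypothetical scratch path: the same pid can re-take the lock through a second fd without conflict, while a forked child (a different pid contending on the same inode) is refused.
import fcntl
import os

path = "/tmp/lock-demo"  # hypothetical scratch file

# The lock owner is the (inode, pid) pair, so the same pid locking the same
# file again through a second fd does not conflict with itself.
a = open(path, "w")
fcntl.lockf(a, fcntl.LOCK_EX | fcntl.LOCK_NB)
b = open(path, "w")
fcntl.lockf(b, fcntl.LOCK_EX | fcntl.LOCK_NB)
print("same pid re-locked without conflict", flush=True)

# A different pid contending on the same inode is refused (POSIX record
# locks are not inherited across fork).
if os.fork() == 0:
    try:
        fcntl.lockf(open(path, "w"), fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("child acquired the lock (unexpected)", flush=True)
    except OSError:
        print("child refused: the lock is owned by another pid", flush=True)
    os._exit(0)
os.wait()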
c
I'm way out of my depth, but does it matter that the file descriptor is mounted into both containers?
e
So, assuming this is all true though, the next step @cuddly-window-48195 is to get the two containers using the same pid namespace.
fd maps to inode; that's only 1/2 of the key.
c
Any tips on how to do that?
e
So there are two problems to contend with
So that solves pid. The inode bit I'm not sure about. Your current compose shared volume is the best I know how to do off the top. So start with configuring the one container to be in the other's pid namespace, then report back.
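One quick way to verify the namespace change took, assuming Linux's /proc: a pid namespace is identified by the inode behind /proc/self/ns/pid, so this should print the same value in both containers.
import os

# Prints something like 'pid:[4026531836]'; equal values across the two
# containers mean they share a pid namespace.
print(os.readlink("/proc/self/ns/pid"))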
c
Ok, I set the pid namespace for both and it didn't change the behavior
e
So now stat the same lockfile from both containers and compare inode.
c
do we know which lockfile?
e
Doesn't matter. Any file shared between the two containers will do for this experiment.
Now that you know how we do our locking though, you should have everything you need to know to debug this. I need to run!
c
[root@89d05d1595d4 workdir]# stat ~/.cache/pants/lmdb_store/
  File: '/root/.cache/pants/lmdb_store/'
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: 801h/2049d      Inode: 1097323     Links: 5
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-10-16 21:46:55.290751000 +0000
Modify: 2019-10-16 21:46:55.300751000 +0000
Change: 2019-10-16 21:46:55.300751000 +0000
 Birth: -
e
Any file will do; don't introduce a dir as a variable, no matter how sensible it seems.
We definitely lock a file, not a dir
And you must do this from both containers and compare results. If the inodes are ==, then the linked article (or my reading of it) is likely wrong; if !=, we have a likely explanation for the lock failure.
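The same check in Python, if that's handier than stat; the lock.mdb path matches the one stat'ed just below.
import os

# Print the (device, inode) pair for the shared lockfile; run this in both
# containers and compare the output.
st = os.stat(os.path.expanduser("~/.cache/pants/lmdb_store/files/0/lock.mdb"))
print("device", hex(st.st_dev), "inode", st.st_ino)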
c
container 0:
[root@89d05d1595d4 workdir]# stat ~/.cache/pants/lmdb_store/files/0/lock.mdb
  File: '/root/.cache/pants/lmdb_store/files/0/lock.mdb'
  Size: 8192            Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d      Inode: 1097326     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-10-16 23:19:12.115466000 +0000
Modify: 2019-10-16 23:19:12.115466000 +0000
Change: 2019-10-16 23:19:12.115466000 +0000
 Birth: -
container 1:
File: '/root/.cache/pants/lmdb_store/files/0/lock.mdb'
  Size: 8192            Blocks: 8          IO Block: 4096   regular file
Device: 801h/2049d      Inode: 1097326     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-10-16 23:19:12.115466000 +0000
Modify: 2019-10-16 23:19:12.115466000 +0000
Change: 2019-10-16 23:19:12.115466000 +0000
 Birth: -
Same inode
e
OK - I leave it to you to dig further and file an issue when you've found the answer, or to cry uncle and just dump a summary of the current state of knowledge.
👍 2
c
Ok