# general
r
Hello, we have some issues with our Pants (2.14) jobs: they sometimes get stuck when running in Jenkins.
./pants tailor --check update-build-files --check ::
Logs:
11:58:10    360.38s   Test binary /bin/python.
11:58:10    360.38s   Test binary /data/env/py3.9.13/bin/python.
11:58:10    360.38s   Test binary /opt/conda/bin/python.
Why is it "testing" multiple Python binaries? The only one that matters is the py3.9.13 one. One interesting observation: if I connect to the node and check processes, I see two pantsd processes running. What could be the issue? What are the steps to troubleshoot it?
w
yikes, never seen that before! … those should be very small processes, which usually complete in a few dozen milliseconds.
if you’re able to actually inspect the `Test` processes to confirm that they are running, you might be able to attach to them to trace them
r
and it does, until it does not
w
the first one should be a `bash` process, while the latter two will be `python` processes (via pex)
r
How does Pants work if there are multiple Python interpreters on the PATH?
w
you can look for child processes of `pantsd`
> How does Pants work if there are multiple Python interpreters on the PATH?
that’s probably not related to your hang, but: the “test” that is running is just to query the interpreter for its version, approximately with `--version`
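roughly, it amounts to something like this (an illustrative sketch, not Pants’ exact command — the paths are the ones from your log):

```shell
# Roughly what each "Test binary <path>" step does: query the candidate
# interpreter for its version, so it can be matched against the repo's
# interpreter constraints.
results=""
for py in /bin/python /data/env/py3.9.13/bin/python /opt/conda/bin/python; do
  if [ -x "$py" ]; then
    # A live interpreter answers this in milliseconds.
    results="$results$py -> $("$py" --version 2>&1)\n"
  else
    results="$results$py -> not found on this host\n"
  fi
done
printf "%b" "$results"
```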
r
Is that how it chooses the Python interpreter based on the constraints?
w
correct
r
Why do you think it is not related? The Jenkins logs keep showing the same "task", as these are the only processes running at that particular stage.
11:57:10  15:57:07.20 [INFO] Long running tasks:
11:57:10    300.28s    Test binary /bin/python.
11:57:10    300.28s    Test binary /data/env/py3.9.13/bin/python.
11:57:10    300.28s    Test binary /opt/conda/bin/python.
11:57:42  15:57:37.26 [INFO] Long running tasks:
11:57:42    330.33s    Test binary /bin/python.
11:57:42    330.33s    Test binary /data/env/py3.9.13/bin/python.
11:57:42    330.33s    Test binary /opt/conda/bin/python.
After restarting the build several times, it somehow starts to work, but it is very annoying.
w
i just mean that the primary question you need to answer is: why are (very) small processes hanging? as mentioned above: you should check to see that those processes actually exist as children of the `pantsd` process
if they don’t, then it might be a pants bug
r
any hints on how to do that?
also, why are there two pants processes?
w
yes: use `ps -ef | grep pantsd` to find the PID of the `pantsd` process
if `pantsd` is in use, there will be one client process, and one server process.
and then `ps -ef | grep $PANTSD_PID` to find all processes which are children of `pantsd`
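for example, a rough sketch of scripting that check (assuming Linux `procps` tools; the `pgrep` pattern is a guess at how pantsd names itself):

```shell
# Find the pantsd server PID, then list its direct children; the
# "Test binary" processes should show up here while they run.
PANTSD_PID=$(pgrep -f 'pantsd' | head -n 1)
if [ -n "$PANTSD_PID" ]; then
  echo "pantsd PID: $PANTSD_PID"
  # --ppid filters to direct children of the given PID.
  ps -o pid,ppid,stat,cmd --ppid "$PANTSD_PID"
else
  echo "no pantsd process found"
fi
```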
r
I ran `ps -ef` on the node; that is how I saw the two pantsd processes. This is a Docker container, so there were only a few processes there at all. I did not pay attention to whether there were child processes.
It is stuck again. These are all the processes:
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 17:41 ?        00:00:00 /usr/bin/dumb-init -- /usr/local/bin/run-jnlp-client 00be9bdc4f08351618bb9b562179fa7ee56a6b3aa8b1177143ec0fd6f06a55b5 pants-cj0tt
root           7       1 18 17:41 ?        00:00:27 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -cp /home/jenkins/remoting.jar hudson.remoting.jnlp.Main -headless -u
root         348       1  3 17:42 ?        00:00:04 pantsd [/tmp/workspace/horizon_poc_PR-11]
root         450       1  0 17:42 ?        00:00:00 sh -c ({ while [ -d '/tmp/workspace/horizon_poc_PR-11@tmp/durable-6069406e' -a \! -f '/tmp/workspace/horizon_poc_PR-11@tmp/durable-6069406e/jenkins-result.txt' ]; do touch '/tmp/workspace/ho
root         451     450  0 17:42 ?        00:00:00 sh -c ({ while [ -d '/tmp/workspace/horizon_poc_PR-11@tmp/durable-6069406e' -a \! -f '/tmp/workspace/horizon_poc_PR-11@tmp/durable-6069406e/jenkins-result.txt' ]; do touch '/tmp/workspace/ho
root         452     450  0 17:42 ?        00:00:00 /bin/bash /tmp/workspace/horizon_poc_PR-11@tmp/durable-6069406e/script.sh
root         467     452  1 17:42 ?        00:00:01 /data/bread-service/py3.9.13/bin/python3.9 /data/bread-service/py3.9.13/bin/inv pants-tailor-check
root         468     467  0 17:42 pts/0    00:00:00 /home/jenkins/.pex/venvs/0bd641e3a90c5dabea350e64a646029c08613838/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/bin/python /tmp/workspace/horizon_poc_PR-11/pants tailor --check update-build-files
root         529     348  0 17:42 ?        00:00:00 pantsd [/tmp/workspace/horizon_poc_PR-11]
root         530     348  0 17:42 ?        00:00:00 [python] <defunct>
root         666       0  0 17:43 pts/1    00:00:00 sh
root         682     451  0 17:44 ?        00:00:00 sleep 3
root         683     666  0 17:44 pts/1    00:00:00 ps -ef
w
hm! yea, no child processes at all… that’s really strange. are you able to install `gdb`?
if so, you can attach with `gdb -p $PANTSD_PID`, and then get the stacks of all threads with `thread apply all bt`
r
hmm... not sure that can be done. Would Pants `--level=debug` help to diagnose?
w
it might, yea.
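e.g., one way to set it for a single CI run (Pants options can also be set via matching `PANTS_*` environment variables; `PANTS_PANTSD=false` additionally disables the daemon, which is worth trying while debugging a hang):

```shell
# Turn on debug logging for this run only.
export PANTS_LEVEL=debug
# Disable the daemon to rule it out as the source of the hang.
export PANTS_PANTSD=false
# The Jenkins step would then invoke the same goals as before, e.g.:
# ./pants tailor --check update-build-files --check ::
```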
g
We still have this issue even without using the daemon.
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
 root           1  0.0  0.0  14356  2936 pts/0    Ss+  15:01   0:00 /usr/bin/dumb-init -- /usr/local/bin/run-jnlp-client 3302b058fe181044f8862e20b5ccad7c7b756c7938456bdd3e45f
 root           7  2.6  0.0 38199116 749388 ?     Ssl  15:01   0:59 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizeP
 root         659  0.0  0.0  19756  3276 ?        S    15:03   0:00 sh -c ({ while [ -d '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92' -a \! -f '/tmp/
 root         660  0.0  0.0  19756  3816 ?        S    15:03   0:01 sh -c ({ while [ -d '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92' -a \! -f '/tmp/
 root         661  0.0  0.0  19760  4236 ?        S    15:03   0:00 sh -xe /tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/script.sh
 root         676  1.2  0.0 288736652 170444 ?    Sl   15:03   0:26 /home/jenkins/.pex/venvs/0bd641e3a90c5dabea350e64a646029c08613838/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055
 root         797  0.0  0.0 288831044 150524 ?    S    15:03   0:00 /home/jenkins/.pex/venvs/0bd641e3a90c5dabea350e64a646029c08613838/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055
 root         798  0.0  0.0      0     0 ?        Z    15:03   0:00 [python] <defunct>
 root         799  0.0  0.0      0     0 ?        Z    15:03   0:00 [python] <defunct>
 root         834  0.0  0.0  19768  4364 pts/1    Ss   15:03   0:00 sh
 root        1704  3.1  0.0 292283248 234420 pts/1 Sl+ 15:08   0:59 /home/jenkins/.pex/venvs/0bd641e3a90c5dabea350e64a646029c08613838/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055
 root        2065  0.0  0.0  19768  4392 pts/2    Ss+  15:13   0:00 sh
 root        2216  0.0  0.0  19768  4328 pts/3    Ss   15:17   0:00 sh
 root        2373  0.0  0.0  28560  4472 pts/3    S+   15:17   0:00 view /tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/script.sh
 root        3249  0.0  0.0  19768  4436 pts/4    Ss   15:35   0:00 sh
 root        3466  0.0  0.0  16484  3012 ?        S    15:39   0:00 sleep 3
 root        3467  0.0  0.0  57508  4772 pts/4    R+   15:39   0:00 ps aux
sh-4.2# ps auxwwwwwww
 USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
 root           1  0.0  0.0  14356  2936 pts/0    Ss+  15:01   0:00 /usr/bin/dumb-init -- /usr/local/bin/run-jnlp-client 3302b058fe181044f8862e20b5ccad7c7b756c7938456bdd3e45fbb2da521a92 pants-9zkz1
 root           7  2.3  0.0 38199116 749892 ?     Ssl  15:01   1:07 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -cp /home/jenkins/remoting.jar hudson.remoting.jnlp.Main -headless -url <http://10.74.244.8:80/|http://10.74.244.8:80/> -tunnel 10.190.16.184:50000 3302b058fe181044f8862e20b5ccad7c7b756c7938456bdd3e45fbb2da521a92 pants-9zkz1
 root         659  0.0  0.0  19756  3276 ?        S    15:03   0:00 sh -c ({ while [ -d '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92' -a \! -f '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-result.txt' ]; do touch '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-log.txt'; sleep 3; done } & jsc=durable-d3f85785a53f35e7c70de1cbe73a38a5; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/script.sh' > '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-log.txt' 2>&1; echo $? > '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-result.txt.tmp'; mv '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-result.txt.tmp' '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-result.txt'; wait) >&- 2>&- &
 root         660  0.0  0.0  19756  3828 ?        S    15:03   0:01 sh -c ({ while [ -d '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92' -a \! -f '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-result.txt' ]; do touch '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-log.txt'; sleep 3; done } & jsc=durable-d3f85785a53f35e7c70de1cbe73a38a5; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe  '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/script.sh' > '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-log.txt' 2>&1; echo $? > '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-result.txt.tmp'; mv '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-result.txt.tmp' '/tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/jenkins-result.txt'; wait) >&- 2>&- &
 root         661  0.0  0.0  19760  4236 ?        S    15:03   0:00 sh -xe /tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/script.sh
 root         676  1.1  0.0 288736652 170444 ?    Sl   15:03   0:32 /home/jenkins/.pex/venvs/0bd641e3a90c5dabea350e64a646029c08613838/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/bin/python /tmp/workspace/m-north-star_disable-pants-demon/pants --tag=unit_test test ::
 root         797  0.0  0.0 288831044 150524 ?    S    15:03   0:00 /home/jenkins/.pex/venvs/0bd641e3a90c5dabea350e64a646029c08613838/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/bin/python /tmp/workspace/m-north-star_disable-pants-demon/pants --tag=unit_test test ::
 root         798  0.0  0.0      0     0 ?        Z    15:03   0:00 [python] <defunct>
 root         799  0.0  0.0      0     0 ?        Z    15:03   0:00 [python] <defunct>
 root         834  0.0  0.0  19768  4364 pts/1    Ss   15:03   0:00 sh
 root        1704  2.6  0.0 292283248 234420 pts/1 Sl+ 15:08   1:06 /home/jenkins/.pex/venvs/0bd641e3a90c5dabea350e64a646029c08613838/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/bin/python /tmp/workspace/m-north-star_disable-pants-demon/pants --tag=unit_test test ::
 root        2065  0.0  0.0  19768  4392 pts/2    Ss+  15:13   0:00 sh
 root        2216  0.0  0.0  19768  4328 pts/3    Ss   15:17   0:00 sh
 root        2373  0.0  0.0  28560  4472 pts/3    S+   15:17   0:00 view /tmp/workspace/m-north-star_disable-pants-demon@tmp/durable-a4181b92/script.sh
 root        3249  0.0  0.0  19768  4436 pts/4    Ss+  15:35   0:00 sh
 root        3937  0.3  0.0  19768  4344 pts/5    Ss   15:49   0:00 sh
 root        3947  0.0  0.0  16484  2980 ?        S    15:49   0:00 sleep 3
 root        3948  0.0  0.0  57508  4784 pts/5    R+   15:49   0:00 ps auxwwwwwww
I don't know whether this is expected or not. I see some duplicated processes, e.g. PID 659 and 660, and PID 676 and 797?