# general
g (Qiang):
Hi team, I want to follow up on an issue I recently encountered in my Jenkins CI pipeline with Pants. For some reason, from time to time, the CI hangs in the following scenarios. 1. If I restart the builds, it may pass or hang on different modules:
[2023-04-03T19:05:54.375Z] 19:05:51.57 [INFO] Long running tasks:
[2023-04-03T19:05:54.375Z]   863.93s	Determine Python dependencies for X1.py
[2023-04-03T19:05:54.375Z]   864.06s	Determine Python dependencies for X2.py
[2023-04-03T19:05:54.375Z]   865.21s	Determine Python dependencies for X3.py
[2023-04-03T19:05:54.375Z]   865.24s	Determine Python dependencies for X4.py
[2023-04-03T19:05:54.375Z]   867.84s	Determine Python dependencies for X5.py
[2023-04-03T19:05:54.375Z]   867.85s	Determine Python dependencies for X6.py
[2023-04-03T19:05:54.375Z]   867.97s	Determine Python dependencies for X7.py
2. If I restart the builds, sometimes it goes away:
11:58:10    360.38s   Test binary /bin/python.
11:58:10    360.38s   Test binary /data/env/py3.9.13/bin/python.
11:58:10    360.38s   Test binary /opt/conda/bin/python.
For both scenarios, I saw multiple pantsd processes. For example:
sh-4.2# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 17:08 pts/0    00:00:00 /usr/bin/dumb-init -- /usr/local/bin/run-jnlp-client 03564869d53ea68cd383a448680f9abfa4cc44fcd3f0480712dad2283953ec15 pan
root           7       1  2 17:08 ?        00:01:25 java -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 
root         803       1  0 17:10 ?        00:00:00 sh -c ({ while [ -d '/tmp/workspace/script@tmp/durable-1914e334' -a \! -f '/tmp/workspace/ar_AT
root         804     803  0 17:10 ?        00:00:01 sh -c ({ while [ -d '/tmp/workspace/script@tmp/durable-1914e334' -a \! -f '/tmp/workspace/ar_AT
root         805     803  0 17:10 ?        00:00:00 sh -xe /tmp/workspace/ar_ATOMFM-327_single_eval_script@tmp/durable-1914e334/script.sh
root         820     805  0 17:10 ?        00:00:00 /home/jenkins/.pex/venvs/0bd641e3a90c5dabea350e64a646029c08613838/779eb2cc0ca9e2fdd204774cbc41848e4e7c5055/bin/python /tm
root         822     820  0 17:10 ?        00:00:00 [/home/jenkins/.] <defunct>
root         823       1 16 17:10 ?        00:10:40 pantsd [/tmp/workspace/ar_ATOMFM-327_single_eval_script]
root        1118     823  0 17:11 ?        00:00:00 pantsd [/tmp/workspace/ar_ATOMFM-327_single_eval_script]
root        1119     823  0 17:11 ?        00:00:00 pantsd [/tmp/workspace/ar_ATOMFM-327_single_eval_script]
root        1120     823  0 17:11 ?        00:00:00 pantsd [/tmp/workspace/ar_ATOMFM-327_single_eval_script]
root        1121     823  0 17:11 ?        00:00:00 pantsd [/tmp/workspace/ar_ATOMFM-327_single_eval_script]
root        4778       0  0 18:14 pts/1    00:00:00 sh
root        4785     804  0 18:14 ?        00:00:00 sleep 3
root        4786    4778  0 18:14 pts/1    00:00:00 ps -ef
I tried disabling pantsd in Jenkins, but it does not help. Instead of multiple pantsd processes, I see the following, for example:
root         798  0.0  0.0      0     0 ?        Z    15:03   0:00 [python] <defunct>
root         799  0.0  0.0      0     0 ?        Z    15:03   0:00 [python] <defunct>
My feeling is that somehow the child processes get stuck. I am wondering how I can troubleshoot this, since I cannot reproduce it locally. Your advice would be greatly appreciated.
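For reference, disabling pantsd for a single run is typically done like this (a rough sketch assuming the standard ./pants launcher; the goal and target below are only placeholders):
# Disable pantsd for one invocation, either via environment variable or via flag.
PANTS_PANTSD=False ./pants test ::   # environment variable form
./pants --no-pantsd test ::          # command-line flag form
# It can also be set persistently with "pantsd = false" under [GLOBAL] in pants.toml.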
w (witty-crayon-22786):
hey Qiang! as mentioned in the previous thread, installing gdb on the host, attaching to the pants (or pantsd) process, and then running thread apply all bt might get us useful data.
g (Qiang):
Thanks @witty-crayon-22786. I am not familiar with gdb. I am wondering, do I need to set it up before running pants, or can I use it once I see pants hanging?
w (witty-crayon-22786):
If you have access to install things in the container, then you can do it after you've seen the hang.
g (Qiang):
Cool. Let me give it a try.