我正在處理髮生在nodejs應用程序中的奇怪的100%CPU使用問題。該應用程序非常龐大,我不確定它造成這種情況的位置。該應用程序由cluster_mode中的pm2進行管理。Nodejs由於clock_gettime/gettimeofday/futex,100%的CPU使用率?
我所知道的是在CPU使用率過高時,strace
輸出這樣的:
[email protected]:/# strace -p 4350 -c
Process 4350 attached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.000031 0 3388 clock_gettime
0.00 0.000000 0 1 read
0.00 0.000000 0 2 write
0.00 0.000000 0 1 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00 0.000031 3392 total
[email protected]:~# strace -p 3367 -r -c
Process 3367 attached
^CProcess 3367 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
75.00 0.000939 0 91973 gettimeofday
25.00 0.000313 0 39417 clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00 0.001252 131390 total
整個應用程序是在這一點上反應遲鈍。約5分鐘後,pm2
會撿起並重新啓動過程中由於「零內存報」:
2016-12-03-20:29:05 PM2 [PM2][WORKER] Process 1 restarted because it uses 0 memory and has ONLINE status
2016-12-03-20:29:05 PM2 Stopping app:api-v2 id:1
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:06 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still not killed, retrying...
2016-12-03-20:29:07 PM2 Process with pid 3367 still alive after 1600ms, sending it SIGKILL now...
2016-12-03-20:29:07 PM2 App name:api-v2 id:1 disconnected
2016-12-03-20:29:07 PM2 App [api-v2] with id [1] and pid [3367], exited with code [0] via signal [SIGKILL]
2016-12-03-20:29:07 PM2 Starting execution sequence in -cluster mode- for app name:api-v2 id:1
2016-12-03-20:29:07 PM2 App name:api-v2 id:1 online
顯然,這是由PM2錯誤引起的:https://github.com/Unitech/pm2/issues/2492。但是,如果他們修正了這個錯誤,它將不會重新啓動該進程,從而讓它停滯不前,現在我別無選擇,只能繼續使用舊版本。
如果我使用time
和strace
啓動進程,那麼:
real 0m45.765s
user 0m3.349s
sys 0m0.340s
[email protected]:~/$ strace -cf node /var/www/api-v2.js
Process 4020 attached
...
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
99.09 9.247853 4538 2038 26 futex
0.18 0.016793 1 17777 clock_gettime
0.16 0.015384 12 1262 epoll_wait
0.11 0.010522 116 91 poll
0.09 0.008339 2 5237 2437 stat
0.08 0.007856 6 1234 write
0.05 0.004309 3 1571 close
0.03 0.003150 2 1790 read
0.03 0.003150 2 1333 248 open
0.03 0.003046 11 265 mmap
0.02 0.002049 2 1186 lstat
0.02 0.001617 4 378 madvise
0.02 0.001535 2 917 fstat
0.02 0.001518 1 1773 gettimeofday
0.01 0.001096 1 1224 35 epoll_ctl
0.01 0.000983 3 329 37 connect
0.01 0.000792 1 667 329 accept4
0.01 0.000734 10 76 brk
0.01 0.000617 2 338 pread
0.00 0.000315 2 155 socket
0.00 0.000265 9 30 sendmmsg
0.00 0.000184 1 144 munmap
0.00 0.000162 1 113 mprotect
0.00 0.000125 4 35 sendto
0.00 0.000114 7 16 setsockopt
0.00 0.000078 1 60 recvfrom
0.00 0.000071 1 105 recvmsg
0.00 0.000064 2 35 writev
0.00 0.000052 7 8 clone
0.00 0.000049 2 20 20 access
0.00 0.000043 0 192 getsockname
0.00 0.000029 7 4 getdents
0.00 0.000024 1 36 bind
0.00 0.000023 23 1 readlink
0.00 0.000020 1 35 getsockopt
0.00 0.000019 19 1 execve
0.00 0.000018 0 86 9 ioctl
0.00 0.000011 2 5 rt_sigprocmask
0.00 0.000009 5 2 openat
0.00 0.000006 1 11 getcwd
0.00 0.000005 5 1 lseek
0.00 0.000005 0 35 rt_sigaction
0.00 0.000003 3 1 arch_prctl
0.00 0.000000 0 1 listen
0.00 0.000000 0 14 uname
0.00 0.000000 0 2 getrlimit
0.00 0.000000 0 2 getuid
0.00 0.000000 0 1 getgid
0.00 0.000000 0 1 geteuid
0.00 0.000000 0 1 getegid
0.00 0.000000 0 4 prctl
0.00 0.000000 0 1 setrlimit
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 clock_getres
0.00 0.000000 0 9 set_robust_list
0.00 0.000000 0 1 eventfd2
0.00 0.000000 0 1 epoll_create1
0.00 0.000000 0 2 dup3
0.00 0.000000 0 2 pipe2
------ ----------- ----------- --------- --------- ----------------
100.00 9.333037 40661 3141 total
我沒有在我的代碼的任何setTimeout
電話,但我想我有一個不依賴。我查閱了最近的更改,它似乎不涉及遞歸調用或永不結束的循環。
我觀察到零內存泄漏,即pm2內存大小不會增加超時。以前相同的程序運行了2個月而沒有重新啓動,負載類似。服務器具有更多的CPU,RAM,交換資源。
這個問題在Ubuntu的一些常規維護(升級nodejs的apt-get升級,以及mongodb和npm依賴性升級)後開始出現。 nodejs升級從4.6.1升級到4.6.2。但是當我降級到4.6.1時,問題仍然存在。我嘗試了4.4.7和6.9.1,沒有任何版本看起來沒有問題。
如何調試此問題?我從哪說起呢?
是否可以記錄傳入呼叫以構建模擬以在調試環境中進入該狀況? – Bernhard
你是否正在運行pm2觀看文件更改?我會禁用,看看是否導致高CPU使用率。 –
你需要更多地縮小這個範圍。這是否僅在特定位代碼被執行時纔會發生?你可以在不同的物理機器上重現它嗎? – tadman