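I rarely post questions to forums, but this one has me stumped. I'm curious what is causing it (a solution would be nice too, but mostly I want to know why I'm hitting this problem): a Python script behaves differently on RHEL and Debian, with nearly identical Python versions.
I recently wrote a Python script to wrap invocations of remote commands that are launched by PBS jobs: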
#! /usr/bin/env python
#
# Copyright (c) 2009 Maciej Brodowicz
# Copyright (c) 2011 Bryce Lelbach
#
# Distributed under the Boost Software License, Version 1.0. (See accompanying
# file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
from datetime import datetime
from string import letters, digits
from types import StringType
from optparse import OptionParser
from threading import Thread
# subprocess instantiation wrapper. Unfortunately older Python still lurks on
# some machines.
try:
    from subprocess import Popen, STDOUT, PIPE
    from types import StringType

    class process:
        _proc = None
        _exec = None

        def __init__(self, cmd):
            self._proc = Popen(cmd, stderr = STDOUT, stdout = PIPE,
                shell = (False, True)[type(cmd) == StringType])

        def poll(self):
            return self._proc.poll()

        def pid(self):
            return self._proc.pid

        def _call(self):
            # annoyingly, KeyboardInterrupts are transported to threads, while most
            # other Exceptions aren't in python
            try:
                self._proc.wait()
            except Exception, err:
                self._exec = err

        def wait(self, timeout=None):
            if timeout is not None:
                thread = Thread(target=self._call)
                thread.start()
                # wait for the thread and invoked process to finish
                thread.join(timeout)
                # be forceful
                if thread.is_alive():
                    self._proc.terminate()
                    thread.join()
                    # if an exception happened, re-raise it here in the master thread
                    if self._exec is not None:
                        raise self._exec
                    return (True, self._proc.returncode)
                if self._exec is not None:
                    raise self._exec
                return (False, self._proc.returncode)
            else:
                return (False, self._proc.wait())

        def read(self):
            return self._proc.stdout.read()

except ImportError, err:
    # no "subprocess"; use older popen module
    from popen2 import Popen4
    from signal import SIGKILL
    from os import kill, waitpid, WNOHANG

    class process:
        _proc = None

        def __init__(self, cmd):
            self._proc = Popen4(cmd)

        def poll(self):
            return self._proc.poll()

        def pid(self):
            return self._proc.pid

        def _call(self):
            # annoyingly, KeyboardInterrupts are transported to threads, while most
            # other Exceptions aren't in python
            try:
                self._proc.wait()
            except Exception, err:
                self._exec = err

        def wait(self, timeout=None):
            if timeout is not None:
                thread = Thread(target=self._call)
                thread.start()
                # wait for the thread and invoked process to finish
                thread.join(timeout)
                # be forceful
                if thread.is_alive():
                    kill(self._proc.pid, SIGKILL)
                    waitpid(-1, WNOHANG)
                    thread.join()
                    # if an exception happened, re-raise it here in the master thread
                    if self._exec is not None:
                        raise self._exec
                    return (True, self._proc.wait())
                if self._exec is not None:
                    raise self._exec
                return (False, self._proc.wait())
            else:
                return (False, self._proc.wait())

        def read(self):
            return self._proc.fromchild.read()

def run(cmd, timeout=3600):
    start = datetime.now()
    proc = process(cmd)
    (timed_out, returncode) = proc.wait(timeout)
    now = datetime.now()
    output = ''
    while True:
        s = proc.read()
        if s:
            output += s
        else:
            break
    return (returncode, output, timed_out)

def rstrip_last(s, chars):
    if s[-1] in chars:
        return s[:-1]
    else:
        return s

# {{{ main
usage = "usage: %prog [options]"
parser = OptionParser(usage=usage)
parser.add_option("--timeout",
                  action="store", type="int",
                  dest="timeout", default=3600,
                  help="Program timeout (seconds)")
parser.add_option("--program",
                  action="store", type="string",
                  dest="program",
                  help="Program to invoke")
(options, cmd) = parser.parse_args()
if None == options.program:
    print "No program specified"
    exit(1)
(returncode, output, timed_out) = run(options.program, options.timeout)
if not 0 == len(output):
    print rstrip_last(output, '\n')
if timed_out:
    print "Program timed out"
exit(returncode)
# }}}
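Another Python script puts together command-line arguments based on the available resources reported by PBS, similar to mpirun. I use python-paramiko to launch the remote commands over SSH. Originally I just executed the commands directly, but I wasn't getting the correct exit code when one of the remotely running processes exited with a signal (e.g. SIGSEGV). Hence the need for the script above.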
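For context, here is a minimal sketch of how `subprocess` reports a child that dies from a signal on POSIX: `Popen.returncode` is the negated signal number. The names `CRASH_CMD`, `run_and_report`, and `describe` are hypothetical and not part of the script above; the child is a Python one-liner that sends SIGSEGV to itself. This only shows the local side and says nothing about what paramiko reports over SSH.

```python
import signal
import subprocess
import sys

# Hypothetical child for illustration: a Python one-liner that
# sends SIGSEGV to itself.
CRASH_CMD = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]


def run_and_report(cmd):
    """Run cmd (an argument list, no shell) and return Popen's returncode."""
    proc = subprocess.Popen(cmd)
    return proc.wait()


def describe(returncode):
    """Turn a POSIX Popen returncode into a readable string."""
    if returncode < 0:
        # On POSIX, a negative returncode -N means the child was
        # terminated by signal N.
        return "killed by signal %d" % -returncode
    return "exited with status %d" % returncode
```

Calling `describe(run_and_report(CRASH_CMD))` distinguishes a signal death from a normal exit, which is the information a wrapper needs in order to pass it along.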
While running this script on my development cluster, I noticed that it subtly fails on my 4-core Debian GNU/Linux nodes, but works on my 48-core RHEL/Linux nodes:
On the Debian node:
wash@hermione0:~/sandbox$ python --version
Python 2.6.7
wash@hermione0:~/sandbox$ uname -a
Linux hermione0 2.6.32-5-amd64 #1 SMP Wed Jan 12 03:40:32 UTC 2011 x86_64 GNU/Linux
wash@hermione0:~/sandbox$ time ./hpx_invoke.py --program='sleep 30' --timeout=5
Program timed out
real 0m30.025s
user 0m0.016s
sys 0m0.012s
wash@hermione0:~/sandbox$
On the RHEL node:
[22:08:23]:wash@vega:/home/wash/sandbox$ python --version
Python 2.6.6
[22:09:28]:wash@vega:/home/wash/sandbox$ uname -a
Linux vega 2.6.32-131.4.1.el6.x86_64 #1 SMP Fri Jun 10 10:54:26 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[22:09:30]:wash@vega:/home/wash/sandbox$ time ./hpx_invoke.py --program='sleep 30' --timeout=5
Program timed out
real 0m5.053s
user 0m0.040s
sys 0m0.020s
[22:09:41]:wash@vega:/home/wash/sandbox$
What could be causing this?
P.S. I am the sysadmin on these boxes.
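One way to narrow this down, offered as a sketch rather than a diagnosis, is to time the wait phase and the read phase of `run()` separately. The helper `timed_run` below is hypothetical and not part of the script above; it mirrors the same wait-with-timeout pattern using only `subprocess`, `threading`, and `time`, and is written to run under Python 2.6+ as well as Python 3. If the read phase turns out to be the slow one, that would suggest something is still holding the output pipe open after the terminate; that is an assumption to test, not a confirmed cause.

```python
import subprocess
import threading
import time


def timed_run(cmd, timeout):
    """Run cmd via the shell with a timeout.

    Returns (wait_seconds, read_seconds, timed_out).
    """
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    start = time.time()

    # Same pattern as the script: wait in a helper thread, join with a timeout.
    waiter = threading.Thread(target=proc.wait)
    waiter.start()
    waiter.join(timeout)

    timed_out = waiter.is_alive()
    if timed_out:
        # terminate() signals only the process that Popen itself started.
        proc.terminate()
        waiter.join()
    waited = time.time()

    # read() returns at end-of-file, i.e. once every process that holds
    # the write end of the pipe has closed it.
    proc.stdout.read()
    proc.stdout.close()
    done = time.time()

    return (waited - start, done - waited, timed_out)
```

Calling `timed_run('sleep 30', 5)` on each node and comparing the two durations would show which phase accounts for the extra 25 seconds on Debian.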
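Maybe it's just late, but what is failing on the Debian node that succeeds on the RHEL node? I understand that the RHEL version "times out" "faster", but what exactly is the expected behavior? – Andrew
It's not just late. If the invoked program runs for three hours on Debian with a timeout of 40 seconds, it only gets killed after it has already exited. On RHEL, it times out after the specified timeout. – wash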
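It's too late in the day to look at this much code. My brain is completely fried from staring at code all day. Can't you just paste the one suspicious line, so I can say "aha, you didn't close a parenthesis", and then you smack yourself on the forehead and I grin a smug grin, because I know that's probably the only kind of question I could answer right now. –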