2011-06-30 19 views
4

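Different behavior of a Python script on RHEL and Debian with nearly identical Python versions

I rarely post questions to forums, but this one has me stumped. I'm curious what is causing this (a solution would be nice too, but mostly I want to know why I'm running into the problem).

I recently wrote a Python script to wrap the remote commands launched by PBS jobs: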

#! /usr/bin/env python 
# 
# Copyright (c) 2009 Maciej Brodowicz 
# Copyright (c) 2011 Bryce Lelbach 
# 
# Distributed under the Boost Software License, Version 1.0. (See accompanying 
# file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt) 

from datetime import datetime
from string import letters, digits
from types import StringType
from optparse import OptionParser
from threading import Thread

# subprocess instantiation wrapper. Unfortunately older Python still lurks on 
# some machines. 
try:
    from subprocess import Popen, STDOUT, PIPE
    from types import StringType

    class process:
        _proc = None
        _exec = None

        def __init__(self, cmd):
            # a plain string is run through the shell; a list is exec'd directly
            self._proc = Popen(cmd, stderr = STDOUT, stdout = PIPE,
                               shell = (False, True)[type(cmd) == StringType])

        def poll(self):
            return self._proc.poll()

        def pid(self):
            return self._proc.pid

        def _call(self):
            # annoyingly, KeyboardInterrupts are transported to threads, while most
            # other Exceptions aren't in python
            try:
                self._proc.wait()
            except Exception, err:
                self._exec = err

        def wait(self, timeout=None):
            if timeout is not None:
                thread = Thread(target=self._call)
                thread.start()

                # wait for the thread and invoked process to finish
                thread.join(timeout)

                # be forceful
                if thread.is_alive():
                    self._proc.terminate()
                    thread.join()

                    # if an exception happened, re-raise it here in the master thread
                    if self._exec is not None:
                        raise self._exec

                    return (True, self._proc.returncode)

                if self._exec is not None:
                    raise self._exec

                return (False, self._proc.returncode)

            else:
                return (False, self._proc.wait())

        def read(self):
            return self._proc.stdout.read()

except ImportError, err:
    # no "subprocess"; use older popen module
    from popen2 import Popen4
    from signal import SIGKILL
    from os import kill, waitpid, WNOHANG

    class process:
        _proc = None
        # needed so that wait() can test self._exec when no exception occurred
        _exec = None

        def __init__(self, cmd):
            self._proc = Popen4(cmd)

        def poll(self):
            return self._proc.poll()

        def pid(self):
            return self._proc.pid

        def _call(self):
            # annoyingly, KeyboardInterrupts are transported to threads, while most
            # other Exceptions aren't in python
            try:
                self._proc.wait()
            except Exception, err:
                self._exec = err

        def wait(self, timeout=None):
            if timeout is not None:
                thread = Thread(target=self._call)
                thread.start()

                # wait for the thread and invoked process to finish
                thread.join(timeout)

                # be forceful
                if thread.is_alive():
                    kill(self._proc.pid, SIGKILL)
                    waitpid(-1, WNOHANG)
                    thread.join()

                    # if an exception happened, re-raise it here in the master thread
                    if self._exec is not None:
                        raise self._exec

                    return (True, self._proc.wait())

                if self._exec is not None:
                    raise self._exec

                return (False, self._proc.wait())

            else:
                return (False, self._proc.wait())

        def read(self):
            return self._proc.fromchild.read()

def run(cmd, timeout=3600):
    start = datetime.now()
    proc = process(cmd)
    (timed_out, returncode) = proc.wait(timeout)
    now = datetime.now()

    output = ''

    # drain the child's combined stdout/stderr until EOF
    while True:
        s = proc.read()

        if s:
            output += s
        else:
            break

    return (returncode, output, timed_out)

def rstrip_last(s, chars):
    if s[-1] in chars:
        return s[:-1]
    else:
        return s

# {{{ main 
usage = "usage: %prog [options]" 

parser = OptionParser(usage=usage) 

parser.add_option("--timeout", 
        action="store", type="int", 
        dest="timeout", default=3600, 
        help="Program timeout (seconds)") 

parser.add_option("--program", 
        action="store", type="string", 
        dest="program", 
        help="Program to invoke") 

(options, cmd) = parser.parse_args() 

if None == options.program: 
    print "No program specified" 
    exit(1) 

(returncode, output, timed_out) = run(options.program, options.timeout) 

if not 0 == len(output): 
    print rstrip_last(output, '\n') 

if timed_out: 
    print "Program timed out" 

exit(returncode) 
# }}} 

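Another Python script assembles the command-line arguments from the available resources reported by PBS, much like mpirun does. I use python-paramiko to launch the remote commands over SSH. Originally I just executed the commands directly, but when one of the remotely running processes exited on a signal (e.g. SIGSEGV), I did not receive the correct exit code. Hence the need for the script above.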
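As a side note on the exit-code problem described above, here is a minimal sketch (written for Python 3, not the Python 2.6 used in the question) of how a signal death can be told apart from a normal exit. On POSIX, `Popen.returncode` is `-N` when the child was terminated by signal `N`; the `128 + N` value follows the common shell convention. The helper name `describe_exit` is hypothetical and not part of the original script.

```python
import signal


def describe_exit(returncode):
    """Map a Popen.returncode to (shell_style_status, description).

    Hypothetical helper for illustration; assumes POSIX semantics, where a
    negative return code -N means "terminated by signal N".
    """
    if returncode is None:
        return (None, "still running")
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # signal number unknown to this platform
            name = "signal %d" % signum
        # shells conventionally report 128 + N for a signal death
        return (128 + signum, "killed by %s" % name)
    return (returncode, "exited with status %d" % returncode)
```

A wrapper could pass `proc.returncode` through such a helper before calling `exit()`, so that the caller on the other end of the SSH session sees a status that still encodes the signal.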

While running this script on my development cluster, I noticed that it fails subtly on my 4-core Debian GNU/Linux nodes, but works on my 48-core RHEL/Linux nodes:

On the Debian node:

wash@hermione0:~/sandbox$ python --version
Python 2.6.7
wash@hermione0:~/sandbox$ uname -a
Linux hermione0 2.6.32-5-amd64 #1 SMP Wed Jan 12 03:40:32 UTC 2011 x86_64 GNU/Linux
wash@hermione0:~/sandbox$ time ./hpx_invoke.py --program='sleep 30' --timeout=5
Program timed out

real 0m30.025s
user 0m0.016s
sys 0m0.012s
wash@hermione0:~/sandbox$

On the RHEL node:

[22:08:23]:wash@vega:/home/wash/sandbox$ python --version
Python 2.6.6
[22:09:28]:wash@vega:/home/wash/sandbox$ uname -a
Linux vega 2.6.32-131.4.1.el6.x86_64 #1 SMP Fri Jun 10 10:54:26 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
[22:09:30]:wash@vega:/home/wash/sandbox$ time ./hpx_invoke.py --program='sleep 30' --timeout=5
Program timed out

real 0m5.053s
user 0m0.040s
sys 0m0.020s
[22:09:41]:wash@vega:/home/wash/sandbox$

What could be causing this?

P.S. I am the sysadmin on these boxes.

+2

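Maybe it's just that it's late, but what exactly fails on the Debian node that succeeds on the RHEL node? I understand that the RHEL version "times out" "faster", but what is the expected behavior? – Andrew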

+0

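It's not late. If the invoked program runs for three hours on Debian with a timeout of 40 seconds, it will only be killed once it has exited. On RHEL, it times out after the specified timeout. – wash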

+1

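It's too late to be looking at this much code. My brain is completely fried from staring at code all day. Couldn't you just paste the one suspicious line, so that I can say "aha, you didn't close a parenthesis", you slap yourself on the forehead, and I grin a smug grin, because I know that is probably the only kind of question I could answer right now. –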

Answers

0

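The problem turned out to be that the subprocess was invoked through a shell (both machines have the subprocess package). On the RHEL node, when /bin/sh is terminated, the invoked program is terminated as well. On the Debian node, only the /bin/sh process is terminated, and the invoked program stays alive.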
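The behavior described above is what one would expect if the shell on one machine replaces itself with the command while the shell on the other forks a child for it. That is an assumption here, not something this answer states (Debian's /bin/sh is typically dash, RHEL's is bash). If the shell has to stay, one possible approach is to start it in its own session and signal the whole process group on timeout. The sketch below is written for Python 3 on POSIX; the name `run_shell_with_timeout` is hypothetical.

```python
import os
import signal
import subprocess


def run_shell_with_timeout(cmd, timeout):
    """Run cmd via the shell; on timeout, signal the whole process group.

    Returns (returncode, output, timed_out). POSIX only.
    """
    proc = subprocess.Popen(
        cmd,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        # the child calls setsid(), so its pid doubles as its process group id
        start_new_session=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout)
        return (proc.returncode, output, False)
    except subprocess.TimeoutExpired:
        # signal the shell and everything it started, not just the shell
        os.killpg(proc.pid, signal.SIGTERM)
        output, _ = proc.communicate()
        return (proc.returncode, output, True)
```

With a command such as `'sleep 30; echo done'`, the shell cannot simply replace itself with `sleep`, so signalling only the shell would leave `sleep` running; signalling the group stops both.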

I fixed this by changing the script so that it no longer uses shell=True.
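A minimal sketch of that kind of change, assuming Python 3 rather than the Python 2.6 of the question: the command string is split into an argument list with `shlex.split` so that `Popen` executes the program directly, and the pid being terminated is therefore the program itself. The function name `run_without_shell` is hypothetical; the return tuple mirrors `run()` from the question.

```python
import shlex
import subprocess


def run_without_shell(cmd, timeout=3600):
    """Run cmd with no intermediate /bin/sh and enforce a timeout.

    cmd may be a string (split with shlex) or an argument list.
    Returns (returncode, output, timed_out).
    """
    argv = shlex.split(cmd) if isinstance(cmd, str) else list(cmd)
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    try:
        output, _ = proc.communicate(timeout=timeout)
        timed_out = False
    except subprocess.TimeoutExpired:
        # proc.pid is the program itself, so this reaches the right process
        proc.terminate()
        output, _ = proc.communicate()
        timed_out = True
    return (proc.returncode, output, timed_out)
```

The trade-off is that shell features (pipes, redirection, `;`) are no longer available in `--program`; a command needing them would have to invoke a shell explicitly.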

1

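My guess is that differences in the available packages cause different branches of your "subprocess instantiation wrapper" to be used on the two machines. In one branch you use SIGTERM (the terminate() call), in the other SIGKILL.

That said, sleep appears to end prematurely when given either signal. There may be other differences, but it's hard to say. You would do best to put in some debugging code to see what happens on which machine.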
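As one possible form of such debugging code, the sketch below (Python 3; the name `debug_environment` and the choice of facts to collect are illustrative assumptions) gathers a few things that could differ between two machines: the Python version, which launcher branch would be taken, and what /bin/sh resolves to.

```python
import os
import sys


def debug_environment():
    """Collect facts that might differ between two machines (illustrative)."""
    info = {"python": sys.version.split()[0]}
    try:
        import subprocess  # noqa: F401  (only checking availability)
        info["launcher"] = "subprocess"
    except ImportError:
        info["launcher"] = "popen2 fallback"
    # /bin/sh is often a symlink; resolve it to see which shell it really is
    info["sh"] = os.path.realpath("/bin/sh")
    return info


if __name__ == "__main__":
    for key, value in sorted(debug_environment().items()):
        print("%s: %s" % (key, value))
```

Running this on both nodes and comparing the output line by line would show whether the two machines take the same branch and use the same shell.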
