2011-03-06 54 views
10

Python - seeking in an HTTP response stream

Using urllib (or urllib2) to do what I want seems hopeless. Any solutions?

+0

What do you mean by 'seeking in an HTTP response stream'? – phooji 2011-03-06 07:00:35

+0

I used to work in C#, and the implementation I have in mind is something like `WebClient.OpenRead().Seek()`. – 2011-03-12 19:14:05

+0

A simple wrapper object can give you this functionality using the HTTP Range header: http://stackoverflow.com/questions/7829311/is-there-a-library-for-retrieving-a-file-from-a-remote-zip/7852229#7852229 – retracile 2011-10-21 16:29:46

Answers

22

I don't know how the C# implementation works but, as internet streams are generally not seekable, my guess is that it downloads all the data to a local file or in-memory object and seeks within it from there. The Python equivalent of this would be to do as Abafei suggests: write the data to a file or StringIO and seek in that.

However, if, as your comment on Abafei's answer suggests, you want to retrieve only a particular part of the file (rather than seeking backwards and forwards through the returned data), there is another possibility. urllib2 can be used to retrieve a certain section (or 'range' in HTTP parlance) of a webpage, provided that the server supports this behaviour.

The Range header

When you send a request to a server, the parameters of the request are given in various headers. One of these is the Range header, defined in section 14.35 of RFC 2616 (the specification defining HTTP/1.1). This header allows you to do things such as retrieve all data starting from the 10,000th byte, or the data between bytes 1,000 and 1,500.
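As a minimal sketch, this is how such a header can be attached to a request with Python 3's urllib.request (the URL is simply the one used in the examples later in this answer; byte positions are inclusive):

```python
import urllib.request

# Build a request asking only for bytes 1000-1500 (inclusive).
# A server that supports ranges answers with 206 Partial Content;
# one that does not will simply send the whole resource.
request = urllib.request.Request("http://www.python.org/")
request.add_header("Range", "bytes=1000-1500")
```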

Support in the server

There is no requirement for a server to support range retrieval. Some servers will return the Accept-Ranges header (section 14.5 of RFC 2616) along with a response to report whether or not they support ranges. This could be checked using a HEAD request. However, there is no particular need to do this; if the server does not support ranges, it will return the entire page and we can then extract the desired portion of data in Python as before.
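Such a HEAD check could be sketched as follows with Python 3's urllib.request (the function name is my own; note it costs a network round trip, which, as mentioned, is optional):

```python
import urllib.request

def supports_ranges(url):
    """Send a HEAD request and report whether the server advertises
    byte-range support via the Accept-Ranges header."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Accept-Ranges", "none").lower() == "bytes"
```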

Checking whether a range was returned

If a server returns a range, it must send the Content-Range header (section 14.16 of RFC 2616) along with the response. If this is present in the headers of the response, we know a range was returned; if it is not present, the entire page was returned.
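The value of that header has the form `bytes start-end/total`, with `*` in place of the total when the size is unknown. A small helper (the names here are my own choosing) that pulls those fields apart might look like this:

```python
def parse_content_range(value):
    """Split a Content-Range value such as 'bytes 6000-6399/19387'
    into (start, end, total); total is None when the server sent '*'."""
    unit, _, rest = value.partition(" ")
    byte_range, _, total = rest.partition("/")
    start, _, end = byte_range.partition("-")
    return int(start), int(end), None if total == "*" else int(total)
```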

An implementation with urllib2

urllib2 allows us to add headers to a request, thus allowing us to ask the server for a range instead of the entire page. The following script takes a URL, a start position and (optionally) a length on the command line, and attempts to retrieve the given section of the page.

import sys 
import urllib2 

# Check command line arguments. 
if len(sys.argv) < 3: 
    sys.stderr.write("Usage: %s url start [length]\n" % sys.argv[0]) 
    sys.exit(1) 

# Create a request for the given URL. 
request = urllib2.Request(sys.argv[1]) 

# Add the header to specify the range to download. 
if len(sys.argv) > 3: 
    start, length = map(int, sys.argv[2:]) 
    request.add_header("range", "bytes=%d-%d" % (start, start + length - 1)) 
else: 
    request.add_header("range", "bytes=%s-" % sys.argv[2]) 

# Try to get the response. This will raise a urllib2.URLError if there is a 
# problem (e.g., invalid URL). 
response = urllib2.urlopen(request) 

# If a content-range header is present, partial retrieval worked. 
if "content-range" in response.headers: 
    print "Partial retrieval successful." 

    # The header contains the string 'bytes', followed by a space, then the 
    # range in the format 'start-end', followed by a slash and then the total 
    # size of the page (or an asterisk if the total size is unknown). Let's 
    # get the range and total size from this. 
    range, total = response.headers['content-range'].split(' ')[-1].split('/') 

    # Print a message giving the range information. 
    if total == '*': 
     print "Bytes %s of an unknown total were retrieved." % range 
    else: 
     print "Bytes %s of a total of %s were retrieved." % (range, total) 

# No header, so partial retrieval was unsuccessful. 
else: 
    print "Unable to use partial retrieval." 

# And for good measure, let's check how much data we downloaded. 
data = response.read() 
print "Retrieved data size: %d bytes" % len(data) 

Using this, I can retrieve 400 bytes from the middle of the Python home page:

$ python retrieverange.py http://www.python.org/ 6000 400 
Partial retrieval successful. 
Bytes 6000-6399 of a total of 19387 were retrieved. 
Retrieved data size: 400 bytes 

Or the last 2,000 bytes of the home page:

$ python retrieverange.py http://www.python.org/ 17387 
Partial retrieval successful. 
Bytes 17387-19386 of a total of 19387 were retrieved. 
Retrieved data size: 2000 bytes 

Google, however, does not support ranges:

$ python retrieverange.py http://www.google.com/ 1000 500 
Unable to use partial retrieval. 
Retrieved data size: 9621 bytes 

In this case, it would be necessary to extract the data of interest in Python prior to any further processing.
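In other words, when the Range header is ignored we fall back to plain slicing of the full body. A minimal sketch (the sizes simply mirror the Google example above, and the byte string stands in for `response.read()`):

```python
start, length = 1000, 500

# Stand-in for response.read() from a server that ignored the Range header.
data = b"x" * 9621

# Slice out the section we actually wanted.
section = data[start:start + length]
```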

3

It is probably best to just write the data to a file (or a string, using StringIO) and to seek in that file (or string).
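A minimal sketch of that approach, using io.BytesIO as the in-memory object (in Python 3, BytesIO rather than StringIO is the right choice for binary response data; the byte string merely stands in for a downloaded response body):

```python
import io

# Pretend this 1000-byte string is what response.read() returned.
buffer = io.BytesIO(b"0123456789" * 100)

# The in-memory copy can now be seeked over freely, like a local file.
buffer.seek(900)        # jump to byte 900
tail = buffer.read()    # read the final 100 bytes
```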

+3

Let's say that, of a 1 MB response, the first 900 KB are useless to me, so this is an opportunity to speed up the process by not downloading them. – 2011-03-12 19:11:25

0

I did not find any existing implementations of a file-like interface with seek() to HTTP URLs, so I rolled my own simple version: https://github.com/valgur/pyhttpio. It depends on urllib.request, but could probably easily be modified to use requests if necessary.

The full code:

import cgi 
import time 
import urllib.error 
import urllib.request 
from io import IOBase 
from sys import stderr 


class SeekableHTTPFile(IOBase): 
    def __init__(self, url, name=None, repeat_time=-1, debug=False): 
        """Allow a file accessible via HTTP to be used like a local file by utilities 
        that use `seek()` to read arbitrary parts of the file, such as `ZipFile`. 
        Seeking is done via the 'range: bytes=xx-yy' HTTP header. 

        Parameters 
        ---------- 
        url : str 
            A HTTP or HTTPS URL 
        name : str, optional 
            The filename of the file. 
            Will be filled from the Content-Disposition header if not provided. 
        repeat_time : int, optional 
            In case of HTTP errors wait `repeat_time` seconds before trying again. 
            Negative value or `None` disables retrying and simply passes on the exception (the default). 
        """ 
        super().__init__() 
        self.url = url 
        self.name = name 
        self.repeat_time = repeat_time 
        self.debug = debug 
        self._pos = 0 
        self._seekable = True 
        with self._urlopen() as f: 
            if self.debug: 
                print(f.getheaders()) 
            self.content_length = int(f.getheader("Content-Length", -1)) 
            if self.content_length < 0: 
                self._seekable = False 
            if f.getheader("Accept-Ranges", "none").lower() != "bytes": 
                self._seekable = False 
            if name is None: 
                header = f.getheader("Content-Disposition") 
                if header: 
                    value, params = cgi.parse_header(header) 
                    self.name = params["filename"] 

    def seek(self, offset, whence=0): 
        if not self.seekable(): 
            raise OSError 
        if whence == 0: 
            self._pos = 0 
        elif whence == 1: 
            pass 
        elif whence == 2: 
            self._pos = self.content_length 
        self._pos += offset 
        return self._pos 

    def seekable(self, *args, **kwargs): 
        return self._seekable 

    def readable(self, *args, **kwargs): 
        return not self.closed 

    def writable(self, *args, **kwargs): 
        return False 

    def read(self, amt=-1): 
        if self._pos >= self.content_length: 
            return b"" 
        if amt < 0: 
            end = self.content_length - 1 
        else: 
            end = min(self._pos + amt - 1, self.content_length - 1) 
        byte_range = (self._pos, end) 
        self._pos = end + 1 
        with self._urlopen(byte_range) as f: 
            return f.read() 

    def readall(self): 
        return self.read(-1) 

    def tell(self): 
        return self._pos 

    def __getattribute__(self, item): 
        attr = object.__getattribute__(self, item) 
        if not object.__getattribute__(self, "debug"): 
            return attr 

        if hasattr(attr, '__call__'): 
            def trace(*args, **kwargs): 
                a = ", ".join(map(str, args)) 
                if kwargs: 
                    a += ", ".join(["{}={}".format(k, v) for k, v in kwargs.items()]) 
                print("Calling: {}({})".format(item, a)) 
                return attr(*args, **kwargs) 

            return trace 
        else: 
            return attr 

    def _urlopen(self, byte_range=None): 
        header = {} 
        if byte_range: 
            header = {"range": "bytes={}-{}".format(*byte_range)} 
        while True: 
            try: 
                r = urllib.request.Request(self.url, headers=header) 
                return urllib.request.urlopen(r) 
            except urllib.error.HTTPError as e: 
                if self.repeat_time is None or self.repeat_time < 0: 
                    raise 
                print("Server responded with " + str(e), file=stderr) 
                print("Sleeping for {} seconds before trying again".format(self.repeat_time), file=stderr) 
                time.sleep(self.repeat_time) 

A potential usage example:

from zipfile import ZipFile 

url = "https://www.python.org/ftp/python/3.5.0/python-3.5.0-embed-amd64.zip" 
f = SeekableHTTPFile(url, debug=True) 
zf = ZipFile(f) 
zf.printdir() 
zf.extract("python.exe") 

Edit: There is actually a mostly identical, if slightly more minimal, implementation in this answer: https://stackoverflow.com/a/7852229/2997179