2011-03-06 54 views
10

Python - seeking in an HTTP response stream

Using urllib (or urllib2) to do what I want seems hopeless. Any solutions?

+0

What do you mean by 'seeking in an HTTP response stream'? – phooji 2011-03-06 07:00:35

+0

I used to work in C#, and the implementation I have in mind is something like `WebClient.OpenRead().Seek()`. – 2011-03-12 19:14:05

+0

A simple wrapper object can give you this functionality using the HTTP Range header: http://stackoverflow.com/questions/7829311/is-there-a-library-for-retrieving-a-file-from-a-remote-zip/7852229#7852229 – retracile 2011-10-21 16:29:46

Answers

22

I don't know how the C# implementation works but, as internet streams are generally not seekable, my guess is that it downloads all the data to a local file or in-memory object and seeks within it from there. The Python equivalent of this would be to do as Abafei suggests: write the data to a file or StringIO and seek in that.

However, if, as your comment on Abafei's answer suggests, you want to retrieve only a particular part of the file (rather than seeking backwards and forwards through the returned data), there is another possibility. urllib2 can be used to retrieve a certain section (or 'range' in HTTP parlance) of a webpage, provided that the server supports this behaviour.

The Range header

When you send a request to a server, the parameters of the request are given in various headers. One of these is the Range header, defined in section 14.35 of RFC 2616 (the specification defining HTTP/1.1). This header allows you to do things such as retrieve all data starting from the 10,000th byte, or the data between bytes 1,000 and 1,500.
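As a minimal sketch, this is how such a header can be attached to a request with Python 3's urllib.request (the URL is simply the one used in the examples later in this answer; byte positions are inclusive):

```python
import urllib.request

# Build a request asking only for bytes 1000-1500 (inclusive).
# A server that supports ranges answers with 206 Partial Content;
# one that does not will simply send the whole resource.
request = urllib.request.Request("http://www.python.org/")
request.add_header("Range", "bytes=1000-1500")
```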

Support in the server

There is no requirement for a server to support range retrieval. Some servers will return the Accept-Ranges header (section 14.5 of RFC 2616) along with a response to report whether or not they support ranges. This could be checked using a HEAD request. However, there is no particular need to do this; if the server does not support ranges, it will return the entire page and we can then extract the desired portion of data in Python as before.
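Such a HEAD check could be sketched as follows with Python 3's urllib.request (the function name is my own; note it costs a network round trip, which, as mentioned, is optional):

```python
import urllib.request

def supports_ranges(url):
    """Send a HEAD request and report whether the server advertises
    byte-range support via the Accept-Ranges header."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Accept-Ranges", "none").lower() == "bytes"
```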

Checking whether a range was returned

If a server returns a range, it must send the Content-Range header (section 14.16 of RFC 2616) along with the response. If this is present in the headers of the response, we know a range was returned; if it is not present, the entire page was returned.
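The value of that header has the form `bytes start-end/total`, with `*` in place of the total when the size is unknown. A small helper (the names here are my own choosing) that pulls those fields apart might look like this:

```python
def parse_content_range(value):
    """Split a Content-Range value such as 'bytes 6000-6399/19387'
    into (start, end, total); total is None when the server sent '*'."""
    unit, _, rest = value.partition(" ")
    byte_range, _, total = rest.partition("/")
    start, _, end = byte_range.partition("-")
    return int(start), int(end), None if total == "*" else int(total)
```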

An implementation with urllib2

urllib2 allows us to add headers to a request, thus allowing us to ask the server for a range instead of the entire page. The following script takes a URL, a start position and (optionally) a length on the command line, and attempts to retrieve the given section of the page.

import sys 
import urllib2 

# Check command line arguments. 
if len(sys.argv) < 3: 
    sys.stderr.write("Usage: %s url start [length]\n" % sys.argv[0]) 
    sys.exit(1) 

# Create a request for the given URL. 
request = urllib2.Request(sys.argv[1]) 

# Add the header to specify the range to download. 
if len(sys.argv) > 3: 
    start, length = map(int, sys.argv[2:]) 
    request.add_header("range", "bytes=%d-%d" % (start, start + length - 1)) 
else: 
    request.add_header("range", "bytes=%s-" % sys.argv[2]) 

# Try to get the response. This will raise a urllib2.URLError if there is a 
# problem (e.g., invalid URL). 
response = urllib2.urlopen(request) 

# If a content-range header is present, partial retrieval worked. 
if "content-range" in response.headers: 
    print "Partial retrieval successful." 

    # The header contains the string 'bytes', followed by a space, then the 
    # range in the format 'start-end', followed by a slash and then the total 
    # size of the page (or an asterisk if the total size is unknown). Let's 
    # get the range and total size from this. 
    range, total = response.headers['content-range'].split(' ')[-1].split('/') 

    # Print a message giving the range information. 
    if total == '*': 
     print "Bytes %s of an unknown total were retrieved." % range 
    else: 
     print "Bytes %s of a total of %s were retrieved." % (range, total) 

# No header, so partial retrieval was unsuccessful. 
else: 
    print "Unable to use partial retrieval." 

# And for good measure, let's check how much data we downloaded. 
data = response.read() 
print "Retrieved data size: %d bytes" % len(data) 

Using this, I can retrieve 400 bytes from the middle of the Python home page:

$ python retrieverange.py http://www.python.org/ 6000 400 
Partial retrieval successful. 
Bytes 6000-6399 of a total of 19387 were retrieved. 
Retrieved data size: 400 bytes 

Or the last 2,000 bytes of the home page:

$ python retrieverange.py http://www.python.org/ 17387 
Partial retrieval successful. 
Bytes 17387-19386 of a total of 19387 were retrieved. 
Retrieved data size: 2000 bytes 

Google, however, does not support ranges:

$ python retrieverange.py http://www.google.com/ 1000 500 
Unable to use partial retrieval. 
Retrieved data size: 9621 bytes 

In this case, it would be necessary to extract the data of interest in Python prior to any further processing.
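In other words, when the Range header is ignored we fall back to plain slicing of the full body. A minimal sketch (the sizes simply mirror the Google example above, and the byte string stands in for `response.read()`):

```python
start, length = 1000, 500

# Stand-in for response.read() from a server that ignored the Range header.
data = b"x" * 9621

# Slice out the section we actually wanted.
section = data[start:start + length]
```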

3

It is probably best to just write the data to a file (or a string, using StringIO) and to seek in that file (or string).
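A minimal sketch of that approach, using io.BytesIO as the in-memory object (in Python 3, BytesIO rather than StringIO is the right choice for binary response data; the byte string merely stands in for a downloaded response body):

```python
import io

# Pretend this 1000-byte string is what response.read() returned.
buffer = io.BytesIO(b"0123456789" * 100)

# The in-memory copy can now be seeked over freely, like a local file.
buffer.seek(900)        # jump to byte 900
tail = buffer.read()    # read the final 100 bytes
```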

+3

Let's say that, of a 1 MB response, the first 900 KB are useless to me, so this is an opportunity to speed up the process by not downloading them. – 2011-03-12 19:11:25

0

I did not find any existing implementations of a file-like interface with seek() to HTTP URLs, so I rolled my own simple version: https://github.com/valgur/pyhttpio. It depends on urllib.request, but could probably easily be modified to use requests if necessary.

The full code:

import cgi 
import time 
import urllib.error 
import urllib.request 
from io import IOBase 
from sys import stderr 


class SeekableHTTPFile(IOBase): 
    def __init__(self, url, name=None, repeat_time=-1, debug=False): 
        """Allow a file accessible via HTTP to be used like a local file by utilities 
        that use `seek()` to read arbitrary parts of the file, such as `ZipFile`. 
        Seeking is done via the 'range: bytes=xx-yy' HTTP header. 

        Parameters 
        ---------- 
        url : str 
            A HTTP or HTTPS URL 
        name : str, optional 
            The filename of the file. 
            Will be filled from the Content-Disposition header if not provided. 
        repeat_time : int, optional 
            In case of HTTP errors wait `repeat_time` seconds before trying again. 
            Negative value or `None` disables retrying and simply passes on the exception (the default). 
        """ 
        super().__init__() 
        self.url = url 
        self.name = name 
        self.repeat_time = repeat_time 
        self.debug = debug 
        self._pos = 0 
        self._seekable = True 
        with self._urlopen() as f: 
            if self.debug: 
                print(f.getheaders()) 
            self.content_length = int(f.getheader("Content-Length", -1)) 
            if self.content_length < 0: 
                self._seekable = False 
            if f.getheader("Accept-Ranges", "none").lower() != "bytes": 
                self._seekable = False 
            if name is None: 
                header = f.getheader("Content-Disposition") 
                if header: 
                    value, params = cgi.parse_header(header) 
                    self.name = params["filename"] 

    def seek(self, offset, whence=0): 
        if not self.seekable(): 
            raise OSError 
        if whence == 0: 
            self._pos = 0 
        elif whence == 1: 
            pass 
        elif whence == 2: 
            self._pos = self.content_length 
        self._pos += offset 
        return self._pos 

    def seekable(self, *args, **kwargs): 
        return self._seekable 

    def readable(self, *args, **kwargs): 
        return not self.closed 

    def writable(self, *args, **kwargs): 
        return False 

    def read(self, amt=-1): 
        if self._pos >= self.content_length: 
            return b"" 
        if amt < 0: 
            end = self.content_length - 1 
        else: 
            end = min(self._pos + amt - 1, self.content_length - 1) 
        byte_range = (self._pos, end) 
        self._pos = end + 1 
        with self._urlopen(byte_range) as f: 
            return f.read() 

    def readall(self): 
        return self.read(-1) 

    def tell(self): 
        return self._pos 

    def __getattribute__(self, item): 
        attr = object.__getattribute__(self, item) 
        if not object.__getattribute__(self, "debug"): 
            return attr 

        if hasattr(attr, '__call__'): 
            def trace(*args, **kwargs): 
                a = ", ".join(map(str, args)) 
                if kwargs: 
                    a += ", ".join(["{}={}".format(k, v) for k, v in kwargs.items()]) 
                print("Calling: {}({})".format(item, a)) 
                return attr(*args, **kwargs) 

            return trace 
        else: 
            return attr 

    def _urlopen(self, byte_range=None): 
        header = {} 
        if byte_range: 
            header = {"range": "bytes={}-{}".format(*byte_range)} 
        while True: 
            try: 
                r = urllib.request.Request(self.url, headers=header) 
                return urllib.request.urlopen(r) 
            except urllib.error.HTTPError as e: 
                if self.repeat_time is None or self.repeat_time < 0: 
                    raise 
                print("Server responded with " + str(e), file=stderr) 
                print("Sleeping for {} seconds before trying again".format(self.repeat_time), file=stderr) 
                time.sleep(self.repeat_time) 

A potential usage example:

from zipfile import ZipFile 

url = "https://www.python.org/ftp/python/3.5.0/python-3.5.0-embed-amd64.zip" 
f = SeekableHTTPFile(url, debug=True) 
zf = ZipFile(f) 
zf.printdir() 
zf.extract("python.exe") 

Edit: There is actually a mostly identical, if slightly more minimal, implementation in this answer: https://stackoverflow.com/a/7852229/2997179