How can I use boto to stream a file out of Amazon S3 to Rackspace Cloudfiles?

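I'm copying a file from S3 to Cloudfiles, and I would like to avoid writing the file to disk. The Python-Cloudfiles library has an object.stream() call that looks like what I need, but I can't find an equivalent call in boto. I'm hoping I can do something like: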

shutil.copyfileobj(s3Object.stream(),rsObject.stream())

Is this possible with boto (or, I suppose, any other S3 library)?

2011-10-02 joemastersemison

The [smart_open](https://github.com/piskvorky/smart_open) Python library does that (both for reading and writing). – Radim

The Key object in boto, which represents an object in S3, can be used like an iterator, so you should be able to do something like this:

>>> import boto
>>> c = boto.connect_s3()
>>> bucket = c.lookup('garnaat_pub')
>>> key = bucket.lookup('Scan1.jpg')
>>> for chunk in key:
...     pass  # write chunk to the output stream here

Or, as in the case of your example, you could do:

>>> shutil.copyfileobj(key, rsObject.stream())
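As a rough illustration of what that copy does, here is the same call with in-memory stand-ins. The two `io.BytesIO` objects are placeholders (assumptions made so the snippet runs without credentials) for the boto key and for whatever writable object the Cloudfiles side provides. The third argument is the buffer size, so only that many bytes are read per step.

```python
import io
import shutil

# Stand-ins for the real endpoints: any object with read(n) works as the
# source, and any object with write(data) works as the destination.
source = io.BytesIO(b"x" * 100000)   # placeholder for the boto Key
destination = io.BytesIO()           # placeholder for the Cloudfiles writer

# Copy in 16 KiB pieces instead of reading the whole payload at once.
shutil.copyfileobj(source, destination, 16 * 1024)

copied = destination.getvalue()
print(len(copied))
```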

2011-10-02 07:54:34 garnaat

Such a well-designed library :) – ehacinom

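I figure at least some of the people seeing this question will be like me and will want a way to stream a file from boto line by line (or comma by comma, or by any other delimiter). Here's a simple way to do that: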

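from boto.s3.connection import S3Connection

def getS3ResultsAsIterator(self, aws_access_info, key, prefix):
    # aws_access_info holds the keyword arguments for S3Connection
    s3_conn = S3Connection(**aws_access_info)
    # note: 'key' is used here as the bucket name
    bucket_obj = s3_conn.get_bucket(key)
    # go through the list of files under the prefix
    for f in bucket_obj.list(prefix=prefix):
        unfinished_line = ''
        for byte in f:
            byte = unfinished_line + byte
            # split on whatever, or use a regex with re.split()
            lines = byte.split('\n')
            unfinished_line = lines.pop()
            for line in lines:
                yield line
        # emit the last line if the file does not end with a newline
        if unfinished_line:
            yield unfinished_line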

@garnaat's answer above is still great and 100% true. Hopefully mine still helps someone out.
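The chunk-to-line logic does not depend on boto, so it can be tried in isolation. A minimal sketch, where `iter_lines` and `sample_chunks` are hypothetical names and the chunks are assumed to be text strings; it also emits a trailing line that has no final delimiter:

```python
def iter_lines(chunks, delimiter="\n"):
    """Yield delimiter-separated lines from an iterable of text chunks."""
    unfinished = ""
    for chunk in chunks:
        parts = (unfinished + chunk).split(delimiter)
        # The last piece may be cut off mid-line; hold it for the next chunk.
        unfinished = parts.pop()
        for part in parts:
            yield part
    # Emit whatever is left once the chunks run out (a final line without
    # a trailing delimiter).
    if unfinished:
        yield unfinished


# Chunks deliberately split in the middle of lines.
sample_chunks = ["alpha\nbe", "ta\ngam", "ma"]
result = list(iter_lines(sample_chunks))
print(result)
```

Because the function is a generator, only one chunk plus one partial line is held in memory at any time.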

2013-06-03 04:29:35 Eli

To split on the other two types of line endings: `lines = re.split(r'[\n\r]+', byte)` - helpful for CSV files exported from Excel – marcfrodi

One more note: I had to add `yield unfinished_line` after the `for byte in f:` loop completed, otherwise the last line would not get processed. – marcfrodi

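Is there a good reason why this isn't part of the Boto3 API? If not, should someone submit a pull request to fix it? I'd be super down for putting something like that together! – lol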

The other answers in this thread are about boto, but S3.Object is no longer iterable in boto3. So the following DOES NOT WORK; it produces a TypeError: 's3.Object' object is not iterable error message:

import io
import boto3

s3 = boto3.session.Session(profile_name=my_profile).resource('s3')
s3_obj = s3.Object(bucket_name=my_bucket, key=my_key)

with io.FileIO('sample.txt', 'w') as file:
    for i in s3_obj:
        file.write(i)

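In boto3, the contents of the object are available at S3.Object.get()['Body'], which (at the time of writing) is not iterable either, so the following still DOES NOT WORK: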

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    for i in body:
        file.write(i)

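So an alternative is to use the read method, but this loads the WHOLE S3 object into memory, which is not always possible when dealing with large files: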

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    # read() with no argument pulls the entire object into memory first
    file.write(body.read())

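But the read method accepts an amt parameter that specifies the number of bytes to read from the underlying stream. The method can be called repeatedly until the whole stream has been read: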

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    while file.write(body.read(amt=512)):
        pass
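The same read-until-empty loop can be written with the two-argument form of `iter()`, which may read a little more clearly. This is only a sketch: `copy_in_chunks` is a hypothetical helper name, and `io.BytesIO` stands in for `s3_obj.get()['Body']` so that it runs offline.

```python
import io
from functools import partial


def copy_in_chunks(body, destination, chunk_size=512):
    """Copy body to destination chunk_size bytes at a time; return bytes copied."""
    total = 0
    # iter(callable, sentinel) keeps calling body.read(chunk_size) until it
    # returns b"", which signals the end of the stream.
    for chunk in iter(partial(body.read, chunk_size), b""):
        destination.write(chunk)
        total += len(chunk)
    return total


# io.BytesIO stands in for the streaming body of a real S3 object.
fake_body = io.BytesIO(b"0123456789" * 200)
out = io.BytesIO()
copied = copy_in_chunks(fake_body, out)
print(copied)
```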

Digging into the botocore.response.StreamingBody code, one finds that the underlying stream is also available, so we can iterate over it as follows:

body = s3_obj.get()['Body']
with io.FileIO('sample.txt', 'w') as file:
    for b in body._raw_stream:
        file.write(b)
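Reaching into `_raw_stream` relies on a private attribute. As far as I know, newer botocore releases add a public `iter_chunks()` method to `StreamingBody`; treat that as an assumption and check your installed version. The sketch below (with the hypothetical helper name `iter_body`) uses that method when it is present and otherwise falls back to repeated `read()` calls. `io.BytesIO` stands in for the real body.

```python
import io


def iter_body(body, chunk_size=1024):
    """Yield chunks from body without touching private attributes."""
    iter_chunks = getattr(body, "iter_chunks", None)
    if iter_chunks is not None:
        # Assumed public API on newer botocore StreamingBody objects.
        for chunk in iter_chunks(chunk_size):
            yield chunk
        return
    # Fallback: plain read() calls until the stream is exhausted.
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        yield chunk


# io.BytesIO has no iter_chunks(), so this exercises the fallback path.
chunks = list(iter_body(io.BytesIO(b"abcdefghij"), chunk_size=4))
print(chunks)
```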

While googling I also came across some links that could be useful, but I haven't tried them:

2016-11-17 17:32:35 smallo

Very useful answer. Thanks @smallo. I appreciate that you exposed the private _raw_stream, which is what I think most people were looking for. – saccharine

Here's my solution for wrapping the streaming body:

import io
import boto3

class S3ObjectInterator(io.RawIOBase):
    def __init__(self, bucket, key):
        """Initialize with S3 bucket and key names"""
        self.s3c = boto3.client('s3')
        self.obj_stream = self.s3c.get_object(Bucket=bucket, Key=key)['Body']

    def read(self, n=-1):
        """Read from the stream"""
        return self.obj_stream.read() if n == -1 else self.obj_stream.read(n)

Example usage:

obj_stream = S3ObjectInterator(bucket, key)
for line in obj_stream:
    print(line)
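A variation on this wrapper, sketched under the assumption that the body only needs a `read(n)` method: implementing `readable()` and `readinto()` lets the standard `io.BufferedReader` do the buffering, so line iteration does not have to pull the stream through many tiny reads. `RawStreamReader` is a hypothetical name, and `io.BytesIO` stands in for `get_object(...)['Body']`.

```python
import io


class RawStreamReader(io.RawIOBase):
    """Expose any object with read(n) as a raw stream (hypothetical helper)."""

    def __init__(self, body):
        super().__init__()
        self._body = body

    def readable(self):
        return True

    def readinto(self, buffer):
        # Fill the caller's buffer with at most len(buffer) bytes.
        data = self._body.read(len(buffer))
        buffer[:len(data)] = data
        return len(data)


# io.BytesIO stands in for the streaming body of a real S3 object.
fake_body = io.BytesIO(b"first\nsecond\nthird")
reader = io.BufferedReader(RawStreamReader(fake_body))
lines = [line for line in reader]
print(lines)
```

Wrapping `reader` in `io.TextIOWrapper` would give decoded text lines instead of bytes.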

2016-11-28 22:26:10 jzhou
