Looping through a binary file looking for specific bytes

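I want to find the points in a binary file where specific bytes occur. For example, say I want to find all instances in my file that start with the two bytes: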

AB C3

and end with the two bytes:

AB C4

Right now I'm doing:

while True:
    byte = file.read(1)
    if not byte:
        break
    if ord(byte) == 171:

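But then how would I continue the loop so that, once I find the first AB, I check whether the next consecutive byte is C3? And once I've found C3, how would I keep reading bytes in a loop until the sequence AB C4 (if it exists) without messing up my overall loop structure?

I'm having trouble because I'm not sure how to approach this with Python's read and seek functions. Should I keep a pointer to seek back to when I find the sequences? Is there a simple way to do what I want in Python that I just don't know about?

Thanks.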


2017-03-06 J. Doe

How big is the file (i.e., can you read it all into memory)? – Marat

Also, have you considered just doing 'grep -aob "\xab\xc4"'? – Marat

What version of Python are you using? – martineau

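Assuming you can read the whole file into memory:

import re
import operator

# filename: path to the binary file to scan
with open(filename, 'rb') as file:
    data = file.read()

# Non-greedy match from AB C3 up to the next AB C4; re.DOTALL lets '.' match 0x0A too.
matches = [(i.start(), i.end())
           for i in re.finditer(b'\xab\xc3.*?\xab\xc4', data, re.DOTALL)]

Each tuple in matches contains a start index and a stop index just past the final C4 (in slice notation, where the stop index is one position past the last byte). The slices are all non-overlapping.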

If you want the indices of all overlapping matches, you'll need to transform matches along these lines:

overlapping = [(start, stop) for start in map(operator.itemgetter(0), matches) for stop in map(operator.itemgetter(1), matches) if start < stop]
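To make the slice semantics concrete, here is a minimal sketch of the in-memory regex approach on a small, made-up byte string. The sample data and the capture group are assumptions added for illustration; the pattern uses a non-greedy body and re.DOTALL so that a 0x0A byte inside a run does not break the match.

```python
import re

# Hypothetical sample data: two delimited runs with filler bytes around them.
data = b'\x00\x01\xab\xc3hello\xab\xc4\x02\xab\xc3wor\nld\xab\xc4\x03'

# Non-greedy body between the markers; the group captures only the payload.
pattern = re.compile(b'\xab\xc3(.*?)\xab\xc4', re.DOTALL)

# (start, stop) pairs in slice notation, markers included.
spans = [(m.start(), m.end()) for m in pattern.finditer(data)]

# Just the bytes between each AB C3 / AB C4 pair.
payloads = [m.group(1) for m in pattern.finditer(data)]

print(spans)
print(payloads)
```

Slicing data with any pair from spans gives the run including both markers, while payloads holds only the bytes in between.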


2017-03-06 19:04:36

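Well, if you can't read the whole file into memory, you can accomplish this by iterating over the bytes. I've used a deque as an auxiliary data structure, taking advantage of the maxlen parameter to scan every consecutive pair of bytes. To let me use a for-loop instead of an error-prone while-loop, I use two-argument iter to iterate over the file byte by byte, i.e. iter(callable, sentinel).

First, let's build a test case: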

>>> import io, functools 
>>> import random 
>>> some_bytes = bytearray([random.randint(0, 255) for _ in range(12)] + [171, 195] + [88, 42, 88, 42, 88, 42] + [171, 196]+[200, 211, 141]) 
>>> some_bytes 
bytearray(b'\x80\xc4\x8b\x86i\x88\xba\x8a\x8b\x07\x9en\xab\xc3X*X*X*\xab\xc4\xc8\xd3\x8d') 
>>>
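As a brief aside, the two tools mentioned above can be seen in isolation in the following sketch. The four-byte input is a made-up example; the point is only that a deque with maxlen=2 always holds the last two items appended, and that iter(callable, sentinel) keeps calling the callable until it returns the sentinel.

```python
import functools
import io
from collections import deque

window = deque(maxlen=2)   # holds at most the two most recent bytes
seen = []                  # snapshot of the window after each read

f = io.BytesIO(b'\x01\xab\xc3\x02')  # hypothetical stand-in for a real file

# f.read(1) returns b'' at end of file, which stops the iteration.
for b in iter(functools.partial(f.read, 1), b''):
    window.append(b)       # the oldest byte falls off once two are held
    seen.append(list(window))

print(seen)
```

Each snapshot is one consecutive pair of bytes (apart from the very first, which has only one element), so comparing the window against a two-byte marker checks every position in the stream.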

And now, some preliminaries:

>>> from collections import deque 
>>> start = deque([b'\xab', b'\xc3']) 
>>> stop = deque([b'\xab', b'\xc4']) 
>>> current = deque(maxlen=2) 
>>> target = [] 
>>> inside = False

Let's pretend we are reading from a file:

>>> f = io.BytesIO(some_bytes)

Now, create a convenient byte-by-byte iterable:

>>> read_byte = functools.partial(f.read, 1)

Now we can loop much more easily:

>>> for b in iter(read_byte, b''):
...     current.append(b)
...     if not inside and current == start:
...         inside = True
...         continue
...     if inside and current == stop:
...         break
...     if inside:
...         target.append(b)
... 
>>> target 
[b'X', b'*', b'X', b'*', b'X', b'*', b'\xab'] 
>>>

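You'll notice this leaves the first byte of the "end" marker in target. That is simple to clean up, though. Here's a more fleshed-out example, where there are several "runs" of bytes between delimiters: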

>>> some_bytes = some_bytes * 3 
>>> start = deque([b'\xab', b'\xc3']) 
>>> stop = deque([b'\xab', b'\xc4']) 
>>> current = deque(maxlen=2) 
>>> targets = [] 
>>> target = [] 
>>> inside = False 
>>> f = io.BytesIO(some_bytes) 
>>> read_byte = functools.partial(f.read, 1) 
>>> for b in iter(read_byte, b''):
...     current.append(b)
...     if not inside and current == start:
...         inside = True
...         continue
...     if inside and current == stop:
...         inside = False
...         target.pop()
...         targets.append(target)
...         target = []
...     if inside:
...         target.append(b)
... 
b'\xab' 
b'\xab' 
b'\xab' 
>>> targets 
[[b'X', b'*', b'X', b'*', b'X', b'*'], [b'X', b'*', b'X', b'*', b'X', b'*'], [b'X', b'*', b'X', b'*', b'X', b'*']] 
>>>

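This approach will be slower than reading the file into memory and using re, but it will be memory-efficient. There may be some edge cases I haven't thought of, but I think extending the approach above to handle them should be straightforward. Also, if there is a "start" byte sequence with no corresponding "stop", the target list will keep growing until the file is exhausted.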
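One way to package the byte-by-byte logic, offered only as a sketch, is a generator that yields each run as a bytes object. The name iter_runs and its defaults are hypothetical; it assumes both markers have the same length, and it simply drops a run whose "start" has no matching "stop", which is one possible policy rather than the only reasonable one.

```python
import functools
import io
from collections import deque


def iter_runs(f, start=b'\xab\xc3', stop=b'\xab\xc4'):
    """Yield the bytes between each start/stop marker pair read from f.

    Hypothetical helper: assumes f is opened in binary mode and that
    both markers have the same length.
    """
    if len(start) != len(stop):
        raise ValueError('this sketch assumes markers of equal length')
    n = len(start)
    window = deque(maxlen=n)   # the last n bytes read
    run = bytearray()          # bytes collected since the last start marker
    inside = False
    for b in iter(functools.partial(f.read, 1), b''):
        window.append(b)
        joined = b''.join(window)
        if not inside and joined == start:
            inside = True
            run.clear()
            window.clear()     # start bytes must not count toward a stop marker
        elif inside:
            run += b
            if joined == stop:
                yield bytes(run[:-n])   # strip the stop marker itself
                inside = False
                window.clear()
    # If the file ends while inside a run, that partial run is discarded.


sample = b'\x00\xab\xc3X*X*\xab\xc4\x99\xab\xc3\xab\xc4\xab\xc3tail'
runs = list(iter_runs(io.BytesIO(sample)))
print(runs)
```

Because it is a generator, only the run currently being collected is held in memory, although a single very long run still grows until its stop marker is seen.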

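Finally, perhaps the best approach is to read the file in manageable chunks and process those chunks with the logic below. This combines space and time efficiency. In pseudo-pseudo-code (f is assumed to be a file opened in binary mode):

chunksize = 1024
# Iterating over a bytes chunk yields ints in Python 3, so the markers (and collected values) are ints.
start = deque([0xAB, 0xC3])
stop = deque([0xAB, 0xC4])
current = deque(maxlen=2)
targets = []
target = []
inside = False
read_chunk = functools.partial(f.read, chunksize)

for bytes_chunk in iter(read_chunk, b''):
    for b in bytes_chunk:
        current.append(b)
        if not inside and current == start:
            inside = True
            continue
        if inside and current == stop:
            inside = False
            target.pop()  # drop the 0xAB that was already appended
            targets.append(target)
            target = []
        if inside:
            target.append(b)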
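As a variation on the chunked idea, the following sketch swaps the deque for bytes.find on a carry-over buffer, so the per-byte work happens inside the built-in search rather than in a Python loop. This is not the approach described above, just an alternative under stated assumptions: the function name find_runs_chunked is hypothetical, f is a binary file object, and a run with no stop marker keeps accumulating in the buffer until the file ends and is then discarded.

```python
import io


def find_runs_chunked(f, start=b'\xab\xc3', stop=b'\xab\xc4', chunksize=1024):
    """Collect the payloads between start/stop markers, reading f in chunks."""
    buf = b''        # unprocessed bytes carried over between chunks
    inside = False   # True while between a start and its stop marker
    runs = []
    for chunk in iter(lambda: f.read(chunksize), b''):
        buf += chunk
        while True:
            marker = stop if inside else start
            pos = buf.find(marker)
            if pos == -1:
                break
            if inside:
                runs.append(buf[:pos])       # payload before the stop marker
            buf = buf[pos + len(marker):]    # continue after the marker
            inside = not inside
        if not inside:
            # Outside a run, only a tail that might be the beginning of a
            # start marker split across two chunks needs to be kept.
            buf = buf[-(len(start) - 1):] if len(start) > 1 else b''
    return runs


sample = b'\x00\xab\xc3X*X*\xab\xc4\x99\xab\xc3hello\xab\xc4'
for size in (1, 2, 3, 1024):
    print(size, find_runs_chunked(io.BytesIO(sample), chunksize=size))
```

Running it with several chunk sizes, including sizes smaller than a marker, is a quick way to check that markers straddling a chunk boundary are still found.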


2017-03-06 19:24:37
