How do I find an unknown pattern in a byte array?

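I am building a tool to help me reverse engineer database files. I am targeting flat files with a fixed record length. What I know:

1) Every record has an index (ID).
2) Records are separated by a delimiter.
3) Every record has a fixed width.
4) Every column in every record contains at least one x00 byte.
5) The file header is at the beginning (I say this because the header does not contain the delimiter).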

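The delimiters I have found in other files are (xFAxFA, xFExFE, xFDxFD), but that is not really relevant, since I may use the tool on different databases in the future. So I need something that can pick out a "pattern" regardless of how many bytes it is. Probably no more than 6 bytes? Any longer and it would probably consume too much data. My experience with this is limited, though.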

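So I guess my question is: how would I find an UNKNOWN delimiter in a large file? I feel that, given what "I know", I should be able to program something. I just don't know where to begin...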

# Really loose pseudo code
# (file is assumed to hold the file contents as a string)
def begin_some_how
  # THIS IS THE PART I NEED HELP WITH...
  # find all non-zero non-ascii sets of 2 or more bytes that repeat more than twice.
end

def check_possible_record_lengths
  possible_delimiter = begin_some_how
  # test if any of the above are always the same number of bytes apart from each other (except one instance, the header...)
  possible_records = file.split(possible_delimiter)
  rec_length_count = possible_records.map { |record| record.length }.uniq.count
  if rec_length_count == 2 # The header will most likely not be the same size.
    puts "Success! We found the fixed record delimiter: #{possible_delimiter}"
  else
    puts "Wrong delimiter found"
  end
end


2015-11-02 Peter Black

possible = [",", "."]
# file is assumed to hold the file contents as a string

result = [0, ""] # [best score so far, delimiter that produced it]
possible.each do |delimiter|
  sizes = file.split(delimiter).map { |record| record.size }
  next if sizes.size < 2

  # This should be the record length if this is the right delimiter
  average = sizes.inject(0.0) { |sum, x| sum + x } / sizes.size

  # Sum of squared deviations from the average. Starting from 0.0 keeps
  # the first size from being added in unsquared.
  deviation = sizes.inject(0.0) { |sum, x| sum + (x - average)**2 }

  matching_value = average / (deviation**2)
  if matching_value > result[0]
    result[0] = matching_value
    result[1] = delimiter
  end
end

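Take advantage of the fact that the records have a constant size. Take every possible delimiter and check how far each record deviates from the usual record length. If the header is small enough compared to the rest of the file, this should work.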
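A variation on the same idea, sketched here as an assumption rather than as part of the answer: instead of the mean and squared deviation, score each candidate by the fraction of pieces that share the most common length, so a single odd-sized header costs only one piece. The names `score_delimiter` and `best_delimiter` and the synthetic data are hypothetical.

```ruby
# Hypothetical helpers, not part of the answer above.

# Fraction of the pieces produced by `delimiter` that share the most
# common piece length. 1.0 means every piece has the same size.
def score_delimiter(data, delimiter)
  sizes = data.b.split(delimiter.b, -1).map(&:bytesize)
  return 0.0 if sizes.size < 3 # too few pieces to judge

  counts = Hash.new(0)
  sizes.each { |size| counts[size] += 1 }
  counts.values.max.to_f / sizes.size
end

# Candidate with the highest score.
def best_delimiter(data, candidates)
  candidates.max_by { |delimiter| score_delimiter(data, delimiter) }
end

# Synthetic file: a 5-byte header, then four 10-byte records that each
# start with the delimiter xFAxFA.
delim = "\xFA\xFA".b
header = "HDR\x00\x01".b
records = (1..4).map { |id| delim + [id].pack("C") + "\x00name\x00\x00".b }
data = header + records.join

candidates = [delim, "\x00".b, "name".b]
p best_delimiter(data, candidates)
```

On this synthetic input the real delimiter should score highest, because four of its five pieces have the same length. The list of candidates still has to come from somewhere, which this sketch does not address.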


2015-11-02 23:44:51 SlySherZ

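This looks good, but what I am really asking is how to find an unknown pattern. I don't want to define the possible delimiters, because they could be any random set of bytes. –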

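@Peter Black That is why you have to test all the possibilities. Think of it this way: if you were looking at the file yourself instead of the computer, how would you guess what is a delimiter and what is not? Try the candidates and see which ones behave like a delimiter. – SlySherZ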
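One way to read "test all the possibilities" is to enumerate every short byte sequence in the file and keep the ones that repeat. The sketch below is a brute-force illustration under that assumption. The name `repeated_sequences` and its parameters are hypothetical, and the approach would likely be too slow and memory-hungry for a large file without sampling.

```ruby
# Hypothetical helper: count every byte sequence whose length is in
# `lengths` (overlapping occurrences included) and return the ones that
# occur more than `min_repeats` times, most frequent first.
def repeated_sequences(data, lengths: 2..6, min_repeats: 2)
  data = data.b
  counts = Hash.new(0)
  lengths.each do |len|
    (0..data.bytesize - len).each do |i|
      counts[data.byteslice(i, len)] += 1
    end
  end
  counts.select { |_seq, n| n > min_repeats }
        .sort_by { |_seq, n| -n }
        .map(&:first)
end

# Synthetic input: three short chunks, each followed by xFExFE.
delim = "\xFE\xFE".b
data = %w[ab cd ef].map { |chunk| chunk.b + delim }.join

p repeated_sequences(data)
```

Each sequence returned is only a candidate. It still has to be checked against the fixed-width property before it can be called a delimiter.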

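So the only way to find an unknown pattern is by looking for it? I was hoping there would be some kind of algorithm that could test for patterns. For example: 1) find all non-zero, non-ASCII sets of 2 or more bytes that repeat more than twice. 2) test whether any of the above are always the same number of bytes apart (except for one instance, the header...). –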
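The two steps in this comment can be sketched roughly as follows. This is a minimal illustration, not a tested tool: it assumes "non-zero, non-ASCII" means bytes in the range 0x80 to 0xFF, caps candidates at 6 bytes as the question suggests, and assumes record data does not place other high bytes right next to the delimiter. All helper names are hypothetical.

```ruby
# Step 1: runs of 2 to 6 bytes in 0x80..0xFF (assumed meaning of
# "non-zero, non-ASCII") that occur more than twice.
def candidate_delimiters(data)
  counts = Hash.new(0)
  data.b.scan(/[\x80-\xFF]{2,6}/n) { |run| counts[run] += 1 }
  counts.select { |_run, n| n > 2 }.keys
end

# Byte offsets of every non-overlapping occurrence of `pattern`.
def offsets(data, pattern)
  found = []
  pos = 0
  while (i = data.index(pattern, pos))
    found << i
    pos = i + pattern.bytesize
  end
  found
end

# Step 2: keep candidates whose occurrences are evenly spaced. The
# header sits before the first occurrence, so it does not affect the gaps.
def fixed_width_delimiters(data)
  data = data.b
  candidate_delimiters(data).select do |cand|
    gaps = offsets(data, cand).each_cons(2).map { |a, b| b - a }
    gaps.uniq.size == 1
  end
end

# Synthetic file: an 8-byte header, then four 10-byte records that each
# start with the delimiter xFAxFA.
delim = "\xFA\xFA".b
header = "HEADER\x00\x01".b
records = (1..4).map { |id| delim + [id].pack("C") + "\x00name\x00\x00".b }
data = header + records.join

p fixed_width_delimiters(data)
```

The constant gap between occurrences is the delimiter length plus the record width, so the same pass also gives a guess at the record length.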
