Find, decode and replace all base64 values in a text file

I have a SQL dump file that contains text with HTML links like:

&lt;a href=&quot;http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;&gt;attached file&lt;/a&gt;

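I want to find, decode and replace the base64 part of the text in each of these links.

I've been trying to use Python with regular expressions and base64 to do the job. However, my regex skills are not up to the task.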

I need to select any string that starts with

'getattachment.php?data='

and ends with

'&quot;'

I then need to decode the part between 'data=' and '&quot;' using base64.b64decode().

The result should look something like this:

&lt;a href=&quot;http://blahblah.org/kb/4/Topcon_data-download_howto.pdf&quot;&gt;attached file&lt;/a&gt;
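To make the target concrete, here is a minimal sketch (Python 3 assumed) that decodes just the payload from the sample link. The variable names are only for illustration. It shows that the decoded text uses '|' as the separator, which is why it has to be swapped for '/' to get the path above.

```python
import base64

# The base64 payload copied from the sample link.
payload = "NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY="

# b64decode returns bytes on Python 3; the sample decodes to plain ASCII.
decoded = base64.b64decode(payload).decode("ascii")
print(decoded)  # 4|Topcon_data-download_howto.pdf

# Swap the '|' separator for '/' to get the path used in the desired link.
path = decoded.replace("|", "/")
print(path)  # 4/Topcon_data-download_howto.pdf
```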

I think the solution will look something like:

import re
import base64
with open('phpkb_articles.sql') as f:
    for line in f:
        re.sub(some_regex_expression_here, some_function_here_to_decode_base64)

Any ideas?

EDIT: Answer for anyone who's interested.

import re
import base64
import sys


def decode_base64(match):
    """
    re.sub callback: decode the captured base64 payload into a file path
    """
    # fix escaped equal signs in some base64 strings
    base64_string = match.group(1).replace('%3D', '=')
    # b64decode returns bytes on Python 3, so decode to text
    decoded_string = base64.b64decode(base64_string).decode('ascii')

    # replace '|' with '/'
    decoded_string = decoded_string.replace('|', '/')

    # escape the spaces in file names
    decoded_string = decoded_string.replace(' ', '%20')

    return 'assets/' + decoded_string + '&quot'


count = 0  # number of lines that failed to decode

# the dot is escaped so it only matches a literal '.'
pattern = r'getattachment\.php\?data=([^&]+?)&quot'

# Open the file and read line by line
with open('phpkb_articles.sql') as f:
    for line in f:
        try:
            # globally substitute in new file path
            edited_line = re.sub(pattern, decode_base64, line)
            # output the edited line to standard out
            sys.stdout.write(edited_line)
        except ValueError:
            # output unedited line if decoding fails to prevent corruption
            # (binascii.Error and UnicodeDecodeError are both ValueErrors)
            sys.stdout.write(line)
            count += 1
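A possible refinement, not part of the script above: the '%3D' and '%20' fix-ups are hand-rolled percent-encoding, and the standard library's urllib.parse can do both directions. The sketch below assumes Python 3, and decode_payload is a hypothetical helper name. Note that quote() escapes more than spaces (any character that is not URL-safe), so its output can differ from the script's for unusual file names.

```python
import base64
from urllib.parse import quote, unquote


def decode_payload(raw):
    """Hypothetical helper: percent-decode, base64-decode, build a URL path."""
    # unquote turns '%3D' back into '='; unlike unquote_plus it leaves '+'
    # alone, which matters because '+' is a valid base64 character.
    payload = unquote(raw)
    decoded = base64.b64decode(payload).decode("ascii")
    # quote keeps '/' by default and turns spaces into '%20'.
    return quote(decoded.replace("|", "/"))


print(decode_payload("NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY%3D"))
```

For the sample payload this prints 4/Topcon_data-download_howto.pdf, the same path the script builds before adding the 'assets/' prefix.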

Source

2015-11-02 hankivstmb

You already have it; you just need the small pieces:

The pattern r'data=([^&]+?)&quot' will match anything after data= and before &quot

>>> import re
>>> pat = r'data=([^&]+?)&quot'
>>> line = '&lt;a href=&quot;http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;&gt;attached file&lt;/a&gt;'
>>> decodeString = re.search(pat, line).group(1)  # the b64 string is captured by the group, so we only want group(1)
>>> decodeString
'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY='

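You can then use the str.replace() method together with base64.b64decode() to finish the rest. I don't want to just write the code for you, but this should give you a good idea of where to go.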
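For readers who want to see those pieces put together, here is one way it might look. This is a sketch under stated assumptions: Python 3, ASCII payloads, and the hypothetical names PREFIX and replace_links. It combines re.findall with str.replace rather than a re.sub callback.

```python
import base64
import re

PAT = r'data=([^&]+?)&quot'
PREFIX = 'getattachment.php?data='  # literal text that precedes each payload


def replace_links(line):
    """Hypothetical helper: swap every encoded link in `line` for its decoded path."""
    for encoded in re.findall(PAT, line):
        # Decode the payload and turn the '|' separator into '/'.
        decoded = base64.b64decode(encoded).decode('ascii').replace('|', '/')
        # Replace the script name plus payload with the decoded path.
        line = line.replace(PREFIX + encoded, decoded)
    return line


line = ('&lt;a href=&quot;http://blahblah.org/kb/getattachment.php?data='
        'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;&gt;attached file&lt;/a&gt;')
print(replace_links(line))
```

On the sample line this produces the link from the question, with http://blahblah.org/kb/4/Topcon_data-download_howto.pdf as the href. Lines without a match are returned unchanged.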

Source

2015-11-02 22:05:07
