2015-11-02 84 views

Find, decode and replace all base64 values in a text file

I have a SQL dump file that contains text with HTML links like:

<a href=&quot;http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>

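I want to find, decode, and replace the base64 portion of each link in the text.

I have been trying to do this in Python with regular expressions and base64. However, my regex skills are not up to the task.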

I need to select any string that starts with

'getattachment.php?data='

and ends with

'&quot;'

I then need to decode the part between 'data=' and '&quot' using base64.b64decode().

The result should look something like this:

<a href=&quot;http://blahblah.org/kb/4/Topcon_data-download_howto.pdf&quot;>attached file</a>
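For reference, here is a minimal sketch (assuming Python 3) of what the sample payload from the link decodes to. The variable names are only illustrative. It shows why the `|` separator has to become `/` to get the path in the desired link.

```python
import base64

# Sample payload copied from the link in the question
encoded = 'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY='

# b64decode returns bytes on Python 3; this payload is plain ASCII
decoded = base64.b64decode(encoded).decode('ascii')
print(decoded)

# Turning the '|' separator into '/' gives the path used in the desired link
path = decoded.replace('|', '/')
print(path)
```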

I think the solution will look something like:

import re 
import base64 
with open('phpkb_articles.sql') as f: 
    for line in f: 
        re.sub(some_regex_expression_here, some_function_here_to_decode_base64, line)

Any ideas?

Edit: here is the answer, for anyone interested.

import re
import base64
import sys


def decode_base64(match):
    """
    Decode the base64 payload captured by the regex and return the new path.
    """
    # fix escaped equal signs in some base64 strings
    base64_string = match.group(1).replace('%3D', '=')
    # b64decode returns bytes on Python 3, so convert the result to text
    decoded_string = base64.b64decode(base64_string).decode('ascii')

    # replace '|' with '/'
    decoded_string = decoded_string.replace('|', '/')

    # escape the spaces in file names
    decoded_string = decoded_string.replace(' ', '%20')

    return 'assets/' + decoded_string + '&quot'


count = 0  # number of lines that could not be decoded

pattern = r'getattachment\.php\?data=([^&]+?)&quot'

# Open the file and read line by line
with open('phpkb_articles.sql') as f:
    for line in f:
        try:
            # globally substitute in new file path
            edited_line = re.sub(pattern, decode_base64, line)
            # output the edited line to standard out
            sys.stdout.write(edited_line)
        except (TypeError, ValueError):
            # output unedited line if decoding fails to prevent corruption
            # (Python 2 raises TypeError, Python 3 raises a ValueError subclass)
            sys.stdout.write(line)
            count += 1
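To try the substitution without the dump file, the same idea can be run against a single in-memory line. This is only a sketch under stated assumptions: Python 3, quotes stored in the dump as the entity `&quot;` (as the pattern implies), and the hypothetical names `PATTERN`, `decode_match`, and `sample`.

```python
import base64
import re

# Same pattern as above: capture everything between 'data=' and '&quot'
PATTERN = r'getattachment\.php\?data=([^&]+?)&quot'


def decode_match(match):
    # Hypothetical helper mirroring the callback above
    payload = match.group(1).replace('%3D', '=')
    path = base64.b64decode(payload).decode('ascii')
    return 'assets/' + path.replace('|', '/').replace(' ', '%20') + '&quot'


# Sample line, assuming the dump escapes double quotes as &quot;
sample = ('<a href=&quot;http://blahblah.org/kb/getattachment.php?data='
          'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>')

# re.sub calls decode_match once per match and splices in its return value
result = re.sub(PATTERN, decode_match, sample)
print(result)
```

Because `re.sub` accepts a function as the replacement, every link on a line is rewritten in a single pass.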

Answers


You already have it; you just need to take it in small pieces:

Pattern: r'data=([^&]+?)&quot' matches everything after data= and before &quot

>>> import re
>>> pat = r'data=([^&]+?)&quot'
>>> line = '<a href=&quot;http://blahblah.org/kb/getattachment.php?data=NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>'
>>> decodeString = re.search(pat, line).group(1)  # the b64 string is captured by the group, so we only want group(1)
>>> decodeString 
'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=' 

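Then you can use the str.replace() method together with base64.b64decode() to do the rest. I don't want to just write the code for you, but this should give you a good idea of where to go.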
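One way the remaining steps could fit together is sketched below. It is an illustration, not the answerer's code: it assumes Python 3 and a line whose quotes are stored as `&quot;`, and the names `encoded`, `decoded`, and `new_line` are made up for the example.

```python
import base64
import re

pat = r'data=([^&]+?)&quot'
line = ('<a href=&quot;http://blahblah.org/kb/getattachment.php?data='
        'NHxUb3Bjb25fZGF0YS1kb3dubG9hZF9ob3d0by5wZGY=&quot;>attached file</a>')

# Step 1: pull out the base64 payload with the capture group
encoded = re.search(pat, line).group(1)

# Step 2: decode it (bytes on Python 3, so convert to text)
decoded = base64.b64decode(encoded).decode('ascii')

# Step 3: swap the script name plus payload for the decoded path
new_line = line.replace('getattachment.php?data=' + encoded,
                        decoded.replace('|', '/'))
print(new_line)
```

This handles only the first link on a line, since `re.search` stops at the first match. Passing a function to `re.sub` avoids that limitation.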
