2016-11-25 45 views
0

我想解析一個大文本文件,這個文件在換行符中用一個字符'//'分割。我的輸入文件是這樣的:在這個腳本中的Python正則表達式實現

ID HRPA_ECOLI    Reviewed;  130 AA. 
AC P43329; P76861; P76863; P77479; 
DE RecName: Full=ATP-dependent RNA helicase HrpA; 
DE   EC=3.6.4.13; 
GN Name=hrpA; OrderedLocusNames=b1413, JW5905; 
OS Escherichia coli (strain K12). 
OC Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; 
OC Enterobacteriaceae; Escherichia. 
OX NCBI_TaxID=83333; 
DR RefSeq; NP_415931.4; NC_000913.3. 
DR RefSeq; WP_000139543.1; NZ_LN832404.1. 
DR ProteinModelPortal; P43329; -. 
DR KEGG; ecj:JW5905; -. 
DR KEGG; eco:b1413; -. 
DR PATRIC; 32118112; VBIEscCol129921_1476. 
DR KO; K03578; -. 
DR GO; GO:0005737; C:cytoplasm; IBA:GO_Central. 
DR GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW. 
DR Gene3D; 3.40.50.300; -; 2. 
DR InterPro; IPR003593; AAA+_ATPase. 
DR InterPro; IPR011545; DEAD/DEAH_box_helicase_dom. 
DR InterPro; IPR011709; DUF1605. 
DR Pfam; PF00270; DEAD; 1. 
DR Pfam; PF11898; DUF3418; 1. 
DR SMART; SM00382; AAA; 1. 
DR SMART; SM00487; DEXDc; 1. 
DR SMART; SM00847; HA2; 1. 
DR SMART; SM00490; HELICc; 1. 
DR SUPFAM; SSF52540; SSF52540; 1. 
DR TIGRFAMs; TIGR01967; DEAH_box_HrpA; 1. 
DR PROSITE; PS51192; HELICASE_ATP_BIND_1; 1. 
DR PROSITE; PS51194; HELICASE_CTER; 1. 
PE 3: Inferred from homology; 
KW ATP-binding; Complete proteome; Helicase; Hydrolase; 
KW Nucleotide-binding; Reference proteome. 
FT CHAIN   1 1300  ATP-dependent RNA helicase HrpA. 
FT        /FTId=PRO_0000055178. 
FT DOMAIN  87 250  Helicase ATP-binding. 
FT        {ECO:0000255|PROSITE-ProRule:PRU00541}. 
FT DOMAIN  274 444  Helicase C-terminal. 
SQ SEQUENCE 1300 AA; 149028 MW; A26601266D771638 CRC64; 
    MTEQQKLTFT ALQQRLDSLM LRDRLRFSRR LHGVKKVKNP DAQQAIFQEM AKEIDQAAGK 
    VLLREAARPE ITYPDNLPVS QKKQDILEAI RDHQVVIVAG ETGSGKTTQL PKICMELGRG 
    IKGLIGHTQP 
// 
ID T1RK_ECOLI    Reviewed;  1170 AA. 
AC P08956; Q2M5W6; 
DT 01-NOV-1988, integrated into UniProtKB/Swiss-Prot. 
DT 24-NOV-2009, sequence version 3. 

我也有一個id.txt文件,其中的每一行都有一個唯一的ID,如:

NP_415931.4 
... 

我要與輸入文件中的每個ID匹配,並且如果匹配,我想用正則表達式(使用輸入文件的特定段)提取某些信息並將它們保存在輸出csv文件中。例如,對於一個匹配字符 「GO:[0-9]」,我想出了:

#!/usr/bin/env python 
import re 
import pdb 

def peon(DATA, LIST, OUTPUT, sentinel = '\n//', pattern = re.compile('GO:[0-9]+')): 

    data = DATA.read() 
    for item in LIST: 
     find_me = item.strip()   
     j = 0 
     while True:   
      i = data.find(find_me, j) 
      if i < 0: 
       break 
      j = data.find(sentinel, i) 
      if j < 0: 
       j = len(data) 
      result = pattern.findall(data[i:j]) 
     OUTPUT.write('{}\t{}\n'.format(find_me, ', '.join(result))) 

def main(dataname, listname, outputname): 
    with open(dataname, 'rt') as DATA: 
     with open(listname, 'rt') as LIST: 
      with open(outputname, 'wt') as OUTPUT: 
       peon(DATA, LIST, OUTPUT) 

if __name__ == '__main__': 
    main('./input_file.txt', './id.txt', './output.csv') 

,這讓我像輸出:

NP_415931.4 GO:0005737, GO:0005524 

現在,這是我想要的字符匹配是(編號<>標題<>描述),

1 RefSeq_ID As given in id.txt file 
2 AA_Length In the line that starts with "ID" & ends with "AA." 
3 Protein_Name After "RecName: Full=" 
4 EC_Number After "EC=" 
5 Organism In the line that starts with "OS" 
6 NCBI_Taxid_ID After "NCBI_TaxID=" 
7 KEGG_ID After "KEGG;" 
8 KO_ID After "KO;" 
9 GO_ID As ''GO:[NUMBER]" 
10 InterPro_ID After "InterPro;" 
11 InterPro_Description After InterPro_ID , i.e, after 10 
12 Pfam_ID After "Pfam;" 
13 Pfam_Description After Pfam_ID, i.e, after 12 
14 PROSITE_ID After "PROSITE;" 
15 PROSITE_Description After PROSITE_ID, i.e, after 14 

我還附上一個PIC更好澄清:

input file

我想同時提取所有這些字符並將它們保存在帶有特定標題的輸出csv文件中。我改變正則表達式等之後提取,例如,「AA_Length」:

pattern = re.compile('[0-9]+ AA.') 

,它給:

NP_415931.4 130 AA; 

但它不完全是我所需要的模式。另外,我不確定前後匹配的正則表達式以及如何在單個腳本中實現它們。

如何在單個腳本中搜索所有這些模式並將輸出(帶標題)保存在csv文件中?

謝謝

PS:我想最終輸出的csv要看起來像:

output file

我的excel表是在這裏:https://sites.google.com/site/iicbbioinformatics/share

+0

從列表中的每個數據的相同位置設置的ID? –

+0

號碼.id可以是隨機的 –

回答

1

您的數據似乎沒有一個固定結構。

如何

  • 閱讀DATA
  • split他們在"\n//"
  • 檢查每個數據從列表
  • 使用一個正則表達式爲每個所需的ID爲ID設置
  • 以適當的順序將ID寫入CSV文件

基本上是:

import re, csv 

with open('./input_file.txt') as dfile: 
    DATA = dfile.read() 

with open('./id.txt') as lfile: 
    IDS = lfile.read().split('\n') 

headers = ['RefSeq_ID', 
      'AA_Length', 
      'Protein_Name', 
      'EC_Number', 
      'Organism', 
      'NCBI_Taxid_ID', 
      'KEGG_ID', 
      'KO_ID', 
      'GO_ID', 
      'InterPro_ID', 
      'InterPro_Description', 
      'Pfam_ID', 
      'Pfam_Description', 
      'PROSITE_ID', 
      'PROSITE_Description' 
      ] 

ofile = open('./output.csv', 'w') 
csvfile = csv.DictWriter(ofile, headers) 

csvfile.writeheader() 

for DATASET in DATA.split('\n//'): 
    found_ids = {'RefSeq_ID': ""} 

    for RefSeq_ID in IDS: 
     if RefSeq_ID in DATASET: 
      found_ids['RefSeq_ID'] = RefSeq_ID 
      break 
    if not found_ids['RefSeq_ID']: 
     continue 

    found_ids['AA_Length'] = ", ".join(re.findall('^ID.+\s+(\d+) AA\.$', DATASET, re.MULTILINE)) 
    found_ids['Protein_Name'] = ", ".join(re.findall('RecName: Full=(.+);', DATASET)) 
    found_ids['EC_Number'] = ", ".join(re.findall('EC=([\d\.]+);', DATASET)) 
    found_ids['Organism'] = ", ".join(re.findall('^OS\s+(.*)\.$', DATASET, re.MULTILINE)) 
    found_ids['NCBI_Taxid_ID'] = ", ".join(re.findall('NCBI_TaxID=(\d+);', DATASET)) 
    found_ids['KEGG_ID'] = ", ".join(re.findall('KEGG; (\w+:\w+\d+);', DATASET)) 
    found_ids['KO_ID'] = ", ".join(re.findall('KO; (K\d+);', DATASET)) 
    found_ids['GO_ID'] = ", ".join(re.findall('GO; (GO:\d+);', DATASET)) 
    found_ids['InterPro_ID'] = ", ".join(re.findall('InterPro; (IPR\d+);', DATASET)) 
    found_ids['InterPro_Description'] = ", ".join(re.findall('InterPro;.*?;(.*)\.', DATASET)) 
    found_ids['Pfam_ID'] = " ".join(re.findall('Pfam; (PF\d+);', DATASET)) 
    found_ids['Pfam_Description'] = ", ".join(re.findall('Pfam; PF\d+; (.*?);', DATASET)) 
    found_ids['PROSITE_ID'] = ", ".join(re.findall('PROSITE; (PS\d+);', DATASET)) 
    found_ids['PROSITE_Description'] = ", ".join(re.findall('PROSITE; PS\d+; (.*?);', DATASET)) 

    csvfile.writerow(found_ids) 

ofile.close() 
+0

請問你能否幫助正則表達式? –

+0

@ J.Carter:在你的Excel截圖中,你部分模糊了你的解析數據,所以我必須猜測。 –

+0

這裏是我的Excel表https://sites.google.com/site/iicbbioinformatics/share –