2017-04-21 141 views
3

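How to edit a text (.fastq) file in Python

I have a file like the small example below. Every 4 lines belong to one ID, and the second line of each ID starts with N. I want to remove that N from the start of the line and leave everything else unchanged. I want to do this in Python. Do you know how to do it?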

For example:

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50 
NGCGACCTCAGATCAGACGTGGCGACC 
+SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50 
#<<ABGGGGGGGGGGGGGGGGGGGGGG 
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50 
NGCCGACATCGAAGGATCAA 
+SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50 
#<<ABFGGGGGGGGGGGGGG 
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50 
NACAAACCCTTGTGTCGAGGGC 
+SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50 
#=ABBGGGGGGGGGGGGGGGGG 
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50 
NGGGACATGACAGCCTGGACCATCG 
+SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50 
#=ABBGGGGGGGGGGGGGGGGGGGG 

Output:

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50 
GCGACCTCAGATCAGACGTGGCGACC 
+SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50 
#<<ABGGGGGGGGGGGGGGGGGGGGGG 
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50 
GCCGACATCGAAGGATCAA 
+SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50 
#<<ABFGGGGGGGGGGGGGG 
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50 
ACAAACCCTTGTGTCGAGGGC 
+SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50 
#=ABBGGGGGGGGGGGGGGGGG 
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50 
GGGACATGACAGCCTGGACCATCG 
+SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50 
#=ABBGGGGGGGGGGGGGGGGGGGG 
+1

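Note that to get valid fastq format, you also need to remove the first character of the quality line. What you are asking for would not preserve the correspondence between bases and qualities. – bli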

Answer

4

If I did exactly what you asked (remove the leading N from each sequence), it would leave the FASTQ file in an inconsistent state.

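Every fourth line of a FASTQ file contains the quality values for the sequence two lines earlier. So if you remove the first character from the sequence, you also need to remove the first character from the line with the quality values.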
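To make the pairing concrete, here is a minimal sketch that checks whether each record's sequence and quality strings have the same length. It assumes strict four-line records (no sequences wrapped over several lines); the helper names `iter_fastq_records` and `mismatched_records` are made up for this illustration and are not part of any library.

```python
from itertools import islice


def iter_fastq_records(lines):
    """Yield (header, sequence, separator, quality) tuples.

    Assumes every record occupies exactly four lines; blank lines are skipped.
    """
    cleaned = (line.rstrip() for line in lines if line.strip())
    while True:
        record = tuple(islice(cleaned, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record: %r" % (record,))
        yield record


def mismatched_records(lines):
    """Return the headers of records whose sequence and quality lengths differ."""
    return [
        header
        for header, seq, _sep, qual in iter_fastq_records(lines)
        if len(seq) != len(qual)
    ]
```

Calling `mismatched_records(open("example.fastq"))` on a file where only the N was removed from the sequence lines would list every edited record, because each of them now has one quality value too many.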

You could do something very simple in pure Python, like this:

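# Drop the first character of every second line, i.e. of both the
# sequence line and the quality line of each four-line record.
with open("example.fastq") as f:
    for idx, line in enumerate(f):
        line = line.rstrip("\n")
        if idx % 2:
            print(line[1:])
        else:
            print(line)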

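However, if you are going to work with biological data properly, you really should start using a bioinformatics module like BioPython. It will warn you if you try to do something that would leave the file in an inconsistent shape or that does not make sense.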

The solution then looks like this:

from Bio import Seq, SeqIO

new_records = []
for record in SeqIO.parse("example.fastq", "fastq"):
    sequence = str(record.seq)
    letter_annotations = record.letter_annotations

    # You first need to empty the existing letter annotations
    record.letter_annotations = {}

    # Drop the first base of the sequence
    new_sequence = sequence[1:]
    record.seq = Seq.Seq(new_sequence)

    # Drop the matching first quality value
    new_letter_annotations = {'phred_quality': letter_annotations['phred_quality'][1:]}
    record.letter_annotations = new_letter_annotations

    new_records.append(record)

with open('without_starting_N.fastq', 'w') as output_handle:
    SeqIO.write(new_records, output_handle, "fastq")

This outputs the following; note every third line:

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50 
GCGACCTCAGATCAGACGTGGCGACC 
+ 
<<ABGGGGGGGGGGGGGGGGGGGGGG 
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50 
GCCGACATCGAAGGATCAA 
+ 
<<ABFGGGGGGGGGGGGGG 
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50 
ACAAACCCTTGTGTCGAGGGC 
+ 
=ABBGGGGGGGGGGGGGGGGG 
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50 
GGGACATGACAGCCTGGACCATCG 
+ 
=ABBGGGGGGGGGGGGGGGGGGGG 

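(That is, the "+" character may optionally be followed by the same sequence identifier and description as on the line two lines above it; here it is written without them.)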
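If the third line should be kept exactly as it was, a record-wise pure-Python variant is one option. This is only a sketch under the assumption that every record occupies exactly four lines; `trim_leading_n` and `trim_fastq_file` are hypothetical names chosen for this example. Unlike the simple loop earlier in this answer, it trims a record only when its sequence really starts with N, and it removes the matching quality character at the same time.

```python
def trim_leading_n(lines):
    """Yield FASTQ lines with a leading N and its quality value removed.

    Assumes strict four-line records; blank lines are skipped and
    trailing whitespace is stripped from every line.
    """
    record = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        record.append(line)
        if len(record) == 4:
            header, seq, sep, qual = record
            if seq.startswith("N"):
                # Remove the base and its quality value together
                seq, qual = seq[1:], qual[1:]
            yield from (header, seq, sep, qual)
            record = []
    if record:
        raise ValueError("truncated FASTQ record: %r" % (record,))


def trim_fastq_file(in_path, out_path):
    """Write a trimmed copy of in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in trim_leading_n(src):
            dst.write(line + "\n")
```

For instance, `trim_fastq_file("example.fastq", "without_starting_N.fastq")` would write the trimmed records while leaving the header and "+" lines as they were in the input.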