我需要從另一個文件(例如,'file2.fasta')更改爲相應頭文件的一個文件(例如'file1.fasta')中排序標頭。注意:1)儘管file1.fasta有一些從file2.fasta反向補充的序列,但我希望不修改序列。 2)file1.fasta序列來自不同的來源,這意味着標題顯示各種格式;我針對的只是幾種格式的修改。使用另一個文件中的文本替換一個文件中的文本
下面是示例file2.fasta報頭:
實施例的 所有各種報頭格式在file1.fasta>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT
(那些針對修改是前兩個報頭):
>Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_103_rc
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>Anasa_tristis_comp3229_c0_seq1_0_rc
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT
>ENSOFAS009761_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS009761,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig5129
TTAAGAATCTCGAGAAAACCCCTCAGGATGATGAATTACTTGAAATATATGCTCTCTATAAACAAGCAACTGTAGGAGACTGTGACACAAGTAAGCCTGGGATGTTTGATTTCAAAGGGA1
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_B_0
TAACGTCCTAGGTTAGGTTTCTGTTTACCAGCTAAAATCTTGAGGGCTGTAGACTTTCCAATGCCATTAGTTCCAACCAGACCTAAAACTTCTCCTGGTCTTGGAATTGGAAGTCTGTGG
最後兩個與目標類似,但有一個額外的下劃線和一個字母。這些需要保持不變。任何以>uce
和>ENSOFAS
開頭的標題都應該單獨保留。然後,新修改的file1.fasta文件應該是這樣的:
>OFAS009268-RA-EXON07 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS009268-RA-EXON07,probes-probe:,probes-source:Clavigralla_tomentosicollis_gi_512427643_gb_GAJX01006991.1_OFAS009268-RA-EXON07
CATTGCAGCAACTAACAGAGTTGATATATTAGATCCAGCCCTTCTCCGATCAGGCAGGCTAGACAGAAAAATTGAATTTCCTCATCCAAATGAAGATGCCCGTGCTCGAATTATGCAAAT
>OFAS016134-RA-EXON02 |design:coreoidea-v1,designer:forthman,probes-locus:OFAS016134-RA-EXON02,probes-probe:,probes-source:Anasa_tristis_comp3229_c0_seq1_OFAS016134-RA-EXON02
AGGGCTTGTGATTCCCTTGAGCACATCGCAAGCCTCTGTTCTAGACAAAACATTCCACATTTGGTCAATAATGCTTTTGGTTTGCAAAGTGCACGTCTCATGCATTTAATTCAAGAGGCT
>ENSOFAS009761_p2 |design:coreoidea-v1,designer:forthman,probes-locus:ENSOFAS009761,probes-probe:2,probes-source:Anoplocnemis_curvipes_contig5129
TTAAGAATCTCGAGAAAACCCCTCAGGATGATGAATTACTTGAAATATATGCTCTCTATAAACAAGCAACTGTAGGAGACTGTGACACAAGTAAGCCTGGGATGTTTGATTTCAAAGGGA1
>uce-3225_p7 |design:hemiptera-v1,designer:faircloth,probes-locus:uce-3225,probes-probe:7,probes-source:halhal1,probes-global-chromo:Scaffold629,probes-global-start:410155,probes-global-end:410275,probes-local-start:0,probes-local-end:120
AAATCCATCAAGAAATACCAACAACAACTTAAGGATGTCCAGACCGCACTCGAGGAAGAACAAAGAGCTAGGGATGATGCCCGAGAACAACTTGGTATTGCCGAAAGGCGAGCCAACGCT
>Anasa_tristis_comp8051_c0_seq1_A_0
ATCCTCCTGATTGGGCAGAAATTTTGAACCATTTTCGAGGGTCTGAACTTCAGAATTATTTTACAAAAATTTTGGAGGATGACCTTAAAGCCCTTATCAAGCCTCAGTATGTCGACCAAA
>Anasa_tristis_comp8051_c0_seq1_B_0
TAACGTCCTAGGTTAGGTTTCTGTTTACCAGCTAAAATCTTGAGGGCTGTAGACTTTCCAATGCCATTAGTTCCAACCAGACCTAAAACTTCTCCTGGTCTTGGAATTGGAAGTCTGTGG
我有一個python腳本有人前提是我用了類似的情況(但對於不同格式的頭)。我對python語言並不熟悉,並且好奇這個腳本是否可以修改用於這個新目的。
#!/usr/bin/env python
import sys
import re
original_fn = sys.argv[1]
company_fn = sys.argv[2]
pattern = '(uce | ENSOFAS | _[AB]_[0-9]+$)'
map = {}
with open(original_fn, "r") as original_fh:
for line in original_fh:
if line.startswith('>'):
try:
(k, v) = line.strip().split('|')
# remove trailing space from key
k = k[:-1]
map[k] = v
except ValueError as err:
k = line.strip()
map[k] = None
with open(company_fn, "r") as company_fh:
for line in company_fh:
if line.startswith('>') and not re.search(pattern, line.strip()):
try:
(k, v) = line.strip().split('|')
# remove trailing character from key
k = k[:-1]
except ValueError as err:
k = line.strip()
if k not in map:
sys.stdout.write("%s\n" % (k))
else:
sys.stdout.write("%s |%s\n" % (k, map[k]))
else:
sys.stdout.write("%s" % (line))
請輸入文件的一個例子,從輸入文件的輸出。目前,根據您提供的內容,雖然非常詳細,但實際上很難看到您使用的是什麼。 –
我看不到我如何附加示例文件,但帖子中有這些示例。 –
你是說,>收到標題並且遺傳密碼包含在同一個文件中。我不認爲它很清楚,因此爲什麼沒有人回答你迄今爲止的答案。我認爲它是正則表達式問題的一個很好的例子。 –