Is there a way to remove duplicate strings within a string based on a pattern?

I am working with a file in this format:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true 


=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

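As you can see, every SPEC line is different, except in two places where the spectrum number is repeated. What I would like to do is take each block of information between the =Cluster= patterns and check whether any lines have a repeated spectrum value. If several lines are duplicated, all but one should be removed.

The output file should look like this: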

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true 


=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

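I used groupby from the itertools module. I assumed my input file is called f_input.txt and the output file is called new_file.txt, but this script also deleted the word SPEC, and I don't know what I can change to stop it doing that.

Edit: new condition. Sometimes part of the line may change, for example: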

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 
SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

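As you can see, the PRD number has changed in the last line. One solution would be to check the spectrum number and remove lines based on repeated spectrum values.

This would be the result: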

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

2017-02-24 Enrique

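Are you asking why your code doesn't work, or for any code that works? –

You could try iterating over the whole file and checking it line by line: i = file.read().split('\n'); then, whenever i[1] also appears in another line such as i[2] or i[3], delete it, and do this for every element of the split string in turn. But that would be a lot of code. I bet there is a nicer solution! –

Your code works fine, I don't see any problem –

Shortest solution in Python :P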

import os 
os.system("""awk 'line != $0; { line = $0 }' originalfile.txt > dedup.txt""")

Output:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

(If you are on Windows, AWK can easily be installed with Gow.)
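For readers without awk, the same idea can be sketched in plain Python. This is only a rough equivalent of the one-liner above, not a drop-in replacement; the function name drop_consecutive_duplicates and the sample lines are made up for illustration. Like the awk version, it compares each line only with the line directly before it.

```python
def drop_consecutive_duplicates(lines):
    """Yield each line unless it is identical to the line just before it."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


# Hypothetical sample: the last line repeats an earlier one, but not the
# line directly before it, so it is NOT removed.
sample = [
    "=Cluster=\n",
    "SPEC A;spectrum=3785 true\n",
    "SPEC A;spectrum=3785 true\n",
    "SPEC A;spectrum=1876 true\n",
    "SPEC A;spectrum=3785 true\n",
]
result = list(drop_consecutive_duplicates(sample))
```

Because only adjacent lines are compared, a duplicate that is separated from its twin by another line survives, and lines that differ in the PRD number but share a spectrum number are treated as different.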

2017-02-24 16:21:31

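Very easy solution. Thanks! – Enrique

Note that this trick only works when the duplicates are consecutive. –

This opens the file containing the original data, along with a new file to which the unique lines of each group are written.

seen is a set, which is well suited to checking whether something already exists.

data is a list and keeps track of the iteration over the "=Cluster=" groups.

Then you simply look at every line of each group (each group is called i within data).

If the line is not in seen, it is added.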

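with open("input file", 'r') as in_file, open("output file", 'w') as out_file:
    # Split the whole file into groups at each "=Cluster=" marker.
    # Note: this reads the entire file into memory.
    head, *data = in_file.read().split("=Cluster=")
    out_file.write(head)  # anything before the first marker
    for i in data:
        out_file.write("=Cluster=")
        seen = set()  # reset for every group
        for line in i.splitlines(True):  # True keeps the line endings
            key = line.strip()
            if key and key in seen:
                continue  # duplicate line within this group
            seen.add(key)
            out_file.write(line)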

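Edit: moved seen = set() inside for i in data so that the set is reset for every group; otherwise entries from an earlier group (such as "=Cluster=" itself) would stay in the set and would not be printed for later groups in data.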

2017-02-24 15:13:52 pstatix

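Yes, it looks cool, but have you tried the code? –

You have to reset the 'seen' set. –

@Ev. Kounis I was updating it just as you posted this. I realized I was wrong! – pstatix

This is how I would do it.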

file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()
    for line in f_in:
        if '=Cluster=' in line or line.strip() == '':
            # Start of a new cluster (or a blank line): forget what was seen.
            seen_spectra = set()
            f_out.write(line)
        else:
            # "...;spectrum=1074 true" -> "1074"
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                f_out.write(line)
                seen_spectra.add(new_spectrum)

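This is not a groupby solution, but it is one you can easily follow and debug if you need to. As you mentioned in the comments, your file is 16 GB, and loading it into memory is probably not the best idea.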
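For comparison, a groupby-based variant can also stream the file line by line. The following is only a sketch under assumptions: the function name dedupe_clusters, the SPECTRUM pattern and the sample lines are illustrative, and duplicates are judged by the spectrum number alone, within each cluster.

```python
import re
from itertools import groupby

# Assumed line layout: "...;spectrum=<digits> true"
SPECTRUM = re.compile(r"spectrum=(\d+)")


def dedupe_clusters(lines):
    """Yield lines, dropping repeated spectrum numbers within each cluster."""
    # groupby splits the stream into alternating runs: marker lines
    # (key True) and the lines that follow them (key False).
    for is_marker, run in groupby(lines, key=lambda l: l.startswith("=Cluster=")):
        if is_marker:
            yield from run
            continue
        seen = set()  # fresh set for every cluster body
        for line in run:
            match = SPECTRUM.search(line)
            if match is None:
                yield line  # blank or unrecognised line: pass through
            elif match.group(1) not in seen:
                seen.add(match.group(1))
                yield line  # the whole line is kept, including "SPEC"


# Hypothetical sample: the PRD number differs, the spectrum number repeats.
sample = [
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n",
    "SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n",
    "\n",
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=3785 true\n",
]
deduped = list(dedupe_clusters(sample))
```

Since the function accepts any iterable of lines, it could be used on open files with out_file.writelines(dedupe_clusters(in_file)), which never holds more than one line plus the current cluster's set of spectrum numbers in memory.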

EDIT: "Each cluster has a specific spectrum. It is not possible to have one spec in one cluster and the same in another"

file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()  # shared across all clusters, never reset
    for line in f_in:
        if line.startswith('SPEC'):
            # "...;spectrum=1074 true" -> "1074"
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                seen_spectra.add(new_spectrum)
                f_out.write(line)
        else:
            f_out.write(line)

2017-02-24 15:17:36

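Yes. Your code works perfectly. Thanks! – Enrique

Hi Ev. Kounis. I just talked to my supervisor, and he said the pattern I look for inside =Cluster= should be spectrum=number, because the other numbers (for example PRD0013 and PRD0014) can change but the spectrum number cannot, so the script would not treat such lines as duplicates. How could I change your script to take the spectrum part into account? – Enrique

@Enrique I'm afraid I don't understand.. –

A solution using the re.search() function and a custom spectrums set object to keep only the unique spectrum numbers: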

import re

with open('f_input.txt') as oldfile, open('new_file.txt', 'w') as newfile:
    spectrums = set()  # spectrum numbers already written
    for line in oldfile:
        if '=Cluster=' in line or not line.strip():
            newfile.write(line)
        else:
            m = re.search(r'spectrum=(\d+)', line)
            if m is None:
                # No spectrum number on this line: copy it through unchanged.
                newfile.write(line)
                continue
            spectrum = m.group(1)
            if spectrum not in spectrums:
                spectrums.add(spectrum)
                newfile.write(line)
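To see what re.search() does on lines with and without a spectrum number, here is a small sketch. The helper name spectrum_of is hypothetical and exists only for this illustration. When the pattern does not occur in the line, re.search() returns None, and calling .group() on None raises AttributeError, so the result has to be checked first.

```python
import re


def spectrum_of(line):
    """Return the spectrum number in `line` as a string, or None if absent."""
    m = re.search(r"spectrum=(\d+)", line)
    return m.group(1) if m else None


with_number = spectrum_of(
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true"
)
without_number = spectrum_of("SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml true")
print(with_number)     # 3785
print(without_number)  # None
```

The captured value is a string, which is fine for membership tests in a set as long as the same representation is used consistently.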

2017-02-24 15:33:11 RomanPerekhrest

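I got this error: AttributeError: 'NoneType' object has no attribute 'group' – Enrique

@Enrique, what is the point? You have already accepted his answer – RomanPerekhrest

I am comparing several solutions to see which one is the most efficient. – Enrique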
