Compare values from the first dictionary with keys from the second dictionary

I have a large database file (let's call it db.csv) that contains a lot of information.

A simplified view of the database file, to illustrate:

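I ran usearch61 -cluster_fast on my gene sequences in order to cluster them.
I got a file named "clusters.uc". I opened it as a csv, then wrote some code to create a dictionary (say dict_1) that has my cluster numbers as keys and my gene_ids (VFG...) as values.
Here is an example of what I made and then stored in a file: dict_1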

0 ['VFG003386', 'VFG034084', 'VFG003381'] 
1 ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636'] 
2 ['VFG018349', 'VFG018485', 'VFG043567'] 
... 
14471 ['VFG015743', 'VFG002143']

So far so good. Then, using db.csv, I made another dictionary (dict_2) where the gene_id (VFG...) is the key and the VF_Accession (IA... or CVF.. or VF...) is the value. Illustration: dict_2

VFG044259 IA027 
VFG044258 IA027 
VFG011941 CVF397 
VFG012016 CVF399 
...

What I want in the end is, for each VF_Accession, the numbers of its cluster groups. Illustration:

IA027 [0,5,6,8] 
CVF399 [15, 1025, 1562, 1712] 
...

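Since I'm still a beginner at coding, I think I need to write code that compares the values from dict_1 (VFG...) with the keys from dict_2 (VFG...). If they match, the VF_Accession becomes a key and all the matching cluster numbers become its value. Since VF_Accession is a key and keys cannot be duplicated, I need a dictionary of lists. I think I can manage that part, because I already did it for dict_1. My problem is that I can't figure out a way to compare the values in dict_1 with the keys in dict_2 and then assign the cluster numbers to each VF_Accession. Please help me.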

Source

2017-07-19 rookie max

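I don't know much about biology - can the same gene_id (VFG) appear in multiple clusters?

Yes, unfortunately some of them do. Maybe something like IA027 [0 | 12, 5, 6, 8] or IA027 [0 (12), 5, 6, 8] would work.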

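First, let's give your dictionaries better names than dict_1, dict_2, ... That makes them easier to work with and makes it easier to remember what they contain.

You first create a dictionary that has cluster numbers as keys and lists of gene_ids (VFG...) as values: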

cluster_nr_to_gene_ids = {0: ['VFG003386', 'VFG034084', 'VFG003381', 'VFG044259'], 
          1: ['VFG000838', 'VFG000630', 'VFG035932', 'VFG000636'], 
          2: ['VFG018349', 'VFG018485', 'VFG043567', 'VFG012016'], 
          5: ['VFG011941'], 
          7949: ['VFG003386'],        
          14471: ['VFG015743', 'VFG002143', 'VFG012016']}

You also have another dictionary, where the gene_ids are the keys and the VF_Accessions (IA... or CVF.. or VF...) are the values:

gene_id_to_vf_accession = {'VFG044259': 'IA027', 
          'VFG044258': 'IA027', 
          'VFG011941': 'CVF397', 
          'VFG012016': 'CVF399', 
          'VFG000676': 'VF0142', 
          'VFG002231': 'VF0369', 
          'VFG003386': 'CVF051'}

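We want to create a dictionary where each VF_Accession is a key and its value is the list of cluster group numbers: vf_accession_to_cluster_groups.

Note also that one VF Accession can belong to multiple gene IDs (e.g. the VF Accession IA027 has both the VFG044259 and VFG044258 gene IDs).

So we use a defaultdict to build a dictionary with the VF Accession as key and a list of gene IDs as value: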

from collections import defaultdict 
vf_accession_to_gene_ids = defaultdict(list) 
for gene_id, vf_accession in gene_id_to_vf_accession.items(): 
    vf_accession_to_gene_ids[vf_accession].append(gene_id)

For the sample data I posted above, vf_accession_to_gene_ids now looks like:

defaultdict(<class 'list'>, {'VF0142': ['VFG000676'], 
          'CVF051': ['VFG003386'], 
          'IA027': ['VFG044258', 'VFG044259'], 
          'CVF399': ['VFG012016'], 
          'CVF397': ['VFG011941'], 
          'VF0369': ['VFG002231']})

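Now we can loop over each VF Accession and look up its list of gene IDs. Then, for each gene ID, we loop over every cluster and check whether the gene ID is present in it: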

vf_accession_to_cluster_groups = {}
for vf_accession in vf_accession_to_gene_ids:
    gene_ids = vf_accession_to_gene_ids[vf_accession]
    cluster_group = []
    for gene_id in gene_ids:
        for cluster_nr in cluster_nr_to_gene_ids:
            if gene_id in cluster_nr_to_gene_ids[cluster_nr]:
                cluster_group.append(cluster_nr)
    vf_accession_to_cluster_groups[vf_accession] = cluster_group

The final result for the sample data above is now:

{'VF0142': [], 
'CVF051': [0, 7949], 
'IA027': [0], 
'CVF399': [2, 14471], 
'CVF397': [5], 
'VF0369': []}
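
The nested loop above scans every cluster for every gene ID, which may get slow with thousands of clusters. One possible variation, offered only as a sketch, is to invert the cluster dictionary once and collect the cluster numbers in sets, so that a cluster is listed only once even when several gene IDs of the same VF Accession fall into it. The helper name group_clusters_by_accession is made up for illustration, and the data is a small subset of the sample data above.

```python
from collections import defaultdict


def group_clusters_by_accession(cluster_nr_to_gene_ids, gene_id_to_vf_accession):
    """Return {vf_accession: sorted cluster numbers} (hypothetical helper)."""
    # Invert the first dictionary once: gene ID -> set of cluster numbers.
    gene_id_to_cluster_nrs = defaultdict(set)
    for cluster_nr, gene_ids in cluster_nr_to_gene_ids.items():
        for gene_id in gene_ids:
            gene_id_to_cluster_nrs[gene_id].add(cluster_nr)

    # Collect the cluster numbers of every gene ID under its VF Accession.
    accession_to_cluster_nrs = defaultdict(set)
    for gene_id, vf_accession in gene_id_to_vf_accession.items():
        # .get() returns an empty set for gene IDs that are in no cluster.
        accession_to_cluster_nrs[vf_accession] |= gene_id_to_cluster_nrs.get(gene_id, set())

    return {acc: sorted(nrs) for acc, nrs in accession_to_cluster_nrs.items()}


# Small subset of the sample data from this answer.
clusters = {0: ['VFG003386', 'VFG044259'], 5: ['VFG011941'], 7949: ['VFG003386']}
accessions = {'VFG044259': 'IA027', 'VFG044258': 'IA027',
              'VFG011941': 'CVF397', 'VFG003386': 'CVF051', 'VFG000676': 'VF0142'}
result = group_clusters_by_accession(clusters, accessions)
print(result)
```

Whether the sorted, de-duplicated lists are what you want depends on how you use the result; if you need to know how often a cluster occurs, plain lists as in the loop above keep that information.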

Source

2017-07-19 10:53:27 BioGeek

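I really, really appreciate your help, but if you could help me a bit more, there is still a problem: in cluster_nr_to_gene_ids, the same gene_id can have multiple cluster numbers. Illustration: 0 ['VFG003386'] 7949 ['VFG003386'], so the vf_accession should contain both cluster groups, CVF051 ['0,7949'], but it only gives me one: CVF051 [0]

@rookiemax, my code does work when a gene ID is in multiple clusters; see my sample data, which I updated with the example you provided. Check whether you are doing something wrong, or else you will need to provide a more complete dataset so we can see where things go wrong. – BioGeek

You are right, I did something wrong, my bad. After I deleted one line of code it worked perfectly :D I'm really grateful, thank you for your help :D I really appreciate it :D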

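Caveat: I don't do much Python development, so there may be a better way to do this. You can first map your VFG... gene_ids to their cluster numbers, and then use that mapping to process the second dictionary: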

from collections import defaultdict 
import sys 
import ast 

# see https://stackoverflow.com/questions/960733/python-creating-a-dictionary-of-lists 
vfg_cluster_map = defaultdict(list) 

# map all of the vfg... keys to their cluster numbers first 
with open(sys.argv[1], 'r') as dict_1:
    for line in dict_1:
        # split the line at the first space to separate the cluster number and gene ID list
        # e.g. after splitting the line "0 ['VFG003386', 'VFG034084', 'VFG003381']",
        # cluster_group_num holds "0", and vfg_list holds "['VFG003386', 'VFG034084', 'VFG003381']"
        cluster_group_num, vfg_list = line.strip().split(' ', 1)
        cluster_group_num = int(cluster_group_num)

        # convert "['VFG...', 'VFG...']" from a string to an actual list
        vfg_list = ast.literal_eval(vfg_list)
        for vfg in vfg_list:
            vfg_cluster_map[vfg].append(cluster_group_num)

# you now have a dictionary mapping gene IDs to the clusters they 
# appear in, e.g 
# {'VFG003386': [0], 
# 'VFG034084': [0], 
# ...} 
# you can look in that dictionary to find the cluster numbers corresponding 
# to your vfg... keys in dict_2 and add them to the list for that vf_accession 
vf_accession_cluster_map = defaultdict(list) 
with open(sys.argv[2], 'r') as dict_2:
    for line in dict_2:
        vfg, vf_accession = line.strip().split(' ')

        # add the list of cluster numbers corresponding to this vfg... to
        # the list of cluster numbers corresponding to this vf_accession
        vf_accession_cluster_map[vf_accession].extend(vfg_cluster_map[vfg])

# print() with parentheses works on both Python 2 and Python 3
for vf_accession, cluster_list in vf_accession_cluster_map.items():
    print(vf_accession + ' ' + str(cluster_list))

Then save the script above and invoke it like python <script name> dict1_file dict2_file > output (or you could write the strings to a file instead of printing them and redirecting).
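
If you would rather write the result to a file from within the script than redirect standard output, a small sketch could look like the following. The function name write_cluster_map and the output file name are placeholders, and the line format simply mirrors what the script above prints.

```python
import os
import tempfile


def write_cluster_map(cluster_map, path):
    """Write one 'vf_accession [cluster, ...]' line per entry (hypothetical helper)."""
    with open(path, 'w') as out:
        for vf_accession, cluster_list in cluster_map.items():
            out.write('{} {}\n'.format(vf_accession, cluster_list))


# Two entries taken from the sample data, written to a temporary directory.
example_map = {'CVF051': [0, 7949], 'CVF397': [5]}
output_path = os.path.join(tempfile.mkdtemp(), 'vf_accession_clusters.txt')
write_cluster_map(example_map, output_path)

# Read the file back to check what was written.
with open(output_path) as written:
    lines = written.read().splitlines()
print(lines)
```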

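Edit: after seeing @BioGeek's answer, I should note that it would make more sense to handle all of this in one go, rather than creating the dict_1 and dict_2 files, reading them back in, parsing the lines back into numbers and lists, and so on. If you don't need to write the dictionaries to a file first, you can add your other code to the script and use the dictionaries directly.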
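
As a rough sketch of what "one go" could look like, the function below reads the clustering output and the database rows directly and never writes the intermediate dictionaries to disk. It rests on several assumptions that you would need to check against your own files: that clusters.uc is tab-separated with the record type in field 1, the cluster number in field 2 and the sequence label in field 9; that the labels are bare gene IDs; and that db.csv has columns named gene_id and VF_Accession (those column names are invented here). The sample text is made up purely to exercise the function.

```python
import csv
import io
from collections import defaultdict


def clusters_by_accession(uc_lines, db_rows):
    """Map each VF_Accession to the sorted cluster numbers of its gene IDs."""
    # gene ID -> cluster numbers, read from the .uc records.
    # Assumption: record type in field 1, cluster number in field 2,
    # sequence label in field 9, all tab-separated.
    gene_id_to_clusters = defaultdict(set)
    for fields in csv.reader(uc_lines, delimiter='\t'):
        if len(fields) < 9 or fields[0] not in ('S', 'H'):
            continue  # skip 'C' summary records and malformed lines
        gene_id_to_clusters[fields[8]].add(int(fields[1]))

    # VF_Accession -> cluster numbers, read from the database rows.
    # Assumption: the columns are named 'gene_id' and 'VF_Accession'.
    accession_to_clusters = defaultdict(set)
    for row in csv.DictReader(db_rows):
        found = gene_id_to_clusters.get(row['gene_id'], set())
        accession_to_clusters[row['VF_Accession']] |= found

    return {acc: sorted(nrs) for acc, nrs in accession_to_clusters.items()}


# Made-up stand-ins for clusters.uc and db.csv.
uc_text = (
    "S\t0\t300\t*\t*\t*\t*\t*\tVFG003386\t*\n"
    "H\t0\t298\t97.0\t+\t0\t0\t298M\tVFG044259\tVFG003386\n"
    "S\t5\t410\t*\t*\t*\t*\t*\tVFG011941\t*\n"
    "C\t0\t2\t*\t*\t*\t*\t*\tVFG003386\t*\n"
)
db_text = (
    "gene_id,VF_Accession\n"
    "VFG044259,IA027\n"
    "VFG044258,IA027\n"
    "VFG011941,CVF397\n"
    "VFG003386,CVF051\n"
)
one_shot = clusters_by_accession(io.StringIO(uc_text), io.StringIO(db_text))
print(one_shot)
```

With real files you would pass open file objects instead of the io.StringIO wrappers.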

Source

2017-07-19 09:59:20

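I really appreciate your help :D

I actually used some of your code to solve my problem. I also learned something new about coding in Python, so thanks again :D

Glad to hear it helped!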
