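How to find matches between two lists and write output based on the matches?

I'm not sure whether I have worded the question title appropriately, but I have tried to explain the problem below. If you can think of a better title after reading it, please suggest one.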

Say I have two kinds of list data:

list_headers = ['gene_id', 'gene_name', 'trans_id'] 
# these are the features to be mined from each line of `attri_values` 

attri_values = 

['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'] 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'] 
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']

I am trying to build a table based on matches between the entries in `list_headers` and the attributes in `attri_values`.

output = open('gtf_table', 'w') 
output.write('\t'.join(list_headers) + '\n') # this will first write the header 

# then I want to read each line 
for values in attri_values: 
    for list in list_headers: 
     if values.startswith(list): 
      attr_id = ''.join([x for x in attri_values if list in x]) 
      attr_id = attr_id.replace('"', '').split(' ')[1] 
      output.write('\t' + '\t'.join([attr_id])) 

     elif not values.startswith(list): 
      attr_id = 'NA' 
      output.write('\t' + '\t'.join([attr_id])) 

     output.write('\n')

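Problem: when a string from `list_headers` matches a value in `attri_values`, everything works fine, but when there is no match, a lot of duplicate 'NA' values are written.

Final expected result: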

gene_id gene_name trans_id 
scaffold_200001.1 NA NA 
scaffold_200001.1 NA scaffold_200001.1 
scaffold_200002.1 NA scaffold_200002.1

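Post edit: The problem is how I wrote my `elif` (because every non-match writes 'NA'). I tried moving the NA condition around in different ways, but without success. If I remove the `elif`, I get this as output (the NA values are missing):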

gene_id gene_name trans_id 
scaffold_200001.1 
scaffold_200001.1 scaffold_200001.1 
scaffold_200002.1 scaffold_200002.1

2017-04-24 everestial007

Python strings have a `find` method, which you can use to search each entry of `attri_values` for each list header. Try this function:

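def Get_Match(search_space, search_string):
    # str.find returns -1 when search_string does not occur in search_space
    start_character = search_space.find(search_string)

    if start_character == -1:
        return "N/A"
    else:
        # return everything that follows the matched header
        return search_space[(start_character + len(search_string)):]

# attri_values_1 is assumed to be one record (one inner list) of attri_values
for i in range(len(attri_values_1)):
    for j in range(len(list_headers)):
        print(Get_Match(attri_values_1[i], list_headers[j]))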

2017-04-24 20:01:18

My answer uses pandas:

import pandas as pd 

# input data 
list_headers = ['gene_id', 'gene_name', 'trans_id'] 

attri_values = [ 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'], 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'], 
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']] 

# process input data 
attri_values_X = [dict([tuple(b.split())[:2] for b in a]) for a in attri_values] 

# Create DataFrame with the desired columns 
df = pd.DataFrame(attri_values_X, columns=list_headers) 

# print dataframe (print is a function in Python 3)
print(df)

Output

               gene_id  gene_name             trans_id
0  "scaffold_200001.1"        NaN                  NaN
1  "scaffold_200001.1"        NaN  "scaffold_200001.1"
2  "scaffold_200002.1"        NaN  "scaffold_200002.1"

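Doing it without pandas is easy as well. I have already given you `attri_values_X`, so you are almost there: just remove the keys you don't want from each dictionary.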
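
A minimal sketch of that pandas-free route, assuming the same `list_headers` and a shortened copy of `attri_values`. The helper name `build_rows`, the 'NA' placeholder for missing headers, and the choice to strip the double quotes are assumptions for illustration, not part of the original answer.

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']

attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200001.1"', 'trans_id "scaffold_200001.1"', 'exon_number "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"', 'exon_number "3"'],
]  # shortened copy of the question's data

def build_rows(headers, records, missing='NA'):
    """Return one list of column values per record (hypothetical helper)."""
    rows = []
    for record in records:
        # 'gene_id "x"' -> {'gene_id': '"x"'}; split only on the first space
        pairs = dict(item.split(' ', 1) for item in record)
        # keep only the wanted headers, falling back to the placeholder
        rows.append([pairs.get(h, missing).strip('"') for h in headers])
    return rows

rows = build_rows(list_headers, attri_values)
lines = ['\t'.join(list_headers)] + ['\t'.join(row) for row in rows]
table = '\n'.join(lines) + '\n'
print(table)
```

Using `dict.get` with a default means the placeholder is written exactly once per missing header, which is the behaviour the question asks for.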

2017-04-24 21:11:16 Elmex80s

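I managed to write a function that will help you parse your data. I tried to modify the original code you posted; what complicates things here is the way you store the data that needs to be parsed. Anyway, I'm not in a position to judge. Here is my code: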

def searchHeader(title, values):
    r"""
    searchHeader(title, values) --> list

    Return all the words of the strings in an iterable object in which title is a substring,
    without including title. Append 'N\A' for each string that does not contain title.
    Example:
      >>> seq = ['spam and ham', 'spam is awesome', 'Ham is...!', 'eat cake but not pizza']
      >>> searchHeader('spam', seq)
      ['and', 'ham', 'is', 'awesome', 'N\\A', 'N\\A']
    """
    res = []
    for x in values:
        if title in x:
            res.append(x)
        else:
            res.append('N\\A')      # If no match is found, append N\A for this string

    res = ' '.join(res)
    # res = res.replace('"', '')     You can use this here, or apply it after calling the function
    res = res.split(' ')
    res = [x for x in res if x != title]   # Remove the title string from res
    return res

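Regular expressions can also come in handy in this case. Use this function to parse the data, then format the result and write it to the file as a table. The function uses only one for loop and one list comprehension, whereas your code uses two nested for loops and a list comprehension.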
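
As a rough sketch of the regular-expression route mentioned above: each attribute has the shape `key "value"`, so a single pattern can pull out both parts. The pattern and the helper name `parse_attributes` are assumptions for illustration; they presume every value is wrapped in double quotes, as in the sample data.

```python
import re

# key, whitespace, then a double-quoted value (assumed format of every attribute)
ATTR_RE = re.compile(r'^(\w+)\s+"([^"]*)"$')

def parse_attributes(record):
    """Map attribute names to unquoted values for one record (hypothetical helper)."""
    parsed = {}
    for item in record:
        match = ATTR_RE.match(item.strip())
        if match:  # silently skip anything that does not look like key "value"
            parsed[match.group(1)] = match.group(2)
    return parsed

record = ['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"']
parsed = parse_attributes(record)
row = [parsed.get(h, 'NA') for h in ['gene_id', 'gene_name', 'trans_id']]
print(row)
```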

Pass each header string separately to `searchHeader`, as follows:

for title in list_headers:
    result = searchHeader(title, attri_values)
    # ...format as table...
    # ...write to file...

If possible, consider moving `attri_values` from a plain list to a dictionary, so that you can group the strings by their header:

attri_values = {'header': ('data1', 'data2',...)}

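In my opinion, this is better than using a list. Also note that the name `list` in your code shadows the built-in, which is not a good thing, because `list` is actually the built-in class used to create lists.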
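
To illustrate both points, here is a small sketch that groups the attribute values by header in a dictionary and uses `header` rather than `list` as a variable name. The function name `group_by_header` is hypothetical, and the sample strings are a shortened copy of the question's data.

```python
from collections import defaultdict

def group_by_header(records):
    """Collect the unquoted values of every attribute under its header name."""
    grouped = defaultdict(list)  # works because the built-in `list` is not shadowed
    for record in records:
        for item in record:
            # 'gene_id "x"' -> header 'gene_id', value '"x"'
            header, _, value = item.partition(' ')
            grouped[header].append(value.strip('"'))
    return dict(grouped)

records = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"'],
]
grouped = group_by_header(records)
print(grouped['gene_id'])
```

Grouping this way loses the link between a value and the record it came from, so it suits per-header lookups more than building a table with one row per record.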

2017-04-24 21:27:32 direprobs

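Thanks for the answer. Using a dictionary would be complicated, because this is only a small part of a large dataset. I think simple nested for loops would solve it. By the way, I get a `TypeError` with `result = searchHeader(list_headers, attri_values)` – everestial007

@everestial007 My bad! I should have passed `title` instead of `list_headers` to the function: `result = searchHeader(title, attri_values)`. This is probably the result of writing code late at night :P – direprobs

I understand the consequences of too much computer time and/or sleepiness. By the way, the code still doesn't solve the problem for me. I tried changing a few things: for example, instead of `if title in x:` I think it should be `if x.startswith(title)`, because otherwise there won't be a hit in the list comparison unless the whole string matches. I also tried changing other things, but no luck. Can you give me a complete working example, if that is possible? Please also upvote the question so that it gets more attention. – everestial007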
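
One way a complete working example could look, keeping the nested for loops and `startswith` mentioned in the comments. This is a sketch under assumptions: `attri_values` is taken to be a list of records (each a list of strings), the function name `write_table` is made up for illustration, and an in-memory buffer stands in for the output file. The key change from the question's code is that 'NA' is decided once per header, after all attributes of a record have been checked, instead of once per failed comparison.

```python
import io

def write_table(headers, records, output):
    """Write one tab-separated row per record, with 'NA' for missing headers."""
    output.write('\t'.join(headers) + '\n')
    for record in records:
        row = []
        for header in headers:
            value = 'NA'  # default, replaced only if some attribute matches
            for item in record:
                # the trailing space stops 'gene_id' from matching 'gene_id_x'
                if item.startswith(header + ' '):
                    value = item.split(' ', 1)[1].replace('"', '')
                    break
            row.append(value)
        output.write('\t'.join(row) + '\n')

list_headers = ['gene_id', 'gene_name', 'trans_id']
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200001.1"', 'trans_id "scaffold_200001.1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"'],
]  # shortened copy of the question's data

buffer = io.StringIO()
write_table(list_headers, attri_values, buffer)
print(buffer.getvalue())
```

To write to disk as in the question, the same function can be called with a real file object, for example inside `with open('gtf_table', 'w') as output:`.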
