2017-04-24 54 views
0

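How to find matches between two lists and write the output based on the matches?

I am not sure I worded the question title appropriately, but I have tried to explain the problem below. If you can think of a better title for this problem, please suggest one.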

Say I have two kinds of list data:

list_headers = ['gene_id', 'gene_name', 'trans_id'] 
# these are the features to be mined from each line of `attri_values` 

attri_values = [
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'],
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'],
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']]

I am trying to build a table by matching each entry of list_headers against the attributes in attri_values.

output = open('gtf_table', 'w') 
output.write('\t'.join(list_headers) + '\n') # this will first write the header 

# then I want to read each line 
for values in attri_values: 
    for list in list_headers: 
     if values.startswith(list): 
      attr_id = ''.join([x for x in attri_values if list in x]) 
      attr_id = attr_id.replace('"', '').split(' ')[1] 
      output.write('\t' + '\t'.join([attr_id])) 

     elif not values.startswith(list): 
      attr_id = 'NA' 
      output.write('\t' + '\t'.join([attr_id])) 

     output.write('\n') 

Problem: when a string from list_headers matches a value in attri_values, everything works well, but when there is no match, a lot of duplicate 'NA' entries are written.

Final expected output:

gene_id gene_name trans_id 
scaffold_200001.1 NA NA 
scaffold_200001.1 NA scaffold_200001.1 
scaffold_200002.1 NA scaffold_200002.1 

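Post edit: The problem is how I wrote my elif (because every non-match writes 'NA'). I tried moving the NA condition around in different ways, but without success. If I remove the elif, I get the following output (the NA values are missing):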

gene_id gene_name trans_id 
scaffold_200001.1 
scaffold_200001.1 scaffold_200001.1 
scaffold_200002.1 scaffold_200002.1 

Answers

1

Python strings have a find method, which you can use while iterating over each list header for each entry of attri_values. Try using this function:

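def Get_Match(search_space, search_string):
    # str.find returns the index of the first match, or -1 if there is none
    start_character = search_space.find(search_string)

    if start_character == -1:
        return "N/A"
    else:
        # everything that follows the matched header
        return search_space[(start_character + len(search_string)):]

# Assumption: attri_values_1 is one row of attri_values (a flat list of strings)
attri_values_1 = attri_values[0]
for i in range(len(attri_values_1)):
    for j in range(len(list_headers)):
        print(Get_Match(attri_values_1[i], list_headers[j]))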
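As a rough sketch of how the same idea could be applied to every row of the nested list, the loop below fills one cell per header and falls back to a single 'NA' when nothing in the row matches. It swaps `str.find` for `str.startswith`, so that a header only matches at the beginning of an attribute. The function name `build_rows` and the `missing` parameter are made up for this illustration, and the sample data is shortened from the question.

```python
def build_rows(list_headers, attri_values, missing="NA"):
    """Return one list of cell values per row of attri_values."""
    rows = []
    for attributes in attri_values:            # one row of 'key "value"' strings
        cells = []
        for header in list_headers:            # exactly one output cell per header
            cell = missing                     # default when no attribute matches
            for attribute in attributes:
                if attribute.startswith(header + " "):
                    # drop the header, the surrounding spaces and the quotes
                    cell = attribute[len(header):].strip().strip('"')
                    break
            cells.append(cell)
        rows.append(cells)
    return rows


list_headers = ['gene_id', 'gene_name', 'trans_id']
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"'],
]

rows = build_rows(list_headers, attri_values)
for cells in rows:
    print("\t".join(cells))
```

Because the default is set once per header and only overwritten on a match, each row always has as many cells as there are headers, which avoids the repeated 'NA' values described in the question.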
1

My answer uses pandas:

import pandas as pd 

# input data 
list_headers = ['gene_id', 'gene_name', 'trans_id'] 

attri_values = [ 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"'], 
['gene_id "scaffold_200001.1"', 'gene_version "1"', 'trans_id "scaffold_200001.1"', 'transcript_version "1"', 'exon_number "1"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200001.1.exon1"', 'exon_version "1"'], 
['gene_id "scaffold_200002.1"', 'gene_version "1"', 'trans_id "scaffold_200002.1"', 'transcript_version "1"', 'exon_number "3"', 'gene_source "jgi"', 'gene_biotype "protein_coding"', 'transcript_source "jgi"', 'transcript_biotype "protein_coding"', 'exon_id "scaffold_200002.1.exon3"', 'exon_version "1"']] 

# process input data 
attri_values_X = [dict([tuple(b.split())[:2] for b in a]) for a in attri_values] 

# Create DataFrame with the desired columns 
df = pd.DataFrame(attri_values_X, columns=list_headers) 

# print dataframe 
print(df)

Output

               gene_id gene_name             trans_id
0  "scaffold_200001.1"       NaN                  NaN
1  "scaffold_200001.1"       NaN  "scaffold_200001.1"
2  "scaffold_200002.1"       NaN  "scaffold_200002.1"

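It is easy without pandas as well. I have already given you attri_values_X, so you are almost there: just drop the keys you do not want from each dictionary.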
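One possible way to finish the job without pandas is sketched below. It reuses the same parsing step to build attri_values_X and then looks up each header with `dict.get`, so a missing key becomes 'NA'. The helper name `to_table` and its `missing` parameter are assumptions made for this example, and the input is shortened from the question.

```python
list_headers = ['gene_id', 'gene_name', 'trans_id']
attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"'],
]

# Same parsing step as in the pandas version: one dict per row.
attri_values_X = [dict(tuple(b.split())[:2] for b in a) for a in attri_values]


def to_table(records, headers, missing="NA"):
    """Build a tab-separated table with one line per record."""
    lines = ["\t".join(headers)]
    for record in records:
        # dict.get returns `missing` for headers that the record lacks
        cells = [record.get(header, missing).strip('"') for header in headers]
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


table = to_table(attri_values_X, list_headers)
print(table, end="")

# To write the file from the question instead of printing:
# with open('gtf_table', 'w') as output:
#     output.write(table)
```

Only the requested headers are looked up, so the unwanted keys never need to be deleted explicitly.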

1

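I managed to write a function that should help you parse your data. I tried to modify the original code you posted. What complicates things here is the way you store the data that needs to be parsed, but I am not in a position to judge. Here is my code: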

def searchHeader(title, values):
    r"""
    searchHeader(title, values) --> list

    Return all the words of the strings in an iterable in which title is a substring,
    without including title itself. Append 'N\A' for each string that does not contain title.
    Example:
      >>> seq = ['spam and ham', 'spam is awesome', 'Ham is...!', 'eat cake but not pizza']
      >>> searchHeader('spam', seq)
      ['and', 'ham', 'is', 'awesome', 'N\\A', 'N\\A']
    """
    res = []
    for x in values:
        if title in x:
            res.append(x)
        else:
            res.append('N\\A')      # If no match is found, append N\A for that string

    res = ' '.join(res)
    # res = res.replace('"', '')     You can use this here, or apply it to the result after calling the function
    res = res.split(' ')
    res = [x for x in res if x != title]   # Remove title string from res
    return res

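Regular expressions can also come in handy in this case. Use this function to parse the data, then format the result and write it to the file as a table. The function uses only one for loop and one list comprehension, whereas your code uses two nested for loops and one list comprehension.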

Pass each header string to the function separately, as follows:

for title in list_headers:
    result = searchHeader(title, attri_values)
    # ...format as table...
    # ...write to file...

If possible, consider moving attri_values from a plain list to a dictionary, so that you can group the strings by their headers:

attri_values = {'header': ('data1', 'data2',...)} 

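In my opinion, this is better than using a list. Also note that the name list in your code shadows a built-in, which is not a good thing, because list is actually the built-in class that creates lists.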
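To illustrate the regular-expression and dictionary suggestions together, here is a minimal sketch. It assumes that every attribute has the form `key "value"`. The pattern `ATTRIBUTE` and the function `group_by_header` are hypothetical names introduced for this example, and the data is shortened from the question.

```python
import re
from collections import defaultdict

# Assumed attribute format: a key, whitespace, then a double-quoted value.
ATTRIBUTE = re.compile(r'^(\w+)\s+"([^"]*)"$')


def group_by_header(attri_values):
    """Map each header to a tuple of all values found for it."""
    grouped = defaultdict(list)
    for attributes in attri_values:
        for attribute in attributes:
            match = ATTRIBUTE.match(attribute)
            if match:                      # silently skip malformed attributes
                key, value = match.groups()
                grouped[key].append(value)
    return {key: tuple(values) for key, values in grouped.items()}


attri_values = [
    ['gene_id "scaffold_200001.1"', 'gene_version "1"'],
    ['gene_id "scaffold_200002.1"', 'trans_id "scaffold_200002.1"'],
]

grouped = group_by_header(attri_values)
print(grouped['gene_id'])
```

Note that grouping by header loses the information about which row each value came from, so this layout suits lookups per header better than writing a row-by-row table.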

+0

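Thanks for the answer. Using a dictionary would be complicated, because this is only a small part of a large dataset. I think a simple nested for loop would solve it. By the way, I am getting a TypeError with 'result = searchHeader(list_headers, attri_values)' – everestial007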

+0

@everestial007 My bad! I should have passed 'title' instead of 'list_headers' to the function: 'result = searchHeader(title, attri_values)'. That is probably the result of writing code late at night :P – direprobs

+0

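I understand the consequences of too much computer time and/or sleepiness. By the way, the code still does not solve the problem for me. I tried changing a few things, e.g. **instead of 'if title in x:' I think it should be 'if x.startswith(title)', because there will not be a hit in a list comparison unless the whole string matches**. I also tried changing other things, but with no luck. Can you give me a complete working example? Also, if possible, please upvote the question so that it gets more attention. – everestial007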
