Split rows into multiple rows and keep the maximum second-column value for each gene

I'm new to Python and have prepared a script that should modify the following CSV file in two ways:

1) Each row may contain multiple gene entries separated by ///, for example:

C16orf52 /// LOC102725138 1.00551

should be transformed into:

C16orf52 1.00551 
LOC102725138 1.00551

2) The same gene may appear with different ratio values:

AASDHPPT 0.860705 
AASDHPPT 0.983691

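and we want to keep only the pair with the highest ratio (deleting the pair AASDHPPT 0.860705).

Here is the script I wrote, but it does not assign the correct ratio values to the genes: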

import csv
import pandas as pd

with open('2column.csv', newline='') as f:
    reader = csv.reader(f)
    a = list(reader)
gene = []
ratio = []
for t in range(len(a)):
    if '///' in a[t][0]:
        s = a[t][0].split('///')
        gene.append(s[0])
        gene.append(s[1])
        ratio.append(a[t][1])
        ratio.append(a[t][1])
    else:
        gene.append(a[t][0])
        ratio.append(a[t][1])
    gene[t] = gene[t].strip()

newgene = []
newratio = []
for i in range(len(gene)):
    g = gene[i]
    r = ratio[i]
    if g not in newgene:
        newgene.append(g)
    for j in range(i+1, len(gene)):
        if g == gene[j]:
            if ratio[j] > r:
                r = ratio[j]
    newratio.append(r)

for i in range(len(newgene)):
    print(newgene[i] + '\t' + newratio[i])

if len(newgene) > len(set(newgene)):
    print('missionfailed')

Thank you very much for any help or suggestions.


2017-07-03 Python kindergarten developer

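Hi Manolis, you should probably read [How to create a Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) – danihp

I think you very likely want to store the genes in a dictionary; when assigning a value, if the key already exists, ignore the new value unless it is greater than the current one. – Peter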
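For what it's worth, the script in the question seems to go wrong for three reasons: newratio is appended on every pass of the loop while newgene only grows for unseen genes, so the two lists drift out of step; the ratios are compared as strings (as text, '9.5' > '10.2' is true); and strip() is only applied to gene[t], not to every split name. Below is a minimal sketch of Peter's dictionary idea that sidesteps all three. The function name max_ratio_per_gene and the in-memory rows list are made up for illustration; they are not part of the original script.

```python
def max_ratio_per_gene(rows):
    """Map each gene name to the largest ratio found for it.

    rows is an iterable of (gene_field, ratio_text) pairs, where gene_field
    may hold several names separated by '///'.
    """
    best = {}
    for gene_field, ratio_text in rows:
        ratio = float(ratio_text)  # compare numbers, not strings
        for name in gene_field.split('///'):
            name = name.strip()  # drop the spaces around each name
            # keep the ratio if the gene is new or this ratio is larger
            if name not in best or ratio > best[name]:
                best[name] = ratio
    return best


# hypothetical sample rows, taken from the examples in the question
rows = [
    ('C16orf52 /// LOC102725138', '1.00551'),
    ('AASDHPPT', '0.860705'),
    ('AASDHPPT', '0.983691'),
]
result = max_ratio_per_gene(rows)
for name, ratio in result.items():
    print(name, ratio)
```

Because each gene is a dictionary key, there is only ever one value per gene, so names and ratios cannot fall out of alignment.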

Try this:

import csv

# read the file as plain text, one record per line
with open('2column.csv') as f:
    lines = f.read().splitlines()

new_lines = {}
for line in lines:
    cols = line.split(',')
    for part in cols[0].split('///'):
        part = part.strip()
        # keep the value if the gene is new or the ratio is larger
        if part not in new_lines:
            new_lines[part] = cols[1]
        elif float(cols[1]) > float(new_lines[part]):
            new_lines[part] = cols[1]

# newline='' is what the csv module expects for text-mode output in Python 3
with open('clean_2column.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ',
                        quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for k, v in new_lines.items():
        writer.writerow([k, v])


2017-07-03 13:35:41

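Thanks for your help. However, the following error appears: Traceback (most recent call last): File "gsea.py", line 10, in <module> new_lines[part] = cols[1] IndexError: list index out of range. What do you suggest?

That is probably because you tested on a different CSV file from the one you shared. (Check that you have the same delimiter ','.)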
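An IndexError on cols[1] means that line.split(',') produced fewer than two pieces, which happens for blank lines or for files that use another delimiter. As a rough sketch of one way to guard against that, the helper below skips lines it cannot parse instead of crashing. The name parse_line and its delimiter parameter are assumptions made for this example, not something from the answer above.

```python
def parse_line(line, delimiter=','):
    """Return (gene_field, ratio), or None when the line cannot be parsed."""
    cols = line.split(delimiter)
    if len(cols) < 2:
        # blank line, or the file uses a different delimiter
        return None
    try:
        ratio = float(cols[1])
    except ValueError:
        # e.g. a header row whose second column is not a number
        return None
    return cols[0].strip(), ratio


print(parse_line('C16orf52 /// LOC102725138,1.00551'))
print(parse_line(''))                                # blank line -> None
print(parse_line('AASDHPPT\t0.860705'))              # wrong delimiter -> None
print(parse_line('AASDHPPT\t0.860705', delimiter='\t'))
```

If every line comes back as None, that is a strong hint that the delimiter does not match the file.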

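First of all, if you are going to import pandas, be aware that it has I/O tools for reading CSV files.

So first, let's load the file that way:

import pandas as pd
df = pd.read_csv('2column.csv')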

Then you can extract the indices of the rows that contain the '///' pattern:

l = list(df[df['Gene Symbol'].str.contains('///')].index)

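Then you can create the new rows:

new_rows = [[sub.strip(), df.loc[i, 'Ratio(ifna vs. ctrl)']]
            for i in l for sub in df.loc[i, 'Gene Symbol'].split('///')]
# fresh index labels keep the new rows from colliding with the old ones
new_index = range(len(df), len(df) + len(new_rows))
df = pd.concat([df, pd.DataFrame(new_rows, columns=df.columns, index=new_index)])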

Then drop the old ones:

df=df.drop(df.index[l])

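Then I would use a small trick to remove the lower duplicate values. First I sort by 'Ratio(ifna vs. ctrl)' in descending order, then I drop all the duplicates but the first:

df = df.sort_values('Ratio(ifna vs. ctrl)', ascending=False).drop_duplicates('Gene Symbol', keep='first')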

If you want to sort by gene symbol and reset the index to get something cleaner, simply do:

df = df.sort_values('Gene Symbol').reset_index(drop=True)

If you want to export your modified data back to CSV, do:

df.to_csv('2column.csv', index=False)

Edit: I edited my answer to correct syntax errors. I have tested this solution with your CSV and it works perfectly :)


2017-07-03 13:58:14

This should work.

It uses Peter's dictionary suggestion.

import csv

with open('2column.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    original_file = list(reader)
    # gets rid of the header
    original_file = original_file[1:]

# create an empty dictionary mapping gene name -> largest ratio seen
genes_ratio = {}

# loop over every row in the original file
for row in original_file:
    gene_name = row[0]
    # convert to float so ratios are compared numerically, not as strings
    gene_ratio = float(row[1])
    # split on /// (a name without /// yields a single component)
    for gene in gene_name.split('///'):
        # remove the spaces surrounding each component
        gene = gene.strip()
        # store the ratio if the gene is new or the stored value is smaller
        if gene not in genes_ratio or genes_ratio[gene] < gene_ratio:
            genes_ratio[gene] = gene_ratio

# loop over dictionary and print gene names and their ratio values
for key in genes_ratio:
    print(key, genes_ratio[key])


2017-07-03 14:29:57 error
