2013-07-04 21 views

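Creating a fingerprint file in Python

I have another newbie Python question. I have a file that looks like the one below, and I need to convert it into a vector form and a fingerprint (binary) form. My problem is how to combine the rows so that I end up with a matrix where the rows are the cmps and the columns are the vals; if a val is missing for a cmp, the entry should be zero. The vals differ from one cmp to the next, and the overlap is not very large. Could you suggest the best way to do this? A Python dictionary? Any ideas would help. Thanks!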

cmp1 0.277 val_1 
cmp1 0.097 val_2 
cmp1 0.795 val_3 
cmp1 0.809 val_4 
cmp1 0.127 val_5 
cmp2 0.839 val_3 
cmp2 0.909 val_4 
cmp2 0.148 val_5 
cmp2 0.938 val_6 
cmp2 0.599 val_7 

As a result, I need to get:

Vector version

name val_1 val_2 val_3 val_4 val_5 val_6 val_7 
cmp1 0.277 0.097 0.795 0.809 0.127 0 0 
cmp2 0 0 0.839 0.909 0.148 0.938 0.599 

Binary version

name val_1 val_2 val_3 val_4 val_5 val_6 val_7 
cmp1 0 0 1 1 0 0 0 
cmp2 0 0 1 1 0 1 1 

Current code

import csv 

fi = open("data.txt", "rb") 
fo = open("data_out.txt", "wb") 
reader = csv.reader(fi,delimiter='\t') 
writer = csv.writer(fo,delimiter='\t') 

# making unique lists 
targets = set() 
ligands = set() 

for row in reader: 
    ligands.add(row[0]) 
    targets.add(row[2]) 

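data = [] 
fi.seek(0)  # rewind: the first loop already consumed the reader 
for row in reader: 
    if row[0] in ligands and row[2] in targets: 
        pass  # unfinished: not sure how to fill the matrix from here 
    else: 
        pass 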

Answer


You can use collections.defaultdict here:

from collections import defaultdict 

dic = defaultdict(dict) 
with open('abc') as f: 
    for line in f: 
        # each line holds: compound name, value, column name 
        name, val, col = line.split() 
        dic[name][col] = val 
print(dic) 
# prints (Python 2 repr; key order may vary): 
# defaultdict(<type 'dict'>, 
#{'cmp1': {'val_5': '0.127', 'val_4': '0.809', 'val_1': '0.277', 'val_3': '0.795', 'val_2': '0.097'}, 
# 'cmp2': {'val_5': '0.148', 'val_4': '0.909', 'val_7': '0.599', 'val_6': '0.938', 'val_3': '0.839'}}) 

# get a sorted list of all val_i from the dic 
vals = sorted(set(y for x in dic.values() for y in x)) 

keys = sorted(dic) 
print("name {}".format("\t".join(vals))) 
for key in keys: 
    # a val that is missing for this cmp is written as '0' 
    print("{} {}".format(key, "\t".join(dic[key].get(v, '0') for v in vals))) 

Output:

name val_1 val_2 val_3 val_4 val_5 val_6 val_7 
cmp1 0.277 0.097 0.795 0.809 0.127 0 0 
cmp2 0 0 0.839 0.909 0.148 0.938 0.599 
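
If you want the values as numbers rather than strings, the same pivot can be sketched for Python 3 as below. This is only a sketch under assumptions: `build_matrix` and `SAMPLE` are names made up for illustration, and the input is assumed to be whitespace-separated with exactly three fields per line.

```python
from collections import defaultdict

# Sample rows copied from the question: cmp, value, column name.
SAMPLE = """\
cmp1 0.277 val_1
cmp1 0.097 val_2
cmp1 0.795 val_3
cmp1 0.809 val_4
cmp1 0.127 val_5
cmp2 0.839 val_3
cmp2 0.909 val_4
cmp2 0.148 val_5
cmp2 0.938 val_6
cmp2 0.599 val_7
"""


def build_matrix(lines):
    """Pivot 'cmp value column' lines into (columns, {cmp: [values]}).

    Hypothetical helper: a missing (cmp, column) pair becomes 0.0.
    """
    table = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, value, column = line.split()
        table[name][column] = float(value)
    # Union of all column names, sorted as plain strings.
    columns = sorted({col for row in table.values() for col in row})
    matrix = {name: [table[name].get(col, 0.0) for col in columns]
              for name in sorted(table)}
    return columns, matrix


columns, matrix = build_matrix(SAMPLE.splitlines())
print("name\t" + "\t".join(columns))
for name, row in matrix.items():
    print(name + "\t" + "\t".join("{:g}".format(v) for v in row))
```

One caveat: the columns are sorted as strings, so a name like val_10 would sort before val_2. If your real column names carry numbers like that, you would probably want a numeric sort key instead.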

For the binary version, you can try:

print("name {}".format("\t".join(vals))) 
for key in keys: 
    # round each value to 0 or 1; a val missing for this cmp becomes '0' 
    strs = "\t".join(str(int(round(float(dic[key][v])))) if v in dic[key] else '0' for v in vals) 
    print("{} {}".format(key, strs)) 

Output:

name val_1 val_2 val_3 val_4 val_5 val_6 val_7 
cmp1 0 0 1 1 0 0 0 
cmp2 0 0 1 1 0 1 1 
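
One thing to watch with `round`: exact halves are handled differently in Python 2 and Python 3 (Python 3 rounds 0.5 to 0), so an explicit cutoff may be clearer. A minimal sketch, assuming a cutoff of 0.5; the names `to_fingerprint`, `threshold` and `rows` are invented here and do not come from the question.

```python
def to_fingerprint(values, threshold=0.5):
    """Return 1 for each value at or above the threshold, else 0.

    The 0.5 cutoff is an assumption that mirrors rounding to the
    nearest integer; pick whatever cutoff suits your data.
    """
    return [1 if v >= threshold else 0 for v in values]


# Vector rows taken from the vector output earlier in this answer
# (missing entries are already 0).
rows = {
    "cmp1": [0.277, 0.097, 0.795, 0.809, 0.127, 0, 0],
    "cmp2": [0, 0, 0.839, 0.909, 0.148, 0.938, 0.599],
}

for name in sorted(rows):
    bits = to_fingerprint(rows[name])
    print("{0}\t{1}".format(name, "\t".join(str(b) for b in bits)))
```

With the sample data this gives the same bits as the rounding approach; the two only differ for values sitting exactly on the cutoff.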

I'm using Python 2.7 and get the following error. Traceback (most recent call last): File "", line 2, in ValueError: zero length field name in format


@JohnAmraph I tested the code again, and it works fine for me.


Using "{0} {1}" instead of "{} {}" did the trick! Thanks!
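
For anyone hitting the same ValueError: automatic field numbering (`{}`) in `str.format` only exists from Python 2.7 and 3.1 onward, so this error usually means the script is actually being run by an older interpreter such as 2.6. Explicit indices work wherever `str.format` does. A small sketch (the variable names are just for illustration):

```python
key = "cmp1"
cells = ["0.277", "0.097"]

# Explicit indices: accepted by every Python version that has str.format.
explicit = "{0} {1}".format(key, "\t".join(cells))

# Auto-numbered fields: same result, but only on Python 2.7+ / 3.1+.
automatic = "{} {}".format(key, "\t".join(cells))

print(explicit == automatic)
```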