Custom format ID mapping

I have two databases (txt files). The first has two tab-separated columns: a name and an ID.

name1 \t ID1 
name1 \t ID2 
name2 \t ID9 
name2 \t ID40 
name3 \t ID3

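The other database has the same IDs in its first column, while its second column lists IDs of the same kind separated by commas (these are the identifiers of the children of the ID in the first column, since the second database is hierarchical).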

ID1 \t ID1,ID2,ID3 
ID2 \t ID2, ID9

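What I would like to do is create a third database in the same format as the second one, but with the child IDs in the second column swapped for the names from the first database. For example: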

ID1 \t name1,name2,name3 
ID2 \t name1,name2

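Is there a way to do this? I am a beginner. I have mapped IDs with web services before, but this is a custom format needed for further analysis, and I am not sure where to start.

Thanks in advance!

Source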

2016-07-25 Márton Oelbei

Databases = tables? Flat files? –

Yes, they are simple txt files, sorry if I was unclear. –

This may be too broad, especially with the question tagged python, r and bash. – dayne

import csv

# Reading the first db is simple since there's only a fixed delimiter.
# Use the csv module to split the lines and build a dictionary that maps ID to name.

id_dictionary = {}
with open('db_1.txt', 'r') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for line in reader:
        # Strip stray spaces around each field before storing the pair
        id_dictionary[line[1].strip()] = line[0].strip()

# We can again split on tab, but that returns 'ID1,ID2,ID3' etc. as a single
# string that we call split() on later.

row_data = []
with open('db_2.txt', 'r') as infile:
    reader = csv.reader(infile, delimiter='\t')
    for line in reader:
        # ID remains unchanged, so keep the first value
        row = [line[0].strip()]

        # Split the string into individual IDs, dropping any space after a comma
        id_codes = [item.strip() for item in line[1].split(',')]

        # List comprehension to look up each ID in the dictionary and return
        # the name stored against it (None if the ID is not in the first db)
        translated = [id_dictionary.get(item) for item in id_codes]

        # Add translated to the list that we are using to represent a row
        row.extend(translated)

        # Append the row to our collection of rows
        row_data.append(row)

with open('db_3.txt', 'w') as outfile:
    for row in row_data:
        outfile.write(row[0])
        outfile.write('\t')
        outfile.write(','.join(map(str, row[1:])))  # Join values by a comma
        outfile.write('\n')

Source

2016-07-25 15:37:46 roganjosh

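Thank you very much, this works perfectly and I was able to understand it :). I added 'import sys' and 'csv.field_size_limit(sys.maxsize)' to make it work, because the text files have some very large fields. –

@MártonOelbei I didn't realise the files were that large. Surely it is worth looking at a different storage system, whether a database or something like cPickle output, rather than another text file? Otherwise you will be scanning through everything line by line every time you want to look something up. – roganjosh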
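The field size tweak mentioned in this comment can be sketched as below. This is only an illustration, not the asker's actual script: the helper name `raise_csv_field_limit` is made up here, and the back-off loop is an added precaution because `csv.field_size_limit(sys.maxsize)` can raise `OverflowError` on some platforms.

```python
import csv
import io
import sys


def raise_csv_field_limit():
    """Set the csv field size limit as high as the platform allows.

    Hypothetical helper for illustration; it is not part of the csv module.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms cannot accept sys.maxsize here; back off and retry
            limit //= 10


raise_csv_field_limit()

# Build a second column of 199,999 characters, longer than the csv module's
# default limit of 131072 characters per field
big_field = ",".join(["ID1"] * 50000)
sample = io.StringIO("ID1\t" + big_field + "\n")

# With the raised limit the oversized field is read without csv.Error
row = next(csv.reader(sample, delimiter="\t"))
```

The limit has to be raised before the reader starts consuming rows, so the call belongs near the top of the script.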
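As a rough sketch of the database suggestion in this comment, the ID-to-name mapping could live in SQLite instead of a text file. The table name `id_names`, the helper `lookup`, and the in-memory connection are all assumptions made for this illustration; nobody in the thread specified a schema.

```python
import sqlite3

# In-memory database for the sketch; pass a file path instead to persist it
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE id_names (id TEXT PRIMARY KEY, name TEXT NOT NULL)")

# (name, ID) pairs in the order the first file lists them
pairs = [("name1", "ID1"), ("name1", "ID2"), ("name2", "ID9"),
         ("name2", "ID40"), ("name3", "ID3")]
conn.executemany("INSERT INTO id_names (id, name) VALUES (?, ?)",
                 [(id_code, name) for name, id_code in pairs])
conn.commit()


def lookup(conn, id_code):
    """Return the name stored for id_code, or None if the ID is unknown."""
    row = conn.execute("SELECT name FROM id_names WHERE id = ?",
                       (id_code,)).fetchone()
    return row[0] if row else None


# The primary key index answers each lookup without scanning the whole table
names = [lookup(conn, code) for code in ("ID2", "ID9")]
missing = lookup(conn, "ID99")
```

Whether this pays off depends on how often the mapping is queried; for a one-off conversion the plain dictionary approach is likely enough.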

You can try this one-line awk script:

awk -v FS="\t|," -v OFS="," 'FILENAME=="file_name.txt" {str[$2]=$1;next;} {for(i=2;i<=NF;i++) {sub($i,str[$i],$i)};a=$1;$1="";print a"\t"$0}' file_name.txt fileID.txt|sed -e 's/,//' -e 's/,$//'

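The "file_name.txt" passed to awk is the txt file whose first column has "name1, name...", while "fileID.txt" has "ID1, ID2, ..." in its first column.

sed trims the unnecessary commas at the start and end of the list.

Source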
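The field-splitting idea behind the one-liner (treating both tab and comma as separators) can also be sketched in Python. This is not a line-by-line translation of the awk script: the function name `translate_line` and the choice to leave unknown IDs unchanged are assumptions made for the example.

```python
import re

# Mapping as it would be built from the first file (ID -> name)
id_to_name = {"ID1": "name1", "ID2": "name1", "ID9": "name2",
              "ID40": "name2", "ID3": "name3"}


def translate_line(line, id_to_name):
    """Replace the child IDs on one line of the second file with names."""
    # Split on a tab or a comma, swallowing spaces around the separator
    fields = re.split(r" *[\t,] *", line.strip())
    parent, children = fields[0], fields[1:]
    # Unknown IDs are left as they are (an assumption of this sketch)
    names = [id_to_name.get(child, child) for child in children]
    return parent + "\t" + ",".join(names)


result = translate_line("ID2 \t ID2, ID9\n", id_to_name)
```

Because the parent ID and the child names are joined explicitly, no trailing comma is produced and no sed clean-up step is needed.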

2016-07-25 16:04:15 AwkMan

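# Suppose the input files are f1.txt and f2.txt; the result is written to f3.txt
# Use a dict to hold the ID -> name pairs
def getArr(f):
    # Read every line of f and split it on tabs
    i = f.readline()
    arr = []
    while i:
        i = i.replace('\n', '')
        arr.append(i.split('\t'))
        i = f.readline()
    return arr

if __name__ == "__main__":
    f1 = open("f1.txt")
    f2 = open("f2.txt")
    f3 = open('f3.txt', 'w')
    arr1 = getArr(f1)
    arr2 = getArr(f2)
    dic = {}
    for array in arr1:
        dic[array[1].strip()] = array[0].strip()
    for i in arr2:
        # Drop any space left after a comma so the IDs match the dict keys
        keys = [key.strip() for key in i[1].split(',')]
        print(keys)
        line = i[0].strip() + '\t'
        for key in keys:
            # Assumes every key is in dic; dic.get(key) returns None otherwise
            line += dic.get(key) + ','
        line = line[:-1] + '\n'
        f3.write(line)
    f1.close()
    f2.close()
    f3.close()

Source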

2016-07-25 16:09:31

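I'm not sure how much this adds compared to the answer I submitted earlier. It offers no explanation, doesn't use context managers to handle the files, and is really hard to follow. 'line += dic.get(key) + ','' is not good practice at all. – roganjosh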
