2013-03-27 64 views
0

I have two text files. Match on different columns and combine them using Python

The first is a space-separated list:

23 dog 4 
24 cat 5 
28 cow 7 

The second is a '|'-separated list:

?dog|parallel|numbering|position 
Dogsarebarking 
?cat|parallel|nuucers|position 
CatisBeautiful 

I want an output file like the following:

?dog|parallel|numbering|position|23
?cat|parallel|nuucers|position|24 

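That is, a '|'-separated list containing the values from the second file, with the value from the first column of the first file appended wherever the values in the second column of both files match.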

+1

Looks like the second column – moooeeeep 2013-03-27 15:58:10

+0

JOIN .. wait, what? Where did all those lines without pipes come from? – DSM 2013-03-27 16:53:28

+0

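This comes from my huge data set of files about different animals. Only one text file contains data like this, so I want to handle it separately – Rocket 2013-03-27 16:55:30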

Answers

3

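Use a dictionary to store the rows of file1 and the csv module to write the output. The second file looks like FASTA format, so we only need the lines that start with ?: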

import csv

# Map name -> ID from the space-separated file, e.g. "23 dog 4" -> {'dog': '23'}
with open('file1') as file1:
    file1_data = dict(line.split(None, 2)[1::-1] for line in file1 if line.strip())

# newline='' lets the csv module control line endings (Python 3)
with open('file2') as file2, open('output', 'w', newline='') as outputfile:
    output = csv.writer(outputfile, delimiter='|')
    for line in file2:
        if line[:1] == '?':
            row = line.strip().split('|')
            key = row[0][1:]  # drop the leading '?'
            if key in file1_data:
                output.writerow(row + [file1_data[key]])

For your example input, this produces:

?dog|parallel|numbering|position|23
?cat|parallel|nuucers|position|24
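The same dictionary lookup can also be sketched as a small function that works on in-memory lines, which makes it easy to try without creating files. The function name `append_ids` and the `marker` parameter are hypothetical names chosen for this illustration; the sample data is copied from the question.

```python
def append_ids(id_lines, record_lines, marker="?"):
    """Append the ID from the space-separated lines to each matching header line."""
    # Map name -> ID, e.g. "23 dog 4" becomes {"dog": "23"}.
    ids = {}
    for line in id_lines:
        fields = line.split()
        if len(fields) >= 2:
            ids[fields[1]] = fields[0]

    joined = []
    for line in record_lines:
        if not line.startswith(marker):
            continue  # skip the free-text lines between header lines
        row = line.strip().split("|")
        key = row[0][len(marker):]  # "?dog" -> "dog"
        if key in ids:
            joined.append("|".join(row + [ids[key]]))
    return joined


file1 = ["23 dog 4", "24 cat 5", "28 cow 7"]
file2 = [
    "?dog|parallel|numbering|position",
    "Dogsarebarking",
    "?cat|parallel|nuucers|position",
    "CatisBeautiful",
]
result = append_ids(file1, file2)
print("\n".join(result))
```

Building the dictionary once keeps each lookup cheap, so the second file can be streamed line by line even when it is large.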

+0

It looks like he wants to use the first file as a mapping, rather than just zipping them together. – interjay 2013-03-27 16:00:38

+0

@interjay: Yes, corrected. – 2013-03-27 16:00:56

+0

@MartijnPieters He is working with files, so the output should go to an output text file. Also, +1 for answer #1 – 2013-03-27 16:09:43

3

This is the kind of task at which the pandas library excels:

import pandas as pd
df1 = pd.read_csv("c1.txt", sep="|", header=None).dropna()
df2 = pd.read_csv("c2.txt", sep=" ", header=None)
merged = df1.merge(df2, on=1).iloc[:, :-1]  # .ix was removed in pandas 1.0
merged.to_csv("merged.csv", sep="|", header=False, index=False)

Some explanation follows. First, we read the files into objects called DataFrames:

>>> df1 = pd.read_csv("c1.txt", sep="|", header=None).dropna()
>>> df1
               0      1          2         3
0      ?parallel    dog  numbering  position
3      ?parallel    cat    nuucers  position
6  ?non parallel  honey  numbering  position
>>> df2 = pd.read_csv("c2.txt", sep=" ", header=None)
>>> df2
    0    1  2
0  23  dog  4
1  24  cat  5
2  28  cow  7

.dropna() skips the rows where some of the data is missing. Alternatively, df1 = df1[df1[0].str.startswith("?")] would be another way to do it.
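The two filters can be compared side by side. The sketch below assumes pandas is installed and uses a small in-memory sample (a hypothetical stand-in for c1.txt) read through `io.StringIO` instead of a file.

```python
import io

import pandas as pd

# Hypothetical stand-in for c1.txt: header lines alternate with free-text lines.
sample = (
    "?parallel|dog|numbering|position\n"
    "Dogsarebarking\n"
    "?parallel|cat|nuucers|position\n"
    "CatisBeautiful\n"
)

raw = pd.read_csv(io.StringIO(sample), sep="|", header=None)

# Option 1: drop every row that has at least one missing field.
by_dropna = raw.dropna()

# Option 2: keep only the rows whose first field starts with "?".
by_prefix = raw[raw[0].str.startswith("?")]

print(by_dropna)
print(by_prefix)
```

On this sample both filters keep the same rows. They can differ on other data: dropna() also discards a "?" line that is missing a field, while the prefix test keeps it.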

Then we merge them on column 1 (the second column):

>>> df1.merge(df2, on=1)
         0_x    1        2_x         3  0_y  2_y
0  ?parallel  dog  numbering  position   23    4
1  ?parallel  cat    nuucers  position   24    5

We don't need that last column, so we slice it off:

>>> df1.merge(df2, on=1).iloc[:, :-1]
         0_x    1        2_x         3  0_y
0  ?parallel  dog  numbering  position   23
1  ?parallel  cat    nuucers  position   24

and then we write it out using to_csv, producing:

>>> !cat merged.csv 
?parallel|dog|numbering|position|23 
?parallel|cat|nuucers|position|24 

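Now, for a lot of simple tasks pandas can be overkill, and it's important to learn how to use lower-level tools like the csv module too. OTOH, when you just want to get something done, it's very, very convenient.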

+0

I have to conclude that pandas is pretty awesome. – moooeeeep 2013-03-27 16:32:28

0

This seems to be exactly what JOIN is for in relational databases.

An inner join is the most common join operation used in applications and can be regarded as the default join-type. Inner join creates a new result table by combining column values of two tables (A and B) based upon the join-predicate. The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is satisfied, column values for each matched pair of rows of A and B are combined into a result row.
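The definition quoted above can be illustrated in plain Python with two nested loops. This is only a conceptual sketch: the function name `inner_join` is hypothetical, the sample rows are made up to resemble the files in the question, and a real database engine is free to use a faster strategy than comparing every pair of rows.

```python
def inner_join(table_a, table_b, predicate):
    """Return one combined row for every (a, b) pair that satisfies the predicate."""
    result = []
    for a in table_a:        # compare each row of A ...
        for b in table_b:    # ... with each row of B
            if predicate(a, b):
                result.append(a + b)
    return result


table1 = [("23", "dog", "4"), ("24", "cat", "5"), ("28", "cow", "7")]
table2 = [
    ("?parallel", "dog", "numbering", "position"),
    ("?parallel", "cat", "nuucers", "position"),
]

# Join predicate: the second columns of both tables are equal.
rows = inner_join(table2, table1, lambda a, b: a[1] == b[1])
for row in rows:
    print("|".join(row))
```

Rows without a partner, such as the cow row, simply do not appear in the result, which is what distinguishes an inner join from an outer join.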

Have a look at this example:

import sqlite3

conn = sqlite3.connect('example.db')

# get hands on the database
c = conn.cursor()

# create and populate table1 (the space-separated file)
c.execute("DROP TABLE IF EXISTS table1")  # IF EXISTS avoids an error on the first run
c.execute("CREATE TABLE table1 (col1 text, col2 text, col3 text)")
with open("file1") as f:
    for line in f:
        fields = line.split()
        if len(fields) == 3:  # skip blank or malformed lines
            c.execute("INSERT INTO table1 VALUES (?, ?, ?)", fields)

# create and populate table2 (the '|'-separated file)
c.execute("DROP TABLE IF EXISTS table2")
c.execute("CREATE TABLE table2 (col1 text, col2 text, col3 text, col4 text)")
with open("file2") as f:
    for line in f:
        fields = line.strip().split('|')
        if len(fields) == 4:  # skip the lines without pipes
            c.execute("INSERT INTO table2 VALUES (?, ?, ?, ?)", fields)

# make changes persistent
conn.commit()

# retrieve desired data and write it to file
with open("file3", "w") as f:
    for x in c.execute(
        """
        SELECT table2.col1
             , table2.col2
             , table2.col3
             , table2.col4
             , table1.col1
        FROM table1 JOIN table2 ON table1.col2 = table2.col2
        """):
        f.write("%s\n" % "|".join(x))

# close connection
conn.close()

The output file should look like this:

parallel|dog|numbering|position|23
parallel|cat|nuucers|position|24 