2017-05-08 58 views
0

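Compare one column across all CSV files in a folder with Python and output the results

I have 80 CSV files in a folder and want to compare the first column of each file (my files have no header row) against the first column of every other file, without repeating a pair (for example fileA,fileB and then fileB,fileA). The column may contain thousands of rows, with one username per row. The goal is to output a new CSV file like this: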

output.csv:

fileA,fileB,3,'James'-'samuel'-'Gregg' 

fileA,filec,5,'Gregg'-'Traba'-'foo' 

So I started trying to work it out, but I got stuck in endless for loops:

import csv
output = open('output.csv', 'wb')
writer = csv.writer(output)
list_files = ['fileA.csv', 'fileB.csv', 'fileC.csv', 'fileD.csv', 'fileE.csv']
for file1 in list_files:
    csv_obj = csv.reader(open(file1, 'rb'))
    for file2 in list_files:
        csv_obj2 = csv.reader(open(file2, 'rb'))
        for line in csv_obj:
            for line1 in csv_obj2:
                if line == line1 ....

At this point I don't see how to avoid these endless loops. What should I use instead?

Update

Sample CSV files:

file1.csv:

7627012826,jamesGam,followers,623,370,5,293,Tue 
2955713991,samRichard,followers,3769,3383,45,170,Wed 
250898317,CamalSarj,followers,1352,2365,111,10954,Sat 
928898317,JangiBell,followers,9152,2365,731,74954,Sat 

file2.csv:

118898359,JangiBell,followers,73152,9815,381,177954,Sat 
9227010126,jorgebel,followers,7223,37550,5,9193,Sat 
1105742991,samRichard,followers,7609,8283,985,285,Wed 
623898922,Estovagre,followers,956,8393,921,1981,Tue 

The output in output.csv would have this format:

file1,file2,2,'samRichard'-'JangiBell' 
+0

Can you explain the format of 'output.csv'? –

+0

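output.csv consists of entries, one per row. Each row holds the two file names, then the total number of usernames the two files have in common, then all the usernames found in both files, e.g.: fileA,fileB,2,'Jame'-'Sal' –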

+0

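@Joesal, this looks like an easy task for pandas... Could you provide a small reproducible sample data set in CSV format (e.g. 2 CSV files, each with 2-3 columns and 3 rows) together with the desired result CSV, so we can see exactly what you want to achieve? – MaxU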

Answers

1

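I only use pandas here to read and write the CSV files. As I see it, the main parts of the logic you need are the set intersection (to get the names in common) and the pairwise matching of files.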
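As a minimal sketch of those two pieces in isolation, the snippet below uses in-memory sets instead of files. The `file1` and `file2` names come from the question's sample data; `file3` and its contents are hypothetical, added only so that there is more than one pair. `itertools.combinations` is one way to get each unordered pair exactly once.

```python
from itertools import combinations

# In-memory stand-ins for the username column of each file.
# file1/file2 mirror the question's samples; file3 is a made-up extra.
usernames = {
    "file1": {"jamesGam", "samRichard", "CamalSarj", "JangiBell"},
    "file2": {"JangiBell", "jorgebel", "samRichard", "Estovagre"},
    "file3": {"jorgebel", "jamesGam"},
}

# combinations() yields each unordered pair once:
# (file1, file2) appears, (file2, file1) never does.
pairs = list(combinations(sorted(usernames), 2))

# The & operator on sets returns the names both files contain.
common = {(a, b): usernames[a] & usernames[b] for a, b in pairs}

for (a, b), names in common.items():
    print(a, b, len(names), sorted(names))
```

With 80 files this gives 80 * 79 / 2 = 3160 pairs, and each file only has to be read once because the comparison works on the sets held in memory.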

import csv
import pandas as pd

files = ['file1.csv', 'file2.csv']  # use os.listdir here if you want

usernames = {}
output = []

# load the username column that you're interested in
# into a dict:
# keys are the filenames;
# values are the usernames, but as a set

for f in files:
    df = pd.read_csv(f, header=None)
    usernames[f] = set(df[1].values)  # second column, as in your sample csvs

# two loops for pairwise matching
for (i, file_i) in enumerate(sorted(usernames)):
    for (j, file_j) in enumerate(sorted(usernames)):
        # prevent recalculating a pair (and comparing a file with itself)
        if j > i:

            # set intersection
            intersect = usernames[file_i] & usernames[file_j]

            # just getting the custom string format you wanted:
            # single-quoted names, joined by hyphens
            # (sorted so the order is the same on every run)
            formatted_items = ["'{}'".format(item) for item in sorted(intersect)]
            formatted_string = '-'.join(formatted_items)

            # write new row of output
            newrow = [file_i[:-4],  # take out .csv extension from string
                      file_j[:-4],
                      len(intersect),  # score (names in common)
                      formatted_string]
            output.append(newrow)

# output csv
pd.DataFrame(output).to_csv('output.csv', index=False,
                            header=False, quoting=csv.QUOTE_ALL)
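If you would rather not depend on pandas, the same idea can be sketched with the standard library alone. This is only an illustration under some assumptions: it targets Python 3, `compare_folder` is a hypothetical helper name, it picks up every `*.csv` in the folder, and it reads the second column (index 1) by default, as in the sample files.

```python
import csv
from itertools import combinations
from pathlib import Path


def compare_folder(folder, column=1, out_name="output.csv"):
    """Write one row per pair of CSV files: name, name, count, shared names."""
    folder = Path(folder)
    out_path = folder / out_name

    # Read the chosen column of every CSV file into a set, keyed by file stem.
    usernames = {}
    for path in sorted(folder.glob("*.csv")):
        if path == out_path:
            continue  # skip the result of an earlier run
        with path.open(newline="") as handle:
            usernames[path.stem] = {
                row[column] for row in csv.reader(handle) if len(row) > column
            }

    # Each unordered pair of files is visited exactly once.
    rows = []
    for name_a, name_b in combinations(sorted(usernames), 2):
        shared = sorted(usernames[name_a] & usernames[name_b])
        formatted = "-".join("'{}'".format(name) for name in shared)
        rows.append([name_a, name_b, len(shared), formatted])

    with out_path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return rows
```

Calling `compare_folder("path/to/folder")` (the path is a placeholder) writes `output.csv` into that folder and also returns the rows, which makes the result easy to inspect. Pass `column=0` if the usernames really are in the first column.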
0

I'm not sure I fully understand what you want as output, but:

import csv
list_file = ['fileA.csv', 'fileB.csv', 'fileC.csv', 'fileD.csv', 'fileE.csv']
for i in range(len(list_file)):
    # to avoid repeated pairs, start the second loop from i + 1
    for j in range(i + 1, len(list_file)):
        # reopen both files for every pair: a csv reader can only be iterated once
        # ('rb' is the Python 2 idiom; on Python 3 use open(path, newline=''))
        reader_i = csv.reader(open(list_file[i], 'rb'))
        reader_j = csv.reader(open(list_file[j], 'rb'))
        for line_i, line_j in zip(reader_i, reader_j):
            if line_i[0] == line_j[0]:
                ...
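One thing to keep in mind with the `zip` loop above: `zip` pairs the rows by position, so it only finds a name when it sits on the same row number in both files. A small sketch with the username column of the question's two sample files (typed in by hand here, not read from disk) shows the difference from a set intersection:

```python
# Username column of the two sample files from the question, in file order.
file1_names = ["jamesGam", "samRichard", "CamalSarj", "JangiBell"]
file2_names = ["JangiBell", "jorgebel", "samRichard", "Estovagre"]

# zip() pairs row 1 with row 1, row 2 with row 2, and so on,
# so only names at the same position in both files can match.
positional = [a for a, b in zip(file1_names, file2_names) if a == b]

# A set intersection ignores position and finds every shared name.
shared = set(file1_names) & set(file2_names)

print(positional)
print(sorted(shared))
```

For the sample data the positional comparison finds nothing, while the intersection finds both `samRichard` and `JangiBell`.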

I hope this helps. Good luck.
