Pandas 'outer' merge of multiple csvs using too much memory

2017-08-26 · 62 views

I'm new to coding and have a lot of big data to work with. Currently I'm trying to merge 26 tsv files (each has two columns and no header: one is contig_number, the other a count).

If a tsv has no count for a particular contig_number, it has no row for it, so I'm trying to merge with how='outer' and then fill the missing values with 0.
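For illustration, a minimal sketch of what the outer merge plus fill does on two toy count tables (the file names and values below are made up):

import pandas as pd

# two toy count tables (made-up values): each maps contig -> count,
# and sample_b has no row for contig c2
sample_a = pd.DataFrame({"contig": ["c1", "c2"], "a_count.tsv": [5, 3]})
sample_b = pd.DataFrame({"contig": ["c1", "c3"], "b_count.tsv": [7, 2]})

# the outer merge keeps every contig and leaves NaN where a sample has no count
merged = pd.merge(sample_a, sample_b, how="outer", on="contig")

# fill the gaps with 0
merged = merged.fillna(0)
print(merged)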

This worked successfully on tsvs that I had subsetted for an initial test, but when I run the script on the actual data, which is large (about 40,000 rows, two columns per file), the memory usage keeps climbing...

It got up to 500 GB of RAM on the server, at which point I called it a day.

Here is the code that worked successfully on the subsetted csvs:

import glob
import logging
from functools import reduce

import pandas as pd

logging.basicConfig(level=logging.INFO)

files = glob.glob('*_count.tsv')
data_frames = []
logging.info("Reading in sample files and adding to list")
for fp in files:
    # read in each file as a dataframe, using the first column as the index
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know which file they came from
    df = df.rename(columns={1: str(fp)}).reset_index()
    df = df.rename(columns={0: "contig"})
    # collect the dataframes in a list
    data_frames.append(df)

logging.info("Merging the tables on contig, filling in samples with no counts for contigs")

# merge the tables on contig; how='outer' keeps all rows but leaves
# NaN wherever a sample has no count for a contig
df = reduce(lambda left, right: pd.merge(left, right, how='outer', on="contig"), data_frames)

# fill the missing data with 0
df.fillna(0, inplace=True)

logging.info("Writing concatenated count table to file")

# write the combined table to file
df.to_csv("combined_bamm_filter_count_file.tsv",
          sep='\t', index=False, header=True)

I would appreciate any advice or suggestions! Perhaps there is just too much to hold in memory and I should be trying something else.

Thanks!

Answers

I would normally do these kinds of operations with pd.concat. I don't know the exact details of why it's more efficient, but pandas has some optimizations for aligning on indexes.

I would do:

data_frames = []
for fp in files:
    # read in each file, keeping the contig column as the index
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the column so we know which file it came from
    df = df.rename(columns={1: str(fp)})
    # just keep the contig as the index (no reset_index needed)
    data_frames.append(df)

# concatenate along the columns, aligning on the shared contig index
df_full = pd.concat(data_frames, axis=1)

Then df_full = df_full.fillna(0) if you want.

In fact, since each file has only one column (plus the index), you may do even better treating them as Series instead of DataFrames.
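A minimal sketch of that Series variant, assuming the same *_count.tsv files as above; taking the single data column with .iloc[:, 0] and naming each Series after its file are my choices, not spelled out in the original answer:

import glob

import pandas as pd

series_list = []
for fp in glob.glob('*_count.tsv'):
    # read the two-column file with contig as the index,
    # then take the single remaining column as a Series
    s = pd.read_csv(fp, sep='\t', header=None, index_col=0).iloc[:, 0]
    # name the Series after the file it came from
    s.name = str(fp)
    series_list.append(s)

# concat aligns all Series on the shared contig index
df_full = pd.concat(series_list, axis=1).fillna(0)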


Hello Victor, thank you! While trying pd.concat I realized that some of my files were parsed wrong and had duplicate indexes. I'll fix that and give concat a proper go! – Caitlin
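As a side note on the duplicate-index issue mentioned in that comment, a minimal sketch of one way to detect and collapse duplicates inside the read loop before concatenating (summing repeated contigs is an assumption; pick whatever aggregation fits the data):

# check whether any contig appears more than once in this frame's index
if df.index.duplicated().any():
    # collapsing by sum is an assumption about the data,
    # not something the original answer prescribes
    df = df.groupby(level=0).sum()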