Pandas 'outer' merge of multiple csvs using too much memory

2017-08-26 · 62 views

I'm new to coding and have a lot of big data to work with. Currently I'm trying to merge 26 tsv files (each has two columns and no header: one is contig_number, the other a count).

If a tsv has no count for a particular contig_number, it has no row for it, so I'm trying to merge with how='outer' and then fill the missing values with 0.
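For illustration, a minimal sketch of what the outer merge plus fill does on two toy count tables (the file names and values below are made up):

import pandas as pd

# two toy count tables (made-up values): each maps contig -> count,
# and sample_b has no row for contig c2
sample_a = pd.DataFrame({"contig": ["c1", "c2"], "a_count.tsv": [5, 3]})
sample_b = pd.DataFrame({"contig": ["c1", "c3"], "b_count.tsv": [7, 2]})

# the outer merge keeps every contig and leaves NaN where a sample has no count
merged = pd.merge(sample_a, sample_b, how="outer", on="contig")

# fill the gaps with 0
merged = merged.fillna(0)
print(merged)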

This worked successfully on tsvs that I had subsetted for an initial test, but when I run the script on the actual data, which is large (about 40,000 rows, two columns per file), the memory usage keeps climbing...

It got up to 500 GB of RAM on the server, at which point I called it a day.

Here is the code that worked successfully on the subsetted csvs:

import glob
import logging
from functools import reduce

import pandas as pd

logging.basicConfig(level=logging.INFO)

files = glob.glob('*_count.tsv')
data_frames = []
logging.info("Reading in sample files and adding to list")
for fp in files:
    # read in each file as a dataframe, using the first column as the index
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the columns so we know which file they came from
    df = df.rename(columns={1: str(fp)}).reset_index()
    df = df.rename(columns={0: "contig"})
    # collect the dataframes in a list
    data_frames.append(df)

logging.info("Merging the tables on contig, filling in samples with no counts for contigs")

# merge the tables on contig; how='outer' keeps all rows but leaves
# NaN wherever a sample has no count for a contig
df = reduce(lambda left, right: pd.merge(left, right, how='outer', on="contig"), data_frames)

# fill the missing data with 0
df.fillna(0, inplace=True)

logging.info("Writing concatenated count table to file")

# write the combined table to file
df.to_csv("combined_bamm_filter_count_file.tsv",
          sep='\t', index=False, header=True)

I would appreciate any advice or suggestions! Perhaps there is just too much to hold in memory and I should be trying something else.

Thanks!

Answers

I would normally do these kinds of operations with pd.concat. I don't know the exact details of why it's more efficient, but pandas has some optimizations for aligning on indexes.

I would do:

data_frames = []
for fp in files:
    # read in each file, keeping the contig column as the index
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the column so we know which file it came from
    df = df.rename(columns={1: str(fp)})
    # just keep the contig as the index (no reset_index needed)
    data_frames.append(df)

# concatenate along the columns, aligning on the shared contig index
df_full = pd.concat(data_frames, axis=1)

Then df_full = df_full.fillna(0) if you want.

In fact, since each file has only one column (plus the index), you may do even better treating them as Series instead of DataFrames.
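A minimal sketch of that Series variant, assuming the same *_count.tsv files as above; taking the single data column with .iloc[:, 0] and naming each Series after its file are my choices, not spelled out in the original answer:

import glob

import pandas as pd

series_list = []
for fp in glob.glob('*_count.tsv'):
    # read the two-column file with contig as the index,
    # then take the single remaining column as a Series
    s = pd.read_csv(fp, sep='\t', header=None, index_col=0).iloc[:, 0]
    # name the Series after the file it came from
    s.name = str(fp)
    series_list.append(s)

# concat aligns all Series on the shared contig index
df_full = pd.concat(series_list, axis=1).fillna(0)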


Hello Victor, thank you! While trying pd.concat I realized that some of my files were parsed wrong and had duplicate indexes. I'll fix that and give concat a proper go! – Caitlin
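As a side note on the duplicate-index issue mentioned in that comment, a minimal sketch of one way to detect and collapse duplicates inside the read loop before concatenating (summing repeated contigs is an assumption; pick whatever aggregation fits the data):

# check whether any contig appears more than once in this frame's index
if df.index.duplicated().any():
    # collapsing by sum is an assumption about the data,
    # not something the original answer prescribes
    df = df.groupby(level=0).sum()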