Pandas 'outer' merge of multiple csvs using too much memory

I'm new to coding and have a lot of big data to work with. Currently I am trying to merge 26 tsv files (each has two columns with no header: one is contig_number, the other is a count). If a tsv does not have a count for a specific contig_number, it does not have that row - so I am attempting to use how='outer' and then fill the missing values with 0. I have successfully done this for the tsvs which I subset to run an initial test, but when I run the script on the actual data, which is large (~40,000 rows, two columns), more and more memory is used... I got up to 500GB of RAM on the server and called it a day.
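To make the intended behaviour concrete, here is a tiny sketch of the outer-merge-then-fill pattern (the toy frames and the column names s1/s2 are made up for illustration):

import pandas as pd

# two toy count tables; contig "c" is missing from the second sample
left = pd.DataFrame({"contig": ["a", "b", "c"], "s1": [5, 3, 2]})
right = pd.DataFrame({"contig": ["a", "b"], "s2": [7, 1]})

# the outer merge keeps all contigs, leaving NaN where a sample has no count
merged = pd.merge(left, right, how="outer", on="contig").fillna(0)
print(merged)
#   contig  s1   s2
# 0      a   5  7.0
# 1      b   3  1.0
# 2      c   2  0.0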
This is the code that worked on the subset csvs:
from functools import reduce
import glob
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

files = glob.glob('*_count.tsv')
data_frames = []
logging.info("Reading in sample files and adding to list")
for fp in files:
    # read each file into a two-column dataframe indexed by contig
    df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
    # rename the count column so we know which file it came from
    df = df.rename(columns={1: str(fp)}).reset_index()
    df = df.rename(columns={0: "contig"})
    # collect the dataframes in a list
    data_frames.append(df)
logging.info("Merging the tables on contig, and filling in samples with no counts for contigs")
# merge the tables on contig with how='outer', which keeps all rows but leaves NaN where a sample has no count
df = reduce(lambda left, right: pd.merge(left, right, how='outer', on="contig"), data_frames)
# this bit is important to fill the missing counts with 0
df.fillna(0, inplace=True)
logging.info("Writing concatenated count table to file")
# write the combined dataframe to file
df.to_csv("combined_bamm_filter_count_file.tsv",
          sep='\t', index=False, header=True)
I would appreciate any thoughts or suggestions! Perhaps there is just too much to hold in memory and I should be trying something else.

Thanks!
Hello Victor, thanks! While trying pd.concat I realised that some of the files had parsed wrongly and had duplicate indices. I will fix that and give concat a proper go! – Caitlin
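For reference, a minimal sketch of the concat-based approach mentioned in the comment, assuming each *_count.tsv parses into a contig index plus a single count column (the deduplication line simply keeps the first row for any repeated contig):

import glob

import pandas as pd

frames = []
for fp in glob.glob('*_count.tsv'):
    # read each file as a Series of counts indexed by contig
    s = pd.read_csv(fp, sep='\t', header=None, index_col=0).squeeze('columns')
    # drop duplicate contig indices (the parsing issue noted above)
    s = s[~s.index.duplicated()]
    s.name = str(fp)
    frames.append(s)

# a single concat aligns every sample on the union of contigs; missing counts become 0
combined = pd.concat(frames, axis=1).fillna(0)
combined.to_csv('combined_bamm_filter_count_file.tsv', sep='\t', index_label='contig')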