2016-11-23

MemoryError merging two dataframes with pandas and dask: how can I do this?

I have two dataframes in pandas. I want to merge them, but I keep running into a MemoryError. What workaround can I use?

Here is the setup:

import pandas as pd 

df1 = pd.read_csv("first1.csv") 
df2 = pd.read_csv("second2.csv") 
print(df1.shape) # output: (4757076, 4) 
print(df2.shape) # output: (428764, 45) 


df1.head() 

  column1  begin    end category 
0  class1  10001  10468    third 
1  class1  10469  11447    third 
2  class1  11505  11675   fourth 
3  class2  15265  15355  seventh 
4  class2  15798  15849   second 


df2.head() 

  column1  begin .... 
0  class1  10524 .... 
1  class1  10541 .... 
2  class1  10549 .... 
3  class1  10565 ... 
4  class1  10596 ... 

I just want to merge these two DataFrames on "column1". However, this always results in a MemoryError.

Let's try this in pandas first, on a system with about 2 TB of RAM and hundreds of threads:

import pandas as pd 
df1 = pd.read_csv("first1.csv") 
df2 = pd.read_csv("second2.csv") 
merged = pd.merge(df1, df2, on="column1", how="outer", suffixes=("","_repeated")) 

Here's the error I get:

Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge 
    return op.get_result() 
    File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result 
    join_index, left_indexer, right_indexer = self._get_join_info() 
    File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info 
    sort=self.sort, how=self.how) 
    File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers 
    return join_func(lkey, rkey, count, **kwargs) 
    File "pandas/src/join.pyx", line 160, in pandas.algos.full_outer_join (pandas/algos.c:61256) 
MemoryError 

That didn't work. Let's try with dask: 


import pandas as pd 
import dask.dataframe as dd 
from numpy import nan 


ddf1 = dd.from_pandas(df1, npartitions=2) 
ddf2 = dd.from_pandas(df2, npartitions=2) 

merged = dd.merge(ddf1, ddf2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60) 

Here's the error I get: 

Traceback (most recent call last): 
    File "repeat_finder.py", line 15, in <module> 
    merged = dd.merge(ddf1, ddf2,on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60) 
    File "/path/python3.5/site-packages/dask/base.py", line 78, in compute 
    return compute(self, **kwargs)[0] 
    File "/path/python3.5/site-packages/dask/base.py", line 178, in compute 
    results = get(dsk, keys, **kwargs) 
    File "/path/python3.5/site-packages/dask/threaded.py", line 69, in get 
    **kwargs) 
    File "/path/python3.5/site-packages/dask/async.py", line 502, in get_async 
    raise(remote_exception(res, tb)) 
dask.async.MemoryError: 

Traceback 
--------- 
    File "/path/python3.5/site-packages/dask/async.py", line 268, in execute_task 
    result = _execute_task(task, data) 
    File "/path/python3.5/site-packages/dask/async.py", line 249, in _execute_task 
    return func(*args2) 
    File "/path/python3.5/site-packages/dask/dataframe/methods.py", line 221, in merge 
    suffixes=suffixes, indicator=indicator) 
    File "/path/python3.5/site-packages/pandas/tools/merge.py", line 59, in merge 
    return op.get_result() 
    File "/path/python3.5/site-packages/pandas/tools/merge.py", line 503, in get_result 
    join_index, left_indexer, right_indexer = self._get_join_info() 
    File "/path/python3.5/site-packages/pandas/tools/merge.py", line 667, in _get_join_info 
    right_indexer) = self._get_join_indexers() 
    File "/path/python3.5/site-packages/pandas/tools/merge.py", line 647, in _get_join_indexers 
    how=self.how) 
    File "/path/python3.5/site-packages/pandas/tools/merge.py", line 876, in _get_join_indexers 
    return join_func(lkey, rkey, count, **kwargs) 
    File "pandas/src/join.pyx", line 226, in pandas._join.full_outer_join (pandas/src/join.c:11286) 
    File "pandas/src/join.pyx", line 231, in pandas._join._get_result_indexer (pandas/src/join.c:11474) 
    File "path/python3.5/site-packages/pandas/core/algorithms.py", line 1072, in take_nd 
    out = np.empty(out_shape, dtype=dtype, order='F') 

How can I get this to work, even if it is shamelessly inefficient?
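One workaround (a minimal sketch of my own, not from the thread) is to avoid holding both merged sides in memory at once: keep the smaller frame resident and stream the larger CSV through `pd.read_csv(..., chunksize=...)`, merging one chunk at a time. This sketch assumes an inner join is acceptable; a chunked outer join is harder, since unmatched rows on both sides must be tracked separately. Tiny in-memory CSVs stand in for `first1.csv` and `second2.csv` so the sketch runs as-is:

```python
import io
import pandas as pd

# Tiny stand-ins for first1.csv and second2.csv so the sketch is runnable.
csv1 = io.StringIO(
    "column1,begin,end,category\n"
    "class1,10001,10468,third\n"
    "class2,15265,15355,seventh\n"
)
csv2 = io.StringIO("column1,begin\nclass1,10524\nclass1,10541\n")

df2 = pd.read_csv(csv2)  # the smaller frame stays in memory

pieces = []
# Stream the larger file chunk by chunk; only one chunk is merged at a time.
for chunk in pd.read_csv(csv1, chunksize=1):
    pieces.append(chunk.merge(df2, on="column1", how="inner",
                              suffixes=("", "_repeat")))

merged = pd.concat(pieces, ignore_index=True)
print(merged.shape)  # → (2, 5): class1 matches twice, class2 not at all
```

With real files of this size you would pick a chunksize in the tens or hundreds of thousands of rows and append each merged piece to an output CSV instead of accumulating them in a list.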

EDIT: In response to the suggestion to merge on two columns/indexes, I don't think I can do that. Here is the code I am trying to run:

import pandas as pd 
import dask.dataframe as dd 

df1 = pd.read_csv("first1.csv") 
df2 = pd.read_csv("second2.csv") 

ddf1 = dd.from_pandas(df1, npartitions=2) 
ddf2 = dd.from_pandas(df2, npartitions=2) 

merged = dd.merge(ddf1, ddf2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60) 
merged = merged[(ddf1.column1 == row.column1) & (ddf2.begin >= ddf1.begin) & (ddf2.begin <= ddf1.end)] 
merged = dd.merge(ddf2, merged, on = ["column1"]).compute(num_workers=60) 
merged.to_csv("output.csv", index=False) 
+1

"about 2 TB of memory and hundreds of threads" – wowsers. First, are you on Linux? If so, check the ulimit and/or rlimit for that task. –
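Following up on that comment, here is a small sketch (my addition, using Python's standard `resource` module, which is available on Linux) to inspect the process's address-space limit; a low `RLIMIT_AS` can trigger a MemoryError long before physical RAM is exhausted:

```python
import resource

# RLIMIT_AS caps the total virtual address space of the process;
# a value of resource.RLIM_INFINITY means no limit is set.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("soft:", soft, "hard:", hard)
```

The same limits can be checked from a shell with `ulimit -a` before launching Python.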

+0

@BrianCain Good idea. Still, how can I do this? :) These dataframes aren't *that* big – EB2127

+0

OK... having read your edit, your approach looks wrong, IMHO. Please explain what you intend to do. It looks like you want to clip 'merged' down to a specific set of rows. What is in 'row'? I think you can solve this in a simpler way. – Kartik

Answer

-1

You can't just merge these two dataframes on column1 alone, because column1 is not a unique identifier for each row in either dataframe. Try:

merged = pd.merge(df1, df2, on=["column1", "begin"], how="outer", suffixes=("","_repeated")) 

If you also have an end column in df2, you may need to try:

merged = pd.merge(df1, df2, on=["column1", "begin", "end"], how="outer", suffixes=("","_repeated")) 
+0

This doesn't answer the OP's question. The OP wants an outer join on 'column1' and is getting a MemoryError. That '"column1"' is non-unique has no bearing on the merge or on the MemoryError. The OP may simply not have enough resources on the scheduling server. – Kartik

+0

In my own experience, I've run into similar 'MemoryError' problems when merging dataframes. A non-unique 'column1' avoids a MemoryError only when the data is not too large. Given the sample dataframes posted in the question, merging on 'column1' alone can make the merged dataframe grow explosively, which is most likely what causes the memory error. I think merging on multiple columns, rather than just 'column1', may be more reasonable in this case. – mikeqfu
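The combinatorial growth described in that comment is easy to demonstrate: if a key occurs m times in one frame and n times in the other, the merge emits m × n rows for that key. A self-contained sketch with synthetic data (not the OP's):

```python
import pandas as pd

# 1,000 repeats of the same key on each side.
df1 = pd.DataFrame({"column1": ["class1"] * 1000, "x": range(1000)})
df2 = pd.DataFrame({"column1": ["class1"] * 1000, "y": range(1000)})

# Every df1 row pairs with every df2 row that shares the key.
merged = pd.merge(df1, df2, on="column1", how="outer")
print(len(merged))  # → 1000000 rows from two 1,000-row inputs
```

With millions of rows per key-group, as in the OP's data, the intermediate result can dwarf both inputs combined.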

+0

Yes, and the OP is on a 2 TB RAM system... The frames the OP is handling would produce at most a 5185840 x 49 frame. That is nothing compared to 2 TB. My guess is that on a bare-bones OS the data could be merged on a 4 GB machine, and easily on an 8 GB one... – Kartik
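For scale, the back-of-the-envelope arithmetic behind that estimate (my own calculation, assuming 8-byte numeric values, which undercounts object/string columns):

```python
# Size of an outer join where every key matches exactly once:
rows = 4757076 + 428764        # worst-case row count for fully distinct keys
cols = 4 + 45                  # 49 columns, per the comment's 5185840 x 49
bytes_needed = rows * cols * 8 # assume 8 bytes per value
print(bytes_needed / 1e9)      # roughly 2 GB
```

The catch, as the other comments note, is that with a heavily repeated key the row count is governed by the product of per-key frequencies, not the sum of the input sizes, so the real intermediate can be far larger than this.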
