I'm stuck on the following: read_hdf fails with 'all of the variable refrences must be a reference to an axis ...'
log_iter = pd.read_hdf(FN, dspath,
                       where=[pd.Term('hashID', '=', idList)],
                       iterator=True,
                       chunksize=3000)
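The dataset at dspath has 35 columns and is large enough that reading it in one go can raise a MemoryError.
So I tried the iterator/chunksize route instead, but the 'where=' clause fails: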
ValueError: The passed where expression: [hashID=[147685,...,147197]]
contains an invalid variable reference
all of the variable refrences must be a reference to
an axis (e.g. 'index' or 'columns'), or a data_column
The currently defined references are: ** list of column names **
The problem is that hashID is not in that list of column names. However, if I do
read_hdf(FN, dspath).columns
hashID is among the columns. Any suggestions? My goal is to read all rows (x 35 columns) whose hashID is in idList.
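The error text points at a likely cause: a where expression may only reference an axis or a data_column, and a column that was not declared as a data column when the table was written cannot be queried, even though it shows up in .columns after the data is read. Below is a minimal sketch, assuming the file can be rewritten in table format with hashID declared as a data column. The file name, the key "logs", the small DataFrame and id_list are hypothetical stand-ins, and a plain string expression is used for where instead of pd.Term.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in data; the real dataset has 35 columns.
df = pd.DataFrame({
    "hashID": [147685, 147197, 100001, 147685, 100002],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})
id_list = [147685, 147197]

# Hypothetical file location, used only for this sketch.
fn = os.path.join(tempfile.mkdtemp(), "example.h5")

# format="table" makes the dataset queryable on disk, and data_columns
# names the columns that a where expression is allowed to reference.
df.to_hdf(fn, key="logs", format="table", data_columns=["hashID"])

# Build the where expression as a string, e.g. "hashID in [147685, 147197]".
where = "hashID in {}".format(id_list)

# Read only the matching rows, chunk by chunk, then combine them once.
chunks = pd.read_hdf(fn, "logs", where=where, iterator=True, chunksize=2)
selected = pd.concat(list(chunks), ignore_index=True)
print(selected)
```

This needs PyTables installed, and it only helps if the dataset can be rewritten; for an existing file written without data_columns, filtering each chunk after reading it is the fallback.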
Update. The following works, and it shows that hashID does exist as a column once the dataset has been read.
def dsIterator(self, q, idList):
    hID = u'hashID'
    FN = self.db._hdf_FN()
    dspath = self.getdatasetname(q)
    log_iter = pd.read_hdf(FN, dspath,
                           #where = [pd.Term(u'logid_hashID','=',idList)],
                           iterator=True,
                           chunksize=30000)
    n_all = 0
    retDF = None
    for dfChunk in log_iter:
        goodChunk = dfChunk.loc[dfChunk[hID].isin(idList)]
        if retDF is None:
            retDF = goodChunk
        else:
            retDF = pd.concat([retDF, goodChunk], ignore_index=True)
        n_all += dfChunk[hID].count()
    n_ret = retDF[hID].count()
    return retDF
Note that I am using Python 2, so the column name 'hashID' has to be given as u'hashID'. – frankr6591
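The chunk-by-chunk filtering in the update can also be sketched as a small standalone helper that collects the matching pieces in a list and concatenates once at the end, so the result is not rebuilt on every chunk. The name filter_chunks and the demo data are hypothetical; any iterable of DataFrames will do, including the iterator returned by pd.read_hdf(..., iterator=True, chunksize=...).

```python
import pandas as pd


def filter_chunks(chunks, column, wanted):
    """Keep rows whose `column` value is in `wanted`, one chunk at a time."""
    wanted = set(wanted)
    # Filter each chunk separately so only matching rows are held in memory.
    kept = [chunk[chunk[column].isin(wanted)] for chunk in chunks]
    if not kept:
        # No chunks at all: return an empty frame instead of failing.
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)


# Hypothetical demo data standing in for the chunks read from the HDF file.
demo_chunks = [
    pd.DataFrame({"hashID": [1, 2, 3], "value": [10, 20, 30]}),
    pd.DataFrame({"hashID": [4, 2, 5], "value": [40, 50, 60]}),
]
result = filter_chunks(demo_chunks, "hashID", [2, 5])
print(result)
```

With the names from the update, the call would look like filter_chunks(log_iter, hID, idList).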