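Efficient way to verify that records are unique in Python / PyTables

I have a table in PyTables with about 500,000 records. The combination of two fields (specifically userID and date) should be unique, i.e. a user should have at most one record per day, but I need to verify that this is actually the case.

To illustrate, my table looks like this: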

userID | date 
A  | 1 
A  | 2 
B  | 1 
B  | 2 
B  | 2 <- bad! Problem with the data!

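Other details:

The table is currently 'mostly' sorted.
I can just barely pull one column into memory as a numpy array, but I cannot hold two columns in memory at the same time.
Both userID and date are integers.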


2009-08-22 nazca

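It seems that indexes in PyTables are limited to single columns.

I would suggest adding a hash column and putting an index on it. Your unique data is defined as the concatenation of the other columns in the DB. A separator between the values ensures that no two different rows produce the same unique data. The hash column could simply hold this unique string, but if your data is long you will want to run it through a hash function. A fast hash function such as md5 or sha1 is well suited to this application.

Compute the hash of the data and check whether it is already in the DB. If it is, you know you have hit duplicate data. If it is not, you can safely add it.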
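The hash-column idea above can be sketched in plain Python. This is only an illustration, assuming an in-memory `set` stands in for the indexed hash column; the names `row_key` and `find_duplicate_rows` and the `|` separator are made up for the example and are not part of PyTables.

```python
import hashlib


def row_key(user_id, date, sep="|"):
    """Build the unique string for a row and hash it with sha1.

    The separator keeps (1, 23) and (12, 3) from both collapsing to "123".
    """
    raw = f"{user_id}{sep}{date}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


def find_duplicate_rows(rows):
    """Return every (userID, date) pair whose hash was already seen."""
    seen = set()       # stand-in for the indexed hash column
    duplicates = []
    for user_id, date in rows:
        key = row_key(user_id, date)
        if key in seen:
            duplicates.append((user_id, date))
        else:
            seen.add(key)
    return duplicates


rows = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 2)]
print(find_duplicate_rows(rows))
```

With two small integer fields the unhashed string would serve just as well; hashing mainly pays off when the concatenated key is long.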


2009-08-22 16:36:40

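I don't know much about PyTables, but I would try this approach:

For each userID, get all (userID, date) pairs.
assert len(rows) == len(set(rows)) - this assertion holds only if all the (userID, date) tuples in the rows list are unique.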
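The two steps above might look like this in plain Python. It is a sketch under assumptions: the pairs arrive as an ordinary iterable rather than from a PyTables query, and `users_with_duplicates` is a hypothetical helper name. The sample data is the table from the question.

```python
def users_with_duplicates(pairs):
    """Return the userIDs whose (userID, date) rows are not all unique."""
    rows_by_user = {}
    for user_id, date in pairs:
        rows_by_user.setdefault(user_id, []).append((user_id, date))

    bad_users = []
    for user_id, rows in rows_by_user.items():
        # Same comparison as the assert above, but collecting the
        # offenders instead of stopping at the first failure.
        if len(rows) != len(set(rows)):
            bad_users.append(user_id)
    return bad_users


sample = [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("B", 2)]
print(users_with_duplicates(sample))  # ['B']
```

Against a real table, the rows for one user would be fetched at a time, so only that user's pairs need to be in memory.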


2009-08-22 14:05:16

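Years later, I still have this same problem, but with indexing and querying it is only slightly painful, depending on the size of your table. Using readWhere or getWhereList (read_where and get_where_list in PyTables 3.x), I think the problem is approximately O(n).

Here is what I did. First, I created a table with two indices; you can use multiple indices in PyTables: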

http://pytables.github.com/usersguide/optimization.html#indexed-searches

Once your table is indexed (I also use LZO compression), you can do the following:

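import tables

# PyTables 3.x names; the 2.x equivalents were openFile, getNode and readWhere.
h5f = tables.open_file('filename.h5')
tbl = h5f.get_node('/data', 'data_table')  # assumes group "data" and table "data_table"
counter = 0

for row in tbl:
    ts = row['date']  # timestamp (ts) or date
    uid = row['userID']
    # Both fields are integers in the question's table, hence %d for each.
    query = '(date == %d) & (userID == %d)' % (ts, uid)
    result = tbl.read_where(query)
    if len(result) > 1:
        # Duplicate (userID, date) pair: do something here
        pass
    counter += 1
    if counter % 1000 == 0:
        print('%d rows processed' % counter)

h5f.close()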

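Now, the code I have written here is actually quite slow. I am sure some PyTables expert can give you a better answer. But here are my thoughts on performance:

If you know you are starting with clean data (i.e. no duplicates), then all you need to do is query the table once for the key pair you are interested in, which means you only need to do: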

ts = row['date']  # timestamp (ts) or date
uid = row['userID']
query = '(date == %d) & (userID == %d)' % (ts, uid)
# Row indices matching the key pair (getWhereList in PyTables 2.x)
result = tbl.get_where_list(query)
if len(result) == 0:
    # key pair is not in table
    # do what you were going to do
    pass
else:
    # Key pair already exists: do something here, like get a handle
    # to the row and update instead of append.
    pass
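The check-before-append logic above can also be illustrated without PyTables. A minimal sketch, assuming a plain dict keyed by (userID, date) stands in for the indexed table; `append_or_update` and `store` are hypothetical names used only for this example.

```python
def append_or_update(store, user_id, date, value):
    """Insert a record, or overwrite the existing one for the same key pair.

    `store` is a dict keyed by (userID, date). It is only a stand-in for
    the indexed table, so the membership test plays the role of the query.
    """
    key = (user_id, date)
    action = "updated" if key in store else "appended"
    store[key] = value
    return action


store = {}
print(append_or_update(store, 1, 20090822, "first"))   # appended
print(append_or_update(store, 1, 20090822, "second"))  # updated
print(len(store))                                       # 1
```

Because every write goes through the key check, the store can never hold two records for the same user and date, which is the invariant the question wants to verify.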

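If you have plenty of time for duplicate checking, you could create a background process that crawls the directory containing your files and searches for duplicates.

I hope this helps someone.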


2012-07-03 00:07:29
