2016-05-18

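Deduplication in Python

While working through the examples for Dedupe, the Python library for record deduplication, I found that it adds a Cluster ID column to the output file. According to the documentation, this column indicates which records refer to each other. However, I can't see any relationship between the Cluster ID values, or how they help in finding duplicate records. If anyone has insight into this, please explain it to me. Here is the deduplication code.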

# This can run as either Python 2 or Python 3 code
from future.builtins import next 

import os 
import csv 
import re 
import logging 
import optparse 

import dedupe 
from unidecode import unidecode 


input_file = 'data/csv_example_input_with_true_ids.csv' 
output_file = 'data/csv_example_output1.csv' 
settings_file = 'data/csv_example_learned_settings' 
training_file = 'data/csv_example_training.json' 

# Clean or process the data 


def preProcess(column):
    try:
        column = column.decode('utf-8')
    except AttributeError:
        pass
    column = unidecode(column)
    column = re.sub(' +', ' ', column)
    column = re.sub('\n', ' ', column)
    column = column.strip().strip('"').strip("'").lower().strip()

    if not column:
        column = None
    return column


# Read in the data from CSV file: 


def readData(filename):
    data_d = {}
    with open(filename) as f:
        reader = csv.DictReader(f)
        for row in reader:
            clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
            row_id = int(row['Id'])
            data_d[row_id] = dict(clean_row)

    return data_d

print('importing data ...') 
data_d = readData(input_file) 

if os.path.exists(settings_file):
    print('reading from', settings_file)
    with open(settings_file, 'rb') as f:
        deduper = dedupe.StaticDedupe(f)
else:
    fields = [
        {'field': 'Site name', 'type': 'String'},
        {'field': 'Address', 'type': 'String'},
        {'field': 'Zip', 'type': 'Exact', 'has missing': True},
        {'field': 'Phone', 'type': 'String', 'has missing': True},
    ]
    deduper = dedupe.Dedupe(fields)
    deduper.sample(data_d, 15000)

    if os.path.exists(training_file):
        print('reading labeled examples from ', training_file)
        with open(training_file, 'rb') as f:
            deduper.readTraining(f)

    print('starting active labeling...')

    dedupe.consoleLabel(deduper)

    deduper.train()

    with open(training_file, 'w') as tf:
        deduper.writeTraining(tf)

    with open(settings_file, 'wb') as sf:
        deduper.writeSettings(sf)

threshold = deduper.threshold(data_d, recall_weight=1) 

print('clustering...') 
clustered_dupes = deduper.match(data_d, threshold) 

print('# duplicate sets', len(clustered_dupes)) 


cluster_membership = {} 
cluster_id = 0 
for (cluster_id, cluster) in enumerate(clustered_dupes):
    id_set, scores = cluster
    cluster_d = [data_d[c] for c in id_set]
    canonical_rep = dedupe.canonicalize(cluster_d)
    for record_id, score in zip(id_set, scores):
        cluster_membership[record_id] = {
            "cluster id": cluster_id,
            "canonical representation": canonical_rep,
            "confidence": score
        }

singleton_id = cluster_id + 1 

with open(output_file, 'w') as f_output, open(input_file) as f_input:
    writer = csv.writer(f_output)
    reader = csv.reader(f_input)

    heading_row = next(reader)
    heading_row.insert(0, 'confidence_score')
    heading_row.insert(0, 'Cluster ID')
    canonical_keys = canonical_rep.keys()
    for key in canonical_keys:
        heading_row.append('canonical_' + key)

    writer.writerow(heading_row)

    for row in reader:
        row_id = int(row[0])
        if row_id in cluster_membership:
            cluster_id = cluster_membership[row_id]["cluster id"]
            canonical_rep = cluster_membership[row_id]["canonical representation"]
            row.insert(0, cluster_membership[row_id]['confidence'])
            row.insert(0, cluster_id)
            for key in canonical_keys:
                row.append(canonical_rep[key].encode('utf8'))
        else:
            row.insert(0, None)
            row.insert(0, singleton_id)
            singleton_id += 1
            for key in canonical_keys:
                row.append(None)
        writer.writerow(row)

Thanks in advance.

Answer


You're right: the script itself doesn't use the Cluster ID for anything beyond writing it to the output file.

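You should think of the Cluster ID as the output of the deduplication run. Dedupe isn't interested in merging your records. Its core focus is identifying which records are likely to refer to the same thing.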

It does this by assigning the same Cluster ID to the rows it believes are the same.
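To make this concrete, here is a minimal sketch of how the loop in the question's script turns clusters into per-record IDs. The `clustered_dupes` value is made-up sample data, shaped like the `(id_set, scores)` pairs the script unpacks; the record IDs and scores are hypothetical, and `same_cluster` is a helper introduced only for illustration.

```python
# Hypothetical clusters, shaped like the (id_set, scores) pairs the script unpacks.
clustered_dupes = [
    ((1, 4), (0.97, 0.97)),            # records 1 and 4 are judged to be the same
    ((2, 5, 9), (0.88, 0.91, 0.85)),   # records 2, 5 and 9 form another cluster
]

cluster_membership = {}
for cluster_id, (id_set, scores) in enumerate(clustered_dupes):
    for record_id, score in zip(id_set, scores):
        cluster_membership[record_id] = {"cluster id": cluster_id, "confidence": score}


def same_cluster(a, b):
    """Illustrative helper: True when both records carry the same cluster id."""
    # Records absent from every cluster are singletons and match nothing.
    if a not in cluster_membership or b not in cluster_membership:
        return False
    return cluster_membership[a]["cluster id"] == cluster_membership[b]["cluster id"]


print(same_cluster(1, 4))  # True
print(same_cluster(1, 2))  # False
```

The numeric value of a Cluster ID carries no meaning on its own. What matters is only whether two rows share the same value.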

It is then your job, as the software engineer, to use that data intelligently and decide how to merge the records (if at all).
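As one possible approach (a sketch, not something Dedupe does for you), you could keep a single row per Cluster ID, preferring the row with the highest confidence score. The rows below are hypothetical and only reuse the column names the question's script writes; `keep_best_per_cluster` is a made-up helper name.

```python
# Hypothetical rows, using the column names the script writes to its output file.
rows = [
    {"Cluster ID": 0, "confidence_score": 0.97, "Id": 1, "Site name": "acme child care"},
    {"Cluster ID": 0, "confidence_score": 0.93, "Id": 4, "Site name": "acme childcare"},
    {"Cluster ID": 1, "confidence_score": None, "Id": 2, "Site name": "sunny day school"},
]


def keep_best_per_cluster(rows):
    """Return one row per Cluster ID, preferring the highest confidence score."""
    best = {}
    for row in rows:
        cid = row["Cluster ID"]
        # Singletons get no score in the script's output; treat that as 0.
        score = row["confidence_score"] or 0.0
        if cid not in best or score > (best[cid]["confidence_score"] or 0.0):
            best[cid] = row
    return [best[cid] for cid in sorted(best)]


merged = keep_best_per_cluster(rows)
print([row["Id"] for row in merged])  # [1, 2]
```

Whether "keep the best-scoring row" is the right merge rule depends on your data. You might instead combine fields across the rows of a cluster, or only flag clusters for manual review.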

If my input looked like this:

[Image: example input table (not preserved)]

My output would look something like this:

[Image: example output table with an added Cluster ID column (not preserved)]

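So remember: the number of records Dedupe outputs should always match the number of records you put in. The only difference is a new "Cluster ID" column, which you can now use to group likely duplicates.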
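A small sketch of that check and grouping, using in-memory CSV text in place of the real files. The sample rows are hypothetical; only the `Cluster ID`, `confidence_score`, `Id` and `Site name` column names come from the question's script.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input and output, standing in for the real CSV files.
input_csv = "Id,Site name\n1,acme child care\n2,sunny day school\n3,acme childcare\n"
output_csv = (
    "Cluster ID,confidence_score,Id,Site name\n"
    "0,0.97,1,acme child care\n"
    "1,,2,sunny day school\n"
    "0,0.97,3,acme childcare\n"
)

input_rows = list(csv.DictReader(io.StringIO(input_csv)))
output_rows = list(csv.DictReader(io.StringIO(output_csv)))

# The script labels rows; it does not drop any of them.
assert len(input_rows) == len(output_rows)

# Collect the record IDs that share each Cluster ID.
groups = defaultdict(list)
for row in output_rows:
    groups[row["Cluster ID"]].append(row["Id"])

# Only clusters with more than one member contain likely duplicates.
likely_duplicates = {cid: ids for cid, ids in groups.items() if len(ids) > 1}
print(likely_duplicates)  # {'0': ['1', '3']}
```

Note that `csv.DictReader` returns every value as a string, so the Cluster IDs and record IDs here are `'0'`, `'1'` and so on rather than integers.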