Filtering data in Python

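I'm working on a web scraper in Python that collects information about the posts users have made on a site and compares their scores on the posts that all of the supplied users took part in. It is currently structured so that I end up with the following data:

results is a dictionary indexed by username; each value is a dictionary of that user's history in a post: points key/value structure.

common is a list that starts out as all of the posts in the dictionary of the first user in results. This list should be filtered down to only the posts that all users have in common.

points is a dictionary indexed by username that keeps the running total of points on the shared posts.

My filtering code is as follows: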

common = list(results.values()[0].keys())

for user in results:
    for post_hash in common:
        if post_hash not in results[user]:
            common.remove(post_hash)
        else:
            points[user] += results[user][post_hash]

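The problem I'm running into is that this doesn't actually filter out the posts that aren't shared, so it doesn't give accurate point values.

What am I doing wrong with my structure, and is there a simpler way to find the common posts?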

Source

2016-04-14 David Renick

Could you post a copy of your data structure so we can take a look? A small sample with just two users plus posts and scores would be fine. – miah

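I think you may have two problems:

Using a list for common means that when you remove an item via common.remove, it only removes the first matching item it finds (there could be more).
You aren't adding points only for posts shared by all users. You add points for a user as you encounter each post, before you know whether that post is shared by everyone or not.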
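A related effect is easy to reproduce: removing items from a list while looping over that same list makes the loop skip the element right after each removal. The sketch below uses made-up post hashes and two hypothetical helper functions (neither is from the question) to show how an unshared post can survive the filter, and how looping over a copy avoids it.

```python
def filter_in_place(common, user_posts):
    """Mimics the question's loop: mutates the list it is iterating over."""
    for post_hash in common:
        if post_hash not in user_posts:
            # The remaining items shift left, so the next one is skipped.
            common.remove(post_hash)
    return common


def filter_over_copy(common, user_posts):
    """Same filter, but iterates over a snapshot of the list."""
    for post_hash in list(common):
        if post_hash not in user_posts:
            common.remove(post_hash)
    return common


# Made-up data: the second user only has post "p3".
second_user_posts = {"p3": 4}

print(filter_in_place(["p1", "p2", "p3"], second_user_posts))   # ['p2', 'p3']
print(filter_over_copy(["p1", "p2", "p3"], second_user_posts))  # ['p3']
```

In the first call "p2" is never checked, because removing "p1" moved it into the slot the loop had already passed.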

Without some actual data to play with, it's a bit difficult to write working code, but try this:

# this should give us a set of posts shared by all users
common = set.intersection(*[set(k.keys()) for k in results.values()])

# there's probably a more efficient (functional) way of summing the points
# by user instead of looping, but simple is good.
for user in results:
    for post_hash in common:
        points[user] += results[user][post_hash]
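For anyone who wants to try this without real scraper output, here is a minimal self-contained sketch. The usernames, post hashes and point values are invented for illustration, and points is initialised to zero for every user, which the snippet above assumes has already happened.

```python
# Hypothetical sample data: username -> {post hash: points}.
results = {
    "alice": {"p1": 5, "p2": 3, "p3": 1},
    "bob": {"p2": 7, "p3": 4, "p4": 2},
}

# Posts that appear in every user's history.
common = set.intersection(*(set(posts) for posts in results.values()))

# Start each user at zero so that += has something to add to.
points = {user: 0 for user in results}
for user, posts in results.items():
    for post_hash in common:
        points[user] += posts[post_hash]

print(sorted(common))  # ['p2', 'p3']
print(points)          # {'alice': 4, 'bob': 11}
```

Note that set.intersection(*...) raises a TypeError when results is empty, so a real script would probably want to check for that case first.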

Source

2016-04-14 04:37:02 Gerrat

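This works great! Using a list for common shouldn't have been a problem, because I total the points for each post when I first process each user's data, but the set intersection works better and is more efficient. Thanks!

@Neomang Yes, it works, but it isn't very efficient. There is a list comprehension in the evaluation of `common`, so *all* of your posts will be in memory at the same time anyway.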

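import functools

# Generator of key views, so no intermediate list is built.
# Note: & works on Python 3 dict key views; in Python 2, keys() returns
# a list, so use v.viewkeys() there instead.
iterable = (v.keys() for v in results.values())
common = functools.reduce(lambda x, y: x & y, iterable)
points = {user: sum(posts[post] for post in common) for user, posts in results.items()}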

See if this works.

Source

2016-04-14 04:12:24

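Unfortunately, no. When I use it, I get the following error: 'common = list(fn.reduce(op.or_, iterable)) TypeError: unsupported operand type(s) for |: 'list' and 'list''

@Neomang Just edited it. If it works, it will be very efficient, because there are no intermediate data structures.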
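The TypeError in the comment above comes from applying a set operator to plain lists, which is what keys() returns in Python 2. Also note that | would compute a union, while the intersection needs &. A minimal sketch that sidesteps both issues, assuming made-up sample data and converting each user's keys to a set before reducing:

```python
import functools
import operator

# Hypothetical sample data: username -> {post hash: points}.
results = {
    "alice": {"p1": 5, "p2": 3, "p3": 1},
    "bob": {"p2": 7, "p3": 4, "p4": 2},
    "carol": {"p3": 10, "p5": 6},
}

# Lists do not implement | (or &), which reproduces the reported error.
try:
    operator.or_(["p1"], ["p2"])
except TypeError as exc:
    list_error = str(exc)

# A generator of sets: each set is built only when reduce asks for it.
key_sets = (set(posts) for posts in results.values())
common = functools.reduce(operator.and_, key_sets)

points = {user: sum(posts[p] for p in common) for user, posts in results.items()}

print(common)  # {'p3'}
print(points)  # {'alice': 1, 'bob': 4, 'carol': 10}
```

This does build one temporary set per user, so it trades a little of the "no intermediate data structures" property for working the same way whether keys() returns a list or a view.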

from collections import Counter
from functools import reduce

posts = []
# Build a list of every post hash across all users
for p in results.values():
    posts.extend(p.keys())

# Counter gives a dictionary-like object where the key is the post hash
# and the value is the number of users who have that post
posts = Counter(posts)
n_users = len(results)
for user in results:
    # Sum the points of only those posts that every user shares.
    points[user] = reduce(
        lambda x, y: x + y,
        (pts for post, pts in results[user].items() if posts[post] == n_users),
        0,
    )

Source

2016-04-14 04:41:07 miah
