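Calculating the probability of spam
I'm building a website in Python/Django and want to predict whether a user submission is valid or whether it is spam.
Users have an acceptance rate for their submissions, just like on this website.
Users can moderate other users' submissions, and these moderations are later meta-moderated by an admin.
Given this:
- Registered user A, whose submissions are accepted 60% of the time, submits something.
- User B moderates A's post as a valid submission. However, user B is wrong 70% of the time.
- User C moderates A's post as spam. User C is usually right: if user C says something is spam or not spam, that verdict is correct 80% of the time.
How can I predict the chance that A's post is spam?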
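One way to frame the question is Bayes' rule. The sketch below is only an illustration under two assumptions that the question does not state outright: A's 60% acceptance rate is used as the prior probability of ham, and B and C make their mistakes independently of each other once the true label is fixed. The function name `posterior_spam` and its argument layout are made up for this example.

```python
def posterior_spam(prior_ham, verdicts):
    """Return P(spam | verdicts), assuming moderators err independently.

    prior_ham: probability that a submission is ham before any moderation
               (assumption: the submitter's acceptance rate).
    verdicts:  list of (verdict, accuracy) pairs; verdict is 'ham' or 'spam',
               accuracy is the chance that this moderator is right.
    """
    weight_ham = prior_ham
    weight_spam = 1.0 - prior_ham
    for verdict, accuracy in verdicts:
        if verdict == 'ham':
            weight_ham *= accuracy          # moderator is right if it is ham
            weight_spam *= 1.0 - accuracy   # moderator is wrong if it is spam
        else:
            weight_ham *= 1.0 - accuracy    # moderator is wrong if it is ham
            weight_spam *= accuracy         # moderator is right if it is spam
    # Normalise so that the two hypotheses sum to 1.
    return weight_spam / (weight_ham + weight_spam)


# Scenario from the question: B (right 30% of the time) says ham,
# C (right 80% of the time) says spam.
p = posterior_spam(0.6, [('ham', 0.3), ('spam', 0.8)])
print("P(spam) = %.4f" % p)  # prints "P(spam) = 0.8615"
```

Under these assumptions the post comes out as spam with probability 0.224 / 0.26, about 86%. Note that a moderator who is wrong 70% of the time still carries information: B's "ham" verdict actually pushes the estimate toward spam.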
Edit: I wrote a Python script that simulates such a scenario:
    #!/usr/bin/env python3
    import random


    def submit(p):
        """Return 'ham' with (p*100)% probability, otherwise 'spam'."""
        return 'ham' if random.random() < p else 'spam'


    def moderate(p, ham_or_spam):
        """Moderate ham as ham and spam as spam with (p*100)% probability."""
        if ham_or_spam == 'spam':
            return 'spam' if random.random() < p else 'ham'
        if ham_or_spam == 'ham':
            return 'ham' if random.random() < p else 'spam'


    NUMBER_OF_SUBMISSIONS = 100000
    USER_A_HAM_RATIO = 0.6  # Will submit 60% ham
    USER_B_PRECISION = 0.3  # Will moderate a submission correctly 30% of the time
    USER_C_PRECISION = 0.8  # Will moderate a submission correctly 80% of the time

    user_a_submissions = [submit(USER_A_HAM_RATIO)
                          for i in range(NUMBER_OF_SUBMISSIONS)]
    print("User A has made %d submissions. %d of them are 'ham'."
          % (len(user_a_submissions), user_a_submissions.count('ham')))

    user_b_moderations = [moderate(USER_B_PRECISION, ham_or_spam)
                          for ham_or_spam in user_a_submissions]
    user_b_moderations_which_are_correct = \
        [i for i, j in zip(user_a_submissions, user_b_moderations) if i == j]
    print("User B has correctly moderated %d submissions."
          % len(user_b_moderations_which_are_correct))

    user_c_moderations = [moderate(USER_C_PRECISION, ham_or_spam)
                          for ham_or_spam in user_a_submissions]
    user_c_moderations_which_are_correct = \
        [i for i, j in zip(user_a_submissions, user_c_moderations) if i == j]
    print("User C has correctly moderated %d submissions."
          % len(user_c_moderations_which_are_correct))

    i = 0  # submissions where B said 'spam' and C said 'ham'
    j = 0  # ... of which were really spam
    k = 0  # ... of which were really ham
    for a, b, c in zip(user_a_submissions, user_b_moderations, user_c_moderations):
        if b == 'spam' and c == 'ham':
            i += 1
            if a == 'spam':
                j += 1
            elif a == "ham":
                k += 1

    print("'spam' was identified as 'spam' by user B and 'ham' by user C %d times." % j)
    print("'ham' was identified as 'spam' by user B and 'ham' by user C %d times." % k)
    print("If user B says it's spam and user C says it's ham, it will be spam "
          "%.2f percent of the time, and ham %.2f percent of the time."
          % (float(j) / i * 100, float(k) / i * 100))
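Running the script gives me this output:
- User A has made 100000 submissions. 60194 of them are 'ham'.
- User B has correctly moderated 29864 submissions.
- User C has correctly moderated 79990 submissions.
- 'spam' was identified as 'spam' by user B and 'ham' by user C 2346 times.
- 'ham' was identified as 'spam' by user B and 'ham' by user C 33634 times.
- If user B says it's spam and user C says it's ham, it will be spam 6.52 percent of the time, and ham 93.48 percent of the time.
Are the probabilities here reasonable? Is this the correct way to simulate the scenario?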
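As a sanity check on the simulated figures, the same quantity can be worked out by hand with Bayes' rule. This is a sketch under the same assumption the script makes, namely that B and C err independently once the true label is fixed. The symbols are introduced here for illustration: S and H mean the post is spam or ham, b_s means B says spam, and c_h means C says ham.

```latex
\begin{align*}
P(S \mid b_s, c_h)
  &= \frac{P(S)\,P(b_s \mid S)\,P(c_h \mid S)}
          {P(S)\,P(b_s \mid S)\,P(c_h \mid S) + P(H)\,P(b_s \mid H)\,P(c_h \mid H)} \\
  &= \frac{0.4 \cdot 0.3 \cdot 0.2}{0.4 \cdot 0.3 \cdot 0.2 + 0.6 \cdot 0.7 \cdot 0.8}
   = \frac{0.024}{0.360} \approx 0.0667
\end{align*}
```

Over 100000 submissions this predicts about 2400 spam and 33600 ham posts in the "B says spam, C says ham" bucket, close to the simulated 2346 and 33634. The simulated 6.52% differs from the analytic 6.67% by an amount that is plausible as sampling noise.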
14.4% it's not spam ;-) – Boldewyn 2010-06-07 15:12:52