Sentiment analysis on distinct words

I am trying to do a sentiment analysis based on a dictionary of 7000 words. The code works in Python, but it matches all combinations instead of distinct words.

For example, the dictionary says enter and the text says enterprise. How can I change the code so that it does not treat this as a match?

import re
import string
import sys

dictfile = sys.argv[1]
textfile = sys.argv[2]

a = open(textfile) 
text = string.split(a.read()) 
a.close() 

a = open(dictfile) 
lines = a.readlines() 
a.close() 

dic = {} 
scores = {} 

current_category = "Default" 
scores[current_category] = 0 

for line in lines: 
    if line[0:2] == '>>': 
     current_category = string.strip(line[2:]) 
     scores[current_category] = 0 
    else: 
     line = line.strip() 
     if len(line) > 0: 
      pattern = re.compile(line, re.IGNORECASE) 
      dic[pattern] = current_category 

for token in text: 
    for pattern in dic.keys(): 
     if pattern.match(token): 
      categ = dic[pattern] 
      scores[categ] = scores[categ] + 1 

for key in scores.keys(): 
    print key, ":", scores[key]


2016-12-06 Guido

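If your dictionary contains *words*, why use re at all? Why not `if line == token`? –

Thank you for your reply, Robin Koch. The problem is that the dictionary comes from a separate file. We cannot include separate terms in the file; we are measuring sentiment. We are not doing a word count. Thanks in advance – Guido

I am still not sure what you are matching against. Can you provide some examples? If you are really doing `re.compile('enter').match('enterprise')`, you do not need regular expressions. If your dictionary actually contains regular expressions, then you should add that to the question. –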

.match() only matches at the beginning of the string. So you can anchor the end of the string in the regex, for example:

re.compile(line + '$')

Or you can use word boundaries:

re.compile(r'\b' + line + r'\b')
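
To make the difference concrete, here is a minimal sketch that compares an unanchored pattern with the two suggestions on the enter/enterprise example from the question. The token list is made up for illustration, and the use of `re.escape` is an added assumption that the dictionary holds plain words rather than regular expressions.

```python
import re

word = "enter"  # hypothetical dictionary entry
tokens = ["enter", "Enter", "enterprise", "center"]  # hypothetical text tokens

# Unanchored, as in the question: match() succeeds on any token that starts with the word.
prefix = re.compile(re.escape(word), re.IGNORECASE)
# Anchored at the end: the whole token has to be the word.
anchored = re.compile(re.escape(word) + "$", re.IGNORECASE)
# Word boundaries on both sides; note the raw strings, since '\b' alone is a backspace.
bounded = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)

prefix_hits = [t for t in tokens if prefix.match(t)]
anchored_hits = [t for t in tokens if anchored.match(t)]
bounded_hits = [t for t in tokens if bounded.match(t)]

print(prefix_hits)
print(anchored_hits)
print(bounded_hits)
```

With these sample tokens, only the unanchored pattern accepts "enterprise"; the other two accept just "enter" and "Enter".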


2016-12-06 13:09:10

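Your indentation is inconsistent. Some levels use 3 spaces, some use 4.
You try to match every word in the text against all 7000 words in the dictionary. Instead, just look the word up in the dictionary. If it is not there, ignore the error (EAFP principle).
Also, I am not sure there is any advantage to using the module-level function (string.split()) over the object method ("".split()).
Python also has a defaultdict; with defaultdict(int), missing keys are initialized to 0 automatically.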

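Edit:

Instead of .readlines() I use .read() and .split('\n'). This gets rid of the newline characters.

Splitting the text not on the default whitespace characters but on the regular expression '\W+' (everything that is not a "word character") is my attempt to get rid of punctuation.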
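
As a small illustration of both splits, here is a sketch on made-up input (the dictionary content and the sentence are hypothetical, not the asker's files). One thing to be aware of is that both splits can produce empty strings, which the dictionary lookup simply never finds.

```python
import re

raw_dictionary = ">>Joy\nhappy\n\n>>Fear\nafraid\n"  # hypothetical dictionary file content
raw_text = "Happy? No, I am afraid."  # hypothetical text

# split('\n') yields lines without trailing newlines, but keeps empty lines.
lines = raw_dictionary.split("\n")

# Splitting on runs of non-word characters drops punctuation; the final "."
# leaves one empty string at the end, which is filtered out here.
words = [w for w in re.split(r"\W+", raw_text) if w]

print(lines)
print(words)
```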

Below is the code I recommend:

import re
import sys
from collections import defaultdict

dictfile = sys.argv[1]
textfile = sys.argv[2]

with open(textfile) as f:
    text = f.read()

with open(dictfile) as f:
    lines = f.read()

categories = {}  # keyword -> category
scores = defaultdict(int)  # category -> number of hits

current_category = "Default"
scores[current_category] = 0

for line in lines.split('\n'):
    if line.startswith('>>'):
        # A line starting with '>>' opens a new category.
        current_category = line[2:].strip()
    else:
        keyword = line.strip()
        if keyword:
            categories[keyword] = current_category

# Split on runs of non-word characters to drop punctuation.
for word in re.split(r'\W+', text):
    try:
        scores[categories[word]] += 1
    except KeyError:
        # not in dictionary
        pass

for category in scores.keys():
    print("{}: {}".format(category, scores[category]))
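
The question compiled its patterns with re.IGNORECASE, while the direct lookup above is case-sensitive. A possible variant, sketched under the assumption that lower-casing both the dictionary and the text is acceptable, could look like this. The function names `parse_dictionary` and `score_text` and the sample data are made up for illustration.

```python
import re
from collections import defaultdict


def parse_dictionary(raw):
    """Map each keyword (lower-cased) to the category it is listed under."""
    categories = {}
    current_category = "Default"
    for line in raw.split("\n"):
        if line.startswith(">>"):
            current_category = line[2:].strip()
        elif line.strip():
            categories[line.strip().lower()] = current_category
    return categories


def score_text(text, categories):
    """Count whole-word dictionary hits per category."""
    scores = defaultdict(int)
    for word in re.split(r"\W+", text.lower()):
        category = categories.get(word)
        if category is not None:
            scores[category] += 1
    return dict(scores)


# Hypothetical sample data, not the asker's real files.
sample_dictionary = ">>Trust\nenter\ntrouwen\n>>Fear\nafraid\n"
sample_text = "The enterprise is afraid to Enter. Wantrouwen is not trouwen."
result = score_text(sample_text, parse_dictionary(sample_dictionary))
print(result)
```

In this sample, "enterprise" and "Wantrouwen" are not counted, because only whole words are looked up.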


2016-12-06 13:41:25

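Thanks for your code, Robin, but unfortunately it just counts all the words. Maybe I explained it the wrong way, so I will try to make it clear: - I have a .txt dictionary. - In this dictionary there are 8 different emotions, each with the words that relate to it. - I have a text that I want to check for the words in the dictionary. - The words in the text represent emotions. At the moment the code reports a match when a dictionary word appears as part of a word in the text (a partial match). The goal is that it only matches when the exact word from the dictionary appears. I hope I have made it clear now? – Guido

The code does exactly the same as yours. It counts the words and adds them up for each category. But I do not use `.match()`; I compare the words directly. - Please provide sample files that show the difference you think you are seeing. –

Here are the files. The text file contains 'wantrouwen'. When you run the script with a dictionary that contains only the word 'trouwen', it should not match. – Guido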
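
One way to check this case in isolation is a small sketch like the following. It uses only the two words from the comment and none of the real files, so it is an assumption that it reflects what the scripts see.

```python
import re

dictionary_word = "trouwen"
text_word = "wantrouwen"

pattern = re.compile(dictionary_word, re.IGNORECASE)

# The question's approach: match() only tries at the start of the token.
prefix_match = pattern.match(text_word) is not None
# search() looks anywhere in the token, so it reports a partial hit.
substring_match = pattern.search(text_word) is not None
# Direct comparison, as in the second answer.
exact_match = dictionary_word == text_word

print(prefix_match, substring_match, exact_match)
```

For this pair, neither `.match()` nor the direct comparison reports a hit; only a substring search does. If a run still counts 'wantrouwen', the hit may be coming from a different dictionary entry.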
