2016-03-03 52 views

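Update dictionary values with the next word in a file?

I want to read through a file and create a dictionary in which each word is a key and the words that follow it in the file are the value.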

For example, if I have a file containing:

'Cake is cake okay.' 

the dictionary created should contain:

{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []} 

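So far, my code manages to do the opposite: it updates each dictionary value with the previous word in the file. I'm not sure how to change it so that it works as intended.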

def create_dict(file): 

    word_dict = {} 
    prev_word = '' 

    for line in file: 

        for word in line.lower().split(): 
            clean_word = word.strip(string.punctuation) 

            if clean_word not in word_dict: 
                word_dict[clean_word] = [] 

            word_dict[clean_word].append(prev_word) 
            prev_word = clean_word 

Thanks in advance for your help!

Edit

Latest progress:

def create_dict(file): 
    word_dict = {} 
    next_word = '' 

    for line in file: 
        formatted_line = line.lower().split() 

        for word in formatted_line: 
            clean_word = word.strip(string.punctuation) 

            if next_word != '': 
                if next_word not in word_dict: 
                    word_dict[next_word] = [] 

            if clean_word == '': 
                clean_word. 

            next_word = clean_word 
    return word_dict 

Answers

1

You can use itertools.zip_longest() and dict.setdefault() for a shorter solution:

import io 
from itertools import zip_longest # izip_longest in Python 2 
import string 

def create_dict(fobj): 
    word_dict = {} 
    punc = string.punctuation 
    for line in fobj: 
        clean_words = [word.strip(punc) for word in line.lower().split()] 
        for word, next_word in zip_longest(clean_words, clean_words[1:]): 
            words = word_dict.setdefault(word, []) 
            if next_word is not None: 
                words.append(next_word) 
    return word_dict 

Test:

>>> fobj = io.StringIO("""Cake is cake okay.""") 
>>> create_dict(fobj) 
{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []} 
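
Note that the function above pairs words within each line separately, so the last word of one line is never linked to the first word of the next line. If pairs should also span line breaks, one possible variant is sketched below. The name create_dict_across_lines is hypothetical and not part of the original answer; the sketch assumes the whole file fits comfortably in memory.

```python
import io
import string
from itertools import zip_longest


def create_dict_across_lines(fobj):
    """Map each word to the words that follow it, ignoring line breaks."""
    punc = string.punctuation
    # Collect every cleaned word first so that pairs can cross line boundaries.
    words = [word.strip(punc) for line in fobj for word in line.lower().split()]
    word_dict = {}
    for word, next_word in zip_longest(words, words[1:]):
        followers = word_dict.setdefault(word, [])
        # zip_longest pads the final pair with None: the last word has no successor.
        if next_word is not None:
            followers.append(next_word)
    return word_dict


if __name__ == "__main__":
    print(create_dict_across_lines(io.StringIO("Cake is\ncake okay.")))
```

With this input, 'is' is followed by 'cake' even though the two words sit on different lines; the per-line version would leave the list for 'is' empty.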
0

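Separate the code that generates words from the given file (splitting on whitespace, case folding, stripping punctuation, etc.) from the code that creates the bigram dictionary (the subject of this question):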

#!/usr/bin/env python3 
from collections import defaultdict 
from itertools import tee 

def create_bigram_dict(words): 
    a, b = tee(words) # itertools' pairwise recipe 
    next(b) 
    bigrams = defaultdict(list) 
    for word, next_word in zip(a, b): 
        bigrams[word].append(next_word) 
    bigrams[next_word] # last word may have no following words 
    return bigrams 

See itertools' pairwise() recipe. To support files with fewer than two words, the code needs a slight adjustment. If you need the exact type, you can write return dict(bigrams) here. For example:

>>> create_bigram_dict('cake is cake okay'.split()) 
defaultdict(<class 'list'>, {'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}) 
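
The adjustment for inputs with fewer than two words is not spelled out in this answer. One way it might look is sketched below: instead of the tee-based pairwise recipe, it remembers the previous word while iterating, so an empty or one-word input needs no special case. The name create_bigram_dict_safe is hypothetical, and returning a plain dict is a choice made for this sketch.

```python
from collections import defaultdict


def create_bigram_dict_safe(words):
    """Like create_bigram_dict, but also accepts zero or one word."""
    bigrams = defaultdict(list)
    prev_word = None  # None means "no previous word seen yet"
    for word in words:
        bigrams[word]  # make sure every word is a key, even the last one
        if prev_word is not None:
            bigrams[prev_word].append(word)
        prev_word = word
    return dict(bigrams)  # plain dict, no defaultdict behaviour leaks out


if __name__ == "__main__":
    print(create_bigram_dict_safe('cake is cake okay'.split()))
    print(create_bigram_dict_safe(['cake']))
    print(create_bigram_dict_safe([]))
```

Because the function accepts any iterable of words, it can be fed a generator in the same way as create_bigram_dict.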

To build the dictionary from a file, you can define get_words(file):

#!/usr/bin/env python3 
import regex as re # $ pip install regex 

def get_words(file): 
    with file: 
        for line in file: 
            words = line.casefold().split() 
            for w in words: 
                yield re.fullmatch(r'\p{P}*(.*?)\p{P}*', w).group(1) 

Usage: create_bigram_dict(get_words(open('filename')))


To strip Unicode punctuation, the \p{P} regex is used. The code preserves punctuation inside words, for example:

>>> import regex as re 
>>> re.fullmatch(r'\p{P}*(.*?)\p{P}*', "doesn't.").group(1) 
"doesn't" 

Note: the dot at the end is gone, but the ' inside the word is preserved. To remove all punctuation, you can use s = re.sub(r'\p{P}+', '', s):

>>> re.sub(r'\p{P}+', '', "doesn't.") 
'doesnt' 

Note: the single quote is gone too.
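
If installing the third-party regex module is not an option, a similar effect can probably be had with the standard library alone, since the built-in re module does not support \p{P}. The sketch below treats a character as punctuation when its Unicode general category starts with 'P', which is what \p{P} refers to. The helper names is_punct, strip_punct and remove_punct are hypothetical and not part of the original answer.

```python
import unicodedata


def is_punct(ch):
    """True if ch is in one of the Unicode punctuation categories (P*)."""
    return unicodedata.category(ch).startswith('P')


def strip_punct(word):
    """Drop leading and trailing punctuation, keep punctuation inside the word."""
    start, end = 0, len(word)
    while start < end and is_punct(word[start]):
        start += 1
    while end > start and is_punct(word[end - 1]):
        end -= 1
    return word[start:end]


def remove_punct(word):
    """Drop every punctuation character, wherever it occurs."""
    return ''.join(ch for ch in word if not is_punct(ch))


if __name__ == "__main__":
    print(strip_punct("doesn't."))
    print(remove_punct("doesn't."))
```

In get_words(), strip_punct(w) could then stand in for the re.fullmatch(...) call.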