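Update dictionary values with the next word in a file?

I want to read a file and create a dictionary in which each word is a key and its value is the list of words that follow it in the file.

For example, if I have a file containing:

'Cake is cake okay.'

The dictionary created should contain:

{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}

So far, my code manages to do the opposite: it updates the dictionary values with the previous word in the file. I'm not sure how to change it so that it works as intended.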

import string

def create_dict(file):
    word_dict = {}
    prev_word = ''

    for line in file:
        for word in line.lower().split():
            clean_word = word.strip(string.punctuation)

            if clean_word not in word_dict:
                word_dict[clean_word] = []

            # appends the *previous* word, the opposite of what is wanted
            word_dict[clean_word].append(prev_word)
            prev_word = clean_word
    return word_dict

Thanks in advance for your help!

Edit

With my latest progress:

def create_dict(file):
    word_dict = {}
    next_word = ''

    for line in file:
        formatted_line = line.lower().split()

        for word in formatted_line:
            clean_word = word.strip(string.punctuation)

            if next_word != '':
                if next_word not in word_dict:
                    word_dict[next_word] = []

            if clean_word == '':
                clean_word.  # unfinished in the original post

            next_word = clean_word
    return word_dict


2016-03-03 FlyingGiraffes

You can use itertools.zip_longest() and dict.setdefault() for a shorter solution:

import io
from itertools import zip_longest  # izip_longest in Python 2
import string

def create_dict(fobj):
    word_dict = {}
    punc = string.punctuation
    for line in fobj:
        clean_words = [word.strip(punc) for word in line.lower().split()]
        # pair each word with its successor; the last word is paired with None
        for word, next_word in zip_longest(clean_words, clean_words[1:]):
            words = word_dict.setdefault(word, [])
            if next_word is not None:
                words.append(next_word)
    return word_dict

Test:

>>> fobj = io.StringIO("""Cake is cake okay.""") 
>>> create_dict(fobj) 
{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}
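
The loop from the question can also be adapted directly: instead of appending the previous word to the current word's list, append the current word to the previous word's list. The following is only a minimal sketch of that idea, not the asker's final code. The name create_dict_prev is made up for illustration, and skipping tokens that consist only of punctuation is an assumption the question does not state.

```python
import io
import string

def create_dict_prev(fobj):
    """Map each word to the list of words that follow it (hypothetical helper)."""
    word_dict = {}
    prev_word = None  # there is no previous word before the first one
    for line in fobj:
        for word in line.lower().split():
            clean_word = word.strip(string.punctuation)
            if not clean_word:
                continue  # assumption: ignore tokens that were only punctuation
            word_dict.setdefault(clean_word, [])
            if prev_word is not None:
                # the current word is the "next word" of the previous one
                word_dict[prev_word].append(clean_word)
            prev_word = clean_word
    return word_dict

print(create_dict_prev(io.StringIO("Cake is cake okay.")))
# {'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}
```

Unlike the per-line zip_longest() version, this variant also links the last word of one line to the first word of the next line, which may or may not be what you want.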


2016-03-04 10:40:08

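Separate the code that produces the words from a given file (splitting on whitespace, case folding, stripping punctuation, etc.) from the code that creates the bigram dictionary (the subject of this question):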

#!/usr/bin/env python3
from collections import defaultdict
from itertools import tee

def create_bigram_dict(words):
    a, b = tee(words)  # itertools' pairwise recipe
    next(b)
    bigrams = defaultdict(list)
    for word, next_word in zip(a, b):
        bigrams[word].append(next_word)
    bigrams[next_word]  # last word may have no following words
    return bigrams

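See the itertools pairwise() recipe. Supporting a file with fewer than two words requires a slight adjustment to the code. If you need the exact type, you can use return dict(bigrams) here. For example: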

>>> create_bigram_dict('cake is cake okay'.split()) 
defaultdict(<class 'list'>, {'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []})
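
One possible way to handle input with fewer than two words is sketched below. It is not the answer's own code: it drops the tee()-based pairwise recipe and simply tracks the previous word, and the names create_bigram_dict_safe and _MISSING are hypothetical.

```python
from collections import defaultdict

_MISSING = object()  # sentinel to detect an empty input (hypothetical name)

def create_bigram_dict_safe(words):
    """Build the bigram dict, tolerating zero or one input word."""
    it = iter(words)
    bigrams = defaultdict(list)
    prev = next(it, _MISSING)
    if prev is _MISSING:
        return bigrams  # no words at all: empty result
    for word in it:
        bigrams[prev].append(word)
        prev = word
    bigrams[prev]  # make sure the last (or only) word has an entry
    return bigrams

print(dict(create_bigram_dict_safe([])))        # {}
print(dict(create_bigram_dict_safe(['cake'])))  # {'cake': []}
print(dict(create_bigram_dict_safe('cake is cake okay'.split())))
# {'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}
```

Because it only calls iter() on its argument, this sketch accepts a generator such as get_words(file) as well as a list.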

To build the dictionary from a file, you can define get_words(file):

#!/usr/bin/env python3
import regex as re  # $ pip install regex

def get_words(file):
    with file:
        for line in file:
            words = line.casefold().split()
            for w in words:
                # strip leading and trailing Unicode punctuation from each word
                yield re.fullmatch(r'\p{P}*(.*?)\p{P}*', w).group(1)

Usage: create_bigram_dict(get_words(open('filename'))).

The \p{P} regex is used to strip Unicode punctuation. The code preserves punctuation inside words, for example:

>>> import regex as re 
>>> re.fullmatch(r'\p{P}*(.*?)\p{P}*', "doesn't.").group(1) 
"doesn't"

Note: the dot at the end is gone, but the ' inside the word is preserved. To remove all punctuation, you can use s = re.sub(r'\p{P}+', '', s):

>>> re.sub(r'\p{P}+', '', "doesn't.") 
'doesnt'

Note: the single quote is gone too.
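
If installing the third-party regex module is not an option, the edge-stripping behaviour can be approximated with the standard library alone. The sketch below assumes that "punctuation" means any character whose Unicode general category starts with P, which is what \p{P} matches; strip_punctuation is a hypothetical helper name, not part of the answer above.

```python
import unicodedata

def strip_punctuation(word):
    """Strip leading and trailing Unicode punctuation (categories P*) from word."""
    start, end = 0, len(word)
    # advance past punctuation at the start
    while start < end and unicodedata.category(word[start]).startswith('P'):
        start += 1
    # back up over punctuation at the end
    while end > start and unicodedata.category(word[end - 1]).startswith('P'):
        end -= 1
    return word[start:end]

print(strip_punctuation("doesn't."))  # doesn't
print(strip_punctuation("«hello»"))   # hello
```

As with the regex version, punctuation inside the word (the apostrophe in "doesn't") is left alone.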


2016-03-04 21:03:42 jfs
