Storing 3 different variables (dictionary or list) while iterating over documents?

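I'm iterating over hundreds of thousands of words across several documents, hoping to find the frequency of English contractions. I've already formatted the documents appropriately; now it's a matter of writing the right function and storing the data correctly. For each document in which contractions are found, I need to store which contractions occurred and how often they were used in that document. Ideally, my data frame would look like the following: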

filename  contraction  count
file1     it's         34
file1     they're      13
file1     she's        9
file2     it's         14
file2     we're        15
file3     it's         4
file4     it's         45
file4     she's        13

What's the best way to go about this?

Edit: here is my code so far:

for i in contractions_list:  # for each of the 144 contractions in my list
    for l in every_link:  # for each speech
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if i in contractions:
                count = count + 1

where processURL_short() is a function I wrote that scrapes a website and returns a speech as a str.

EDIT2：

link_store = {}
for i in contractions_list_test:  # for each of the 144 contractions
    for l in every_link_test:  # for each speech
        link_store[l] = {}
        count = 0
        word_count = 0
        content_2 = processURL_short(l)
        for word in content_2.split():
            word = word.strip(p)
            word_count = word_count + 1
            if word == i:
                count = count + 1
        if count: link_store[l][i] = count
        print i,l,count

Here is my file-naming code:

splitlink = l.split("/") 
president = splitlink[4] 
speech_num = splitlink[-1] 
filename = "{0}_{1}".format(president,speech_num)

Source

2015-11-03 blacksite

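How large is the total input stream? Feeding the dictionary from a generator stream may be your best solution. – Prune

If I understand correctly what you mean by input stream, there is a stream of 900 text files (none longer than 25,000 words, averaging about 10,000), and there are 144 contractions in my dictionary. – blacksite

Right. In that case, there's no need to change the rest of your code at this point. If you do get larger files, consider learning how to write Python generators (see the **yield** statement); you can save runtime memory without sacrificing much speed (usually within 10%, sometimes even faster). – Prune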
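As an illustration of the generator idea mentioned in the comment above, here is a minimal sketch in which only one speech is held in memory at a time. `fetch_speech` is a hypothetical stand-in for `processURL_short()`, and the link names and texts are made up for the example.

```python
def fetch_speech(link):
    # Hypothetical stand-in for processURL_short(): returns canned text.
    samples = {
        "speech-1": "It's late and we're tired",
        "speech-2": "They're here and it's fine",
    }
    return samples[link]


def iter_speeches(links):
    """Yield (link, text) pairs lazily, one speech at a time."""
    for link in links:
        yield link, fetch_speech(link)


# Each text is fetched only when the loop asks for the next pair.
word_totals = {}
for link, text in iter_speeches(["speech-1", "speech-2"]):
    word_totals[link] = len(text.split())

print(word_totals)
```

With 900 files of roughly 10,000 words each this is unlikely to matter much, which matches the comment that no change is needed at this size.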

You could set up your structure like this:

links = {}

for l in every_link:
    links[l] = {}
    for i in contractions_list:
        count = 0
        ...  # here is where you do your count, which you seem to know how to do
        ...  # note that in your code, I think you meant `if i in word` / `if i == word` for your final if statement
        if count: links[l][i] = count  # only adds the value if count is not 0

You will end up with a data structure like this:

links = { 
'file1':{ 
    "it's":34, 
    "they're":14, 
    ..., 
    }, 
'file2':{ 
    ...., 
    }, 
..., 
}

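You can then easily iterate over it to write the necessary data to a file (again, I'm assuming you know how to do that, since it doesn't appear to be part of the question).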
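For completeness, one way to flatten such a nested dictionary into filename/contraction/count rows is sketched below. The sample values, the `write_counts` helper, and the tab-separated layout are assumptions for illustration; the standard `csv` module does the writing.

```python
import csv
import io

# Sample data in the nested shape shown above (values are illustrative).
links = {
    "file1": {"it's": 34, "they're": 13},
    "file2": {"it's": 14, "we're": 15},
}


def write_counts(links, out):
    """Write one (filename, contraction, count) row per nested entry."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["filename", "contraction", "count"])
    for filename in sorted(links):
        for contraction, count in sorted(links[filename].items()):
            writer.writerow([filename, contraction, count])


# An in-memory buffer keeps the example self-contained; a real run could
# pass open("counts.tsv", "w", newline="") instead.
buffer = io.StringIO()
write_counts(links, buffer)
print(buffer.getvalue())
```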

Source

2015-11-03 20:59:48

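See my edit above. I think it would be better to iterate over all the documents for each contraction. However, I get an empty dictionary! – blacksite

Can you print the values of 'contractions_list_test', 'every_link_test', and 'content_2'? It may have something to do with those values –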

'contractions_list_test' looks like '["she'll", "shouldn't", "she'd", "don't", "somebody's", "should've", "won't", "that'll", "who'll", "he's", "when's", "we've", "somebody's", "he'd", "ma'am"]', 'every_link_test' looks like '['http://www.millercenter.org/president/obama/speeches/speech-4427', 'http://www.millercenter.org/president/obama/speeches/speech-4424', 'http://www.millercenter.org/president/obama/speeches/speech-4453']', and 'content_2' is just a body of text, something like '"I'm the president, I don't want to go home, don't go over there."' – blacksite
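One likely cause of the empty dictionary is that, with the contraction loop on the outside, `link_store[l] = {}` runs again for every contraction and discards whatever was stored for that link earlier. The sketch below reproduces that effect on made-up data and shows one possible fix using `dict.setdefault`; the sample list and speech are assumptions, not the asker's data.

```python
contractions = ["it's", "we're", "ma'am"]
speeches = {"speech-1": "it's fine and we're done"}

# Pattern from Edit 2: the inner dict is re-created for every contraction.
buggy = {}
for c in contractions:
    for link, text in speeches.items():
        buggy[link] = {}  # wipes counts stored for earlier contractions
        count = text.split().count(c)
        if count:
            buggy[link][c] = count

# Possible fix: create each inner dict only once, then keep filling it.
fixed = {}
for c in contractions:
    for link, text in speeches.items():
        count = text.split().count(c)
        if count:
            fixed.setdefault(link, {})[c] = count

print(buggy)
print(fixed)
```

Because the last contraction in the list ("ma'am") does not occur in the sample speech, the buggy version ends with an empty inner dictionary, while the fixed version keeps both counts.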

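Dictionaries seem to be the best choice here, since they will let you manipulate your data more easily. Your goal should be to index the results by the filename extracted from link (the URL of your speech text), with each filename mapped to a mapping of contractions to their counts.

Something like: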

{"file1": {"it's": 34, "they're": 13, "she's": 9}, 
"file2": {"it's": 14, "we're": 15}, 
"file3": {"it's": 4}, 
"file4": {"it's": 45, "she's": 13}}

Here is the complete code:

ret = {}
for link, text in ((l, processURL_short(l)) for l in every_link):
    contractions = {c: 0 for c in contractions_list}
    for word in text.split():
        try:
            contractions[word] += 1
        except KeyError:
            # Word or contraction not found.
            pass
    ret[file_naming_code(link)] = contractions

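Let's go through each step.

First we initialize ret, which will be the resulting dictionary. Then we use a generator expression so that processURL_short() runs once per step of the loop (rather than for the whole list of links up front). It yields tuples of (<link-name>, <speech-text>) so that we can use the link name later.
Next comes the contraction-count mapping, initialized to 0s, which is used to count the contractions.
Then we split the text into words. For each word we look it up in the contractions mapping: if it is found we count it, otherwise a KeyError is raised for each key that is not found.

(Another answer points out that this will perform poorly; an alternative is to check with in, as in word in contractions.)
Finally: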
```
ret[file_naming_code(link)] = contractions 
```
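Now ret is a dictionary mapping each filename to the contractions that occur in it, and you can easily create your table from it.

Here is how you can produce your output: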

print('\t'.join(('filename', 'contraction', 'count')))
for link, counts in ret.items():
    for name, count in counts.items():
        # str.join() accepts only strings, so convert the integer count
        print('\t'.join((link, name, str(count))))
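The membership-test alternative mentioned in the parenthetical note could look roughly like the sketch below. The helper name `count_contractions`, the lowercasing, and the punctuation string (standing in for the asker's `p`) are assumptions for illustration.

```python
def count_contractions(text, contractions_list, punctuation=".,;:!?\"()"):
    """Count known contractions in text, skipping every other word."""
    known = set(contractions_list)  # set lookups avoid scanning a list
    counts = {}
    for word in text.lower().split():
        word = word.strip(punctuation)
        if word in known:  # no exception raised for ordinary words
            counts[word] = counts.get(word, 0) + 1
    return counts


sample = "It's true, it's late; they're home. Don't wait!"
print(count_contractions(sample, ["it's", "they're", "she's"]))
```

Only contractions that actually occur end up in the result, so no zero-count rows need to be filtered out when printing the table.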

Source

2015-11-03 21:07:18 jvdm

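Opening and reading are slow operations: don't loop through the entire file list 144 times.

Exceptions are slow: throwing an exception for every non-contraction in every speech will be heavy.

Don't loop over the contraction list to check each word. Instead, use the built-in in operator to see whether the word is in the list, then use a dictionary to tally the entries, just as you might do by hand.

Go through the file word by word. When you see a word that is on the contraction list, check whether it is already on your tally sheet. If it is, add a mark; if not, add it to the sheet with a count of 1.

Here is an example. I made up some very short speeches and a trivial processURL_short function.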

def processURL_short(string):
    return string.lower()

every_link = [
    "It's time for going to Sardi's",
    "We're in the mood; it's about DST",
    "They're he's it's don't",
    "I'll be home for Christmas"]

contraction_list = [
    "it's",
    "don't",
    "can't",
    "i'll",
    "he's",
    "she's",
    "they're"
]

for l in every_link:  # for each speech
    contraction_count = {}
    content = processURL_short(l)

    for word in content.split():
        if word in contraction_list:
            if word in contraction_count:
                contraction_count[word] += 1
            else:
                contraction_count[word] = 1

    for key, value in contraction_count.items():
        print key, '\t', value

Source

2015-11-03 21:28:04 Prune

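Okay. This approach works very well, but I would still like to store each speech's "filename" together with each contraction, so that my dictionary looks like the table mentioned above. – blacksite

Ah, so we're counting contractions per speech rather than overall. Fine. Move **contraction_count = {}** to before the first **for** statement, then put your code into that loop as needed. – Prune

How do I get the 'filename' associated with the contractions found in each individual document? I'll be iterating over 900 speeches (contained in 'every_link') and want to pair each document's filename with its contraction counts. – blacksite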
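One possible way to pair each filename with its counts is to key an outer dictionary by the name built with the asker's own naming code. The sketch below assumes Python 3 and uses `collections.Counter`; `make_filename`, `count_by_file`, and the canned page texts are hypothetical, with a plain dictionary lookup standing in for `processURL_short()`.

```python
from collections import Counter


def make_filename(link):
    # Mirrors the asker's naming code: president and speech id from the URL.
    parts = link.split("/")
    return "{0}_{1}".format(parts[4], parts[-1])


def count_by_file(every_link, contractions_list, fetch):
    """Map each filename to a Counter of the contractions found in it."""
    known = set(contractions_list)
    results = {}
    for link in every_link:
        words = (w.strip(".,;:!?\"") for w in fetch(link).lower().split())
        counts = Counter(w for w in words if w in known)
        if counts:  # skip speeches with no contractions at all
            results[make_filename(link)] = counts
    return results


# Stand-in for processURL_short(): canned text keyed by URL (made-up content).
fake_pages = {
    "http://www.millercenter.org/president/obama/speeches/speech-4427":
        "It's time. We're ready, and it's ours.",
    "http://www.millercenter.org/president/obama/speeches/speech-4424":
        "Nothing to count here.",
}
table = count_by_file(list(fake_pages), ["it's", "we're"], fake_pages.get)
for filename, counts in sorted(table.items()):
    for contraction, count in sorted(counts.items()):
        print("\t".join((filename, contraction, str(count))))
```

Each speech is fetched once, and the three values from the desired table (filename, contraction, count) fall out of the two nested loops at the end.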
