Frequency of a word in a Wikipedia article

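How can I get the frequency of a specific word in a Wikipedia article without storing the whole article and then processing it? For example, how many times does the word "India" occur in this article: https://simple.wikipedia.org/wiki/India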

2017-10-11 Sarthak Gupta

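Here is a simple-minded example that reads the web page line by line. There is no guarantee, however, that the HTML is broken into lines. (In this case it is, into more than 1300 of them.)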

import re
import urllib.request
from collections import Counter

URL = 'https://simple.wikipedia.org/wiki/India'

counter = Counter()

# Stream the response so only one line is held in memory at a time.
with urllib.request.urlopen(URL) as source:
    for line in source:
        # Split on every run of non-letters; the counts stay case-sensitive.
        words = re.split(r"[^A-Z]+", line.decode('utf-8'), flags=re.I)
        counter.update(words)

for word in ['India', 'Indian', 'Indians']:
    print('{}: {}'.format(word, counter[word]))

Output

> python3 test.py 
India: 547 
Indian: 75 
Indians: 11 
>
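Since the HTML is not guaranteed to contain line breaks, one possible variation is to read fixed-size chunks instead and carry any word fragment at the end of a chunk over to the next one. The following is only a sketch of that idea: the names `count_words_streaming` and `read_chunks` and the 8192-byte chunk size are assumptions made for illustration, not part of the original answer.

```python
import codecs
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")
TRAILING_WORD = re.compile(r"[A-Za-z]+\Z")


def count_words_streaming(byte_chunks, encoding="utf-8"):
    """Count alphabetic words in an iterable of byte chunks."""
    # The incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    counter = Counter()
    tail = ""  # word fragment that may continue in the next chunk
    for chunk in byte_chunks:
        text = tail + decoder.decode(chunk)
        # If the text ends in letters, hold that fragment back for now.
        match = TRAILING_WORD.search(text)
        tail = match.group() if match else ""
        if tail:
            text = text[:-len(tail)]
        counter.update(WORD.findall(text))
    # Flush whatever is left once the input is exhausted.
    tail += decoder.decode(b"", final=True)
    counter.update(WORD.findall(tail))
    return counter


def read_chunks(stream, size=8192):
    """Yield byte chunks of at most `size` bytes from a file-like object."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        yield chunk
```

To try it on the page above, pass `read_chunks(source)` to `count_words_streaming`, where `source` is the object returned by `urllib.request.urlopen(URL)`. Like the line-based version, it keeps only the counter and one chunk in memory.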

Note that this approach also counts terms that appear in the HTML structure of the page, not just in the content.
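One way to leave the markup out of the counts, while still processing the page piece by piece, might be the standard library's `html.parser.HTMLParser`, which accepts input incrementally through `feed()`. The sketch below is an assumption-laden illustration rather than the answer's method: the class name `TextWordCounter` is made up here, and it still counts navigation menus and other visible text outside the article body.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextWordCounter(HTMLParser):
    """Count words in text nodes only, skipping tags, attributes, scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.counter = Counter()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self._tail = ""       # word fragment that may continue in the next feed()

    def _flush(self):
        # A tag or the end of input terminates any pending word fragment.
        if self._tail:
            self.counter[self._tail] += 1
            self._tail = ""

    def handle_starttag(self, tag, attrs):
        self._flush()
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        self._flush()
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = self._tail + data
        # Hold back trailing letters in case the word continues in later input.
        match = re.search(r"[A-Za-z]+\Z", text)
        self._tail = match.group() if match else ""
        if self._tail:
            text = text[:-len(self._tail)]
        self.counter.update(re.findall(r"[A-Za-z]+", text))

    def close(self):
        super().close()
        self._flush()
```

Each decoded line or chunk of the response can be passed to `feed()`, followed by one call to `close()`; the totals are then available in the `counter` attribute.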

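If you want to focus on the content, consider the Pywikibot Python library, which uses the preferred MediaWiki API to extract content, although it appears to be based on the "a complete page at a time" model that you are trying to avoid. In any case, that module's documentation points to a list of similar but more advanced packages that you might want to look at.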
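For comparison, here is a rough sketch of asking the MediaWiki API directly for a plain-text version of the page, without Pywikibot. It assumes the wiki exposes the `extracts` property (provided by the TextExtracts extension) with the `explaintext` option and the `formatversion=2` response layout; the function names and the User-Agent string are placeholders invented for this example. It also downloads the whole extract in one response, so it follows the same page-at-a-time model.

```python
import json
import re
import urllib.parse
import urllib.request
from collections import Counter

API = "https://simple.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a query URL asking for a plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",   # assumed to be available on this wiki
        "explaintext": 1,
        "titles": title,
        "format": "json",
        "formatversion": 2,
    }
    return API + "?" + urllib.parse.urlencode(params)


def count_words_in_response(payload):
    """Count alphabetic words in the 'extract' field of every returned page."""
    counter = Counter()
    for page in json.loads(payload)["query"]["pages"]:
        counter.update(re.findall(r"[A-Za-z]+", page.get("extract", "")))
    return counter


def fetch_counts(title):
    """Download the extract for `title` and count its words (needs network access)."""
    request = urllib.request.Request(
        build_extract_url(title),
        headers={"User-Agent": "word-frequency-example/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request) as response:
        return count_words_in_response(response.read().decode("utf-8"))
```

Calling `fetch_counts('India')` would return a `Counter` that can be indexed the same way as in the line-by-line example.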

Source

2017-10-11 08:28:50 cdlane
