Parsing with Beautiful Soup and filtering stop words in Python

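I'm parsing specific information from a website into a file. So far I have a program that looks at a web page, finds the right HTML tag, and parses out the right content. Now I want to filter these "results" further.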

For example, on this site: http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

I parse the ingredients, which are located in the <div class="ingredients" ...> tag. The parser does that job well, but I want to process the results further.

When I run this parser, it removes numbers, symbols, commas, and slashes (\ or /) but keeps all the text. When I run it on the site, I get results like this:

cup olive oil 
cup chicken broth 
cloves garlic minced 
tablespoon paprika

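Now I want to process this further by removing stop words such as "cup", "cloves", "minced", and "tablespoon", among others. How exactly do I do that? The code is written in Python, which I'm not very good at; I'm only using this parser to collect information that I could enter by hand, but I'd rather not.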

Any detailed help on how to do this would be much appreciated! My code is below.

Code:

import urllib2 
import BeautifulSoup 

def main(): 
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx" 
    data = urllib2.urlopen(url).read() 
    bs = BeautifulSoup.BeautifulSoup(data) 

    ingreds = bs.find('div', {'class': 'ingredients'}) 
    ingreds = [s.getText().strip('123456789.,/\ ') for s in ingreds.findAll('li')] 

    fname = 'PorkRecipe.txt' 
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__": 
    main()


2011-04-12 Eric

import urllib2 
import BeautifulSoup 
import string 

badwords = set([ 
    'cup','cups', 
    'clove','cloves', 
    'tsp','teaspoon','teaspoons', 
    'tbsp','tablespoon','tablespoons', 
    'minced' 
]) 

def cleanIngred(s): 
    # remove leading and trailing whitespace 
    s = s.strip() 
    # strip digits and punctuation from both ends of the string
    s = s.strip(string.digits + string.punctuation)
    # remove unwanted words
    return ' '.join(word for word in s.split() if word not in badwords)

def main(): 
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx" 
    data = urllib2.urlopen(url).read() 
    bs = BeautifulSoup.BeautifulSoup(data) 

    ingreds = bs.find('div', {'class': 'ingredients'}) 
    ingreds = [cleanIngred(s.getText()) for s in ingreds.findAll('li')] 

    fname = 'PorkRecipe.txt' 
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__": 
    main()

Result

olive oil 
chicken broth 
garlic, 
paprika 
garlic powder 
poultry seasoning 
dried oregano 
dried basil 
thick cut boneless pork chops 
salt and pepper to taste

I don't know why it leaves the comma; s.strip(string.punctuation) should have taken care of that.


2011-04-12 03:00:53

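Hey, it works! I don't know why it leaves the comma in either. But thanks for your help. I'm not very familiar with Python; I've only been using it for about two weeks. So you put the bad words in a set as stop words, each line gets split, and a word is kept only if it isn't in "badwords"? – Eric 2011-04-12 03:45:06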
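That reading of the filter can be checked by tracing one line through the same steps. This is only a minimal sketch: the sample string and the helper name `trace_filter` are made up for illustration, and it is written for Python 3 rather than the Python 2 used in the answer.

```python
import string

# Same stop-word set as in the answer.
badwords = {
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced',
}

def trace_filter(line):
    """Hypothetical helper: return every intermediate step of the cleaning."""
    # strip whitespace, then digits and punctuation, from both ends only
    stripped = line.strip().strip(string.digits + string.punctuation)
    # split the remaining text on whitespace
    words = stripped.split()
    # keep only the words that are not in the stop-word set
    kept = [w for w in words if w not in badwords]
    return stripped, words, kept, ' '.join(kept)

stripped, words, kept, result = trace_filter('1/4 cup olive oil')
print(repr(stripped))  # ' cup olive oil'
print(words)           # ['cup', 'olive', 'oil']
print(kept)            # ['olive', 'oil']
print(result)          # olive oil
```

Note that the set lookup is case-sensitive, so a capitalized "Cup" would pass through unless the word is lowercased before the check.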

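strip only removes characters from the beginning or end of a string; when "garlic, minced" goes through strip, the comma is in the middle of it – tato 2013-04-12 15:22:29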
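Given that strip only touches the ends, one possible way to get rid of the stray comma is to replace digits and punctuation everywhere in the string before splitting. The following is a sketch under assumptions, not the answer's original code: it targets Python 3 (for `str.maketrans`), and the names `clean_ingred` and `_TO_SPACE` are hypothetical.

```python
import string

# Same stop-word set as in the answer.
badwords = {
    'cup', 'cups',
    'clove', 'cloves',
    'tsp', 'teaspoon', 'teaspoons',
    'tbsp', 'tablespoon', 'tablespoons',
    'minced',
}

# Map every digit and punctuation character to a space, so that
# hyphenated words such as "thick-cut" are split rather than glued together.
_UNWANTED = string.digits + string.punctuation
_TO_SPACE = str.maketrans(_UNWANTED, ' ' * len(_UNWANTED))

def clean_ingred(s):
    """Hypothetical variant of cleanIngred that also handles mid-string punctuation."""
    # replace digits and punctuation anywhere in the string, not just at the ends
    s = s.translate(_TO_SPACE)
    # split() discards the extra spaces; drop the stop words and rejoin
    return ' '.join(word for word in s.split() if word not in badwords)

print(clean_ingred('3 cloves garlic, minced'))  # garlic
print(clean_ingred('1/4 cup olive oil'))        # olive oil
```

Mapping to a space instead of deleting the characters is a design choice: it keeps neighbouring words apart, at the cost of also splitting anything that legitimately contains punctuation.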
