How to get text tokens when using BeautifulSoup

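I am new to text mining and am working on a toy project that pulls the text from a website and splits it into tokens. However, after downloading the content with BeautifulSoup, I could not split it with the .split method using the code below: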

# -*- coding: utf-8 -*- 
import nltk 
import operator 
import urllib3 
from bs4 import BeautifulSoup 

http = urllib3.PoolManager() 
url= 'http://python.org/' 
response = http.request('GET',url) 
# nltk.clean_html was dropped by NLTK
clean = BeautifulSoup(response.data,"html5lib") 
# clean will have entire string removing all the html noise 
tokens = [tok for tok in clean.split()] 
print tokens[:100]

When I call split, Python tells me:

TypeError: 'NoneType' object is not callable

According to a previous Stack Overflow question, this is because:

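clean is not a string, it is a bs4.element.Tag. When you try to look up split on it, it does its magic and tries to find a child element named split, but there is none. You are then calling that None.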
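A minimal sketch of that behaviour, assuming Python 3 with bs4 installed. The HTML snippet and the variable names are made up for illustration and are not part of the original question:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document used only for this illustration
html = "<html><body><p>Hello, token world!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Attribute access on a soup object is treated as a search for a child tag,
# so soup.split looks for a <split> element and finds none.
missing = soup.split

# get_text() returns a plain string, which does have a split() method
tokens = soup.get_text().split()
```

Calling `soup.split()` therefore amounts to calling `None()`, which is where the TypeError comes from.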

In this case, how should I adjust my code to achieve my goal of getting the tokens? Thanks.

Source

2017-09-05 zlqs1985

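It almost seems to me that you have not read the BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. There is no single method that extracts tokens from a page in a useful way. Each page has to be studied individually.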

Possible duplicate of [BeautifulSoup Grab Visible Webpage Text](https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text) – Kos

You can use get_text() to return just the text from the HTML, and pass it to NLTK's word_tokenize() as follows:

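from bs4 import BeautifulSoup
import requests
import nltk

# word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt')
response = requests.get('http://python.org/').content
soup = BeautifulSoup(response, "html.parser")
# get_text() returns the document text with the tags stripped
text_tokens = nltk.tokenize.word_tokenize(soup.get_text())

print(text_tokens)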

(You could also use urllib3 to fetch your data.)
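For the urllib3 route, something along these lines should work. This is only a sketch assuming Python 3; the helper names `html_to_text` and `fetch_text` are hypothetical and do not come from the question or the answer:

```python
import urllib3
from bs4 import BeautifulSoup


def html_to_text(data):
    """Return the text of an HTML document (bytes or str) with tags stripped."""
    return BeautifulSoup(data, "html.parser").get_text()


def fetch_text(url):
    """Download url with urllib3 and return its text content."""
    http = urllib3.PoolManager()
    response = http.request('GET', url)
    # response.data holds the raw bytes of the body
    return html_to_text(response.data)
```

The string returned by `fetch_text('http://python.org/')` can then be passed to `nltk.tokenize.word_tokenize()` exactly as above.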

This gives you something to start from:

[u'Welcome', u'to', u'Python.org', u'{', u'``', u'@', u'context', u"''", u':'...

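If you are only interested in words, you can go one step further and filter the returned list to remove entries that consist only of punctuation, for example: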

import re, string
text_tokens = [t for t in text_tokens if not re.match('[' + re.escape(string.punctuation) + ']+$', t)]
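The same filter can be wrapped in a small reusable function. This is a sketch assuming Python 3; the names `PUNCT_ONLY` and `drop_punctuation` are made up for illustration:

```python
import re
import string

# Matches strings made up of punctuation characters; re.escape keeps
# characters such as ], \ and ^ from breaking the character class.
PUNCT_ONLY = re.compile('[' + re.escape(string.punctuation) + ']+')


def drop_punctuation(tokens):
    """Return the tokens that are not made up entirely of punctuation."""
    return [t for t in tokens if not PUNCT_ONLY.fullmatch(t)]


sample = ['Welcome', 'to', 'Python.org', '{', '``', '@', 'context', "''", ':']
words = drop_punctuation(sample)
```

Using `fullmatch` keeps tokens such as `Python.org`, which contain punctuation but are not only punctuation.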

Source

2017-09-05 15:50:51

Thanks, that solved my problem – zlqs1985
