Cleaning HTML tags using readability

I am using readability to retrieve some HTML pages. I need to get the body text of an HTML page without the HTML tags. Can I do this with readability?

2017-02-15 Mehdi

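Do you want the whole page or a specific part? Does the page itself have no HTML tags, or do you want the extracted text to be free of HTML tags? – celestialroad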

I want to extract the text so that it contains no HTML tags. – Mehdi

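After digging through the readability source code, I found that although the library's cleaners module does appear to offer a way to clean HTML, the method used to retrieve the content (via lxml) stores it as unicode text. This is a problem, because the method that cleaners uses to remove HTML tags raises an AttributeError when called on a unicode object: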

import requests 
from readability import Document 

response = requests.get('http://example.com') 
doc = Document(response.text) 
doc.summary() 
# raw content of HTML page with tags 
doc.get_clean_html() 
# AttributeError: 'unicode' object has no attribute 'get_clean_html'

It appears that this package has not seen active development for some time, and as a result it has a number of bugs.

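BeautifulSoup is another, better-maintained library that can do everything readability does. There is also an excellent answer to the same problem that uses BeautifulSoup instead. That is the long-term solution.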
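As a rough illustration of the BeautifulSoup route, here is a minimal sketch. It assumes Python 3 with the beautifulsoup4 package installed. The helper name html_to_text is hypothetical, and this is not necessarily the approach taken in the linked answer.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML fragment, with tags removed."""
    # "html.parser" is the parser bundled with Python, so no extra dependency.
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents do not leak into the text.
    for element in soup(["script", "style"]):
        element.decompose()
    # Join the remaining text nodes with single spaces, trimming each one.
    return soup.get_text(separator=" ", strip=True)
```

Passing the string returned by doc.summary() to this helper, as in html_to_text(doc.summary()), should give the article text without markup.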

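In the short term, depending on how complex the page is, you can use re to remove all of the HTML tags and leave just the text, as shown below using my own website: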

import re 
import requests 
from readability import Document 

response = requests.get('http://ryanmcginnis.co/') 
doc = Document(response.text) 
cleanme = doc.summary() 
print(re.sub('<.*?>', '', cleanme))

This program returns the plain text from my website.
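One caveat with the regex approach is that a pattern like '<.*?>' can trip on a '>' inside an attribute value, and it leaves HTML entities such as &amp; undecoded. If that matters, a parser-based variant using only the standard library might look like the sketch below. It assumes Python 3, and the names TextExtractor and strip_tags are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())
```

In the script above, print(strip_tags(cleanme)) would take the place of the re.sub call.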

2017-02-15 04:58:12 celestialroad
