Python and HTMLParser.handle_data() - how to get data from a tag?

I want to parse a web page with Python's HTMLParser. I want to get the content of a tag, but I don't know how to do it. This is my code so far:

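import urllib.request
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered some data:", data)


url = "website"
page = urllib.request.urlopen(url).read()

# strict=False is omitted: the parameter no longer exists in Python 3.5+.
parser = MyHTMLParser()
# read() returns bytes; decode them rather than calling str(page), which
# would feed the literal "b'...'" repr. UTF-8 is an assumed page encoding.
parser.feed(page.decode("utf-8", errors="replace"))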

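If I understand correctly, I can use the handle_data() function to get the data between tags. How do I specify which tag to get the data from? And how do I actually get the data?

Source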

2011-12-13 user1049697
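
For context on the question above: handle_data() receives only the text, not the name of the tag that encloses it, so the parser has to track the current tag itself through handle_starttag() and handle_endtag(). The following is a minimal sketch that just logs the order in which the callbacks fire; the class name EventLogger and the sample HTML string are made up for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every callback in the order it fires."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Only the text arrives here, never the enclosing tag name.
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<h2>Title</h2><p>Body</p>")
logger.close()
for event in logger.events:
    print(event)
```

Each piece of text shows up between the start and end events of its tag, which is what makes a "remember the current tag" flag workable.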

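I suggest you use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/), since it has a very friendly interface. – jcollado

Not just because of the friendly interface: it is also much more forgiving of the malformed/incorrect HTML you will see on web pages in the wild. – babbageclunk

I tried BeautifulSoup. It choked on the page I'm parsing. What do you do when even BeautifulSoup doesn't work? :) – user1049697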

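import urllib.request

# MyHTMLParser is the subclass defined in the question.
html_code = urllib.request.urlopen("xxx")
data = ""
for raw_line in html_code.readlines():
    # Lines arrive as bytes; decode before comparing with str (UTF-8 assumed).
    line = raw_line.decode("utf-8", errors="replace").strip()

    # Keep only the lines that begin with an <h2 tag.
    if line.startswith("<h2"):
        data = data + line

hp = MyHTMLParser()
hp.feed(data)
hp.close()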

This way you can extract the data from the h2 tags. Hope it helps.

Source

2012-01-12 04:45:26 Yanan

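Not good! Don't parse HTML this way! – Dan

What is the best way to parse HTML? I tried HTMLParser, and it parses very slowly. – Yanan

I don't have time to format/clean this up, but this is how I usually do it: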

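import os
from html.parser import HTMLParser
from urllib.parse import urlparse


class HTMLParse(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Report <a href="..."> links whose path ends in an image extension.
        if tag.lower() == "a":
            for name, value in attrs:
                # value is None for attributes written without a value.
                if name.lower() == "href" and value:
                    path = urlparse(value).path
                    ext = os.path.splitext(path)[1]
                    if ext.lower() in (".jpeg", ".jpg", ".png", ".bmp"):
                        print("Found: " + value)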

Source

2012-01-12 05:01:01 user393899
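
A possible way to try this attribute-based approach without a network request is to feed the parser a literal string and collect the matches in a list instead of printing them. This is only a sketch: the class name ImageLinkCollector, the IMAGE_EXTENSIONS constant, and the sample links are invented for illustration.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse

# Extensions treated as images (same set as in the answer above).
IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png", ".bmp")


class ImageLinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values that point at image files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Look at the URL path only, so query strings are ignored.
                ext = os.path.splitext(urlparse(value).path)[1]
                if ext.lower() in IMAGE_EXTENSIONS:
                    self.links.append(value)


sample = (
    '<a href="/img/cat.PNG?size=large">cat</a>'
    '<a href="/about.html">about</a>'
    '<a href="http://example.com/dog.jpg">dog</a>'
)
collector = ImageLinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Collecting into a list rather than printing makes the result easier to reuse or check afterwards.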

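from html.parser import HTMLParser


class HTMLParse(HTMLParser):
    def __init__(self):
        super().__init__()
        self.recordh2 = False  # True while the parser is inside an <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.recordh2 = True

    def handle_endtag(self, tag):  # end tags carry no attributes
        if tag == "h2":
            self.recordh2 = False

    def handle_data(self, data):
        if self.recordh2:
            print(data)  # do your work here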

Source

2014-01-10 20:47:47 hwang
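
As a rough, self-contained illustration of the flag idea above using only HTMLParser, the sketch below collects the text of every occurrence of one chosen tag, including text inside nested inline tags. The class name TagTextCollector, its target_tag parameter, and the sample HTML are assumptions made for the example, not part of the original answer.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside one kind of tag."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag.lower()
        self.depth = 0      # greater than 0 while inside the target tag
        self.texts = []     # one entry per completed target element
        self._chunks = []   # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # The outermost target element just closed: store its text.
                self.texts.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self._chunks.append(data)


html = "<h1>Site</h1><h2>First <em>heading</em></h2><p>text</p><h2>Second</h2>"
collector = TagTextCollector("h2")
collector.feed(html)
collector.close()
print(collector.texts)
```

A counter is used instead of a boolean so that a target tag nested inside itself does not switch recording off too early.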

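Is there a way to retrieve only the data between tags? I mean, everyone suggests using BS or lxml, but if possible I would like to try HTMLParser, because my application is very simple (and I want to learn to do simple things in a command-line interface) ... –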
