Parsing HTML attributes using R or Python

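I have a bunch of HTML files containing <span> tags with style="font-size:...px". I want to automatically find the <span> with the largest font size and get the text between the span tags. Preferably in R or Python, but other approaches are welcome too. Any ideas?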

Source

2017-01-23 vdvaxel

For Python 3, you can use html.parser. (For Python 2.x, look at the HTMLParser module instead.)
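If the same script has to run under both interpreters, a guarded import is one way to pick the right module. This is just a minimal sketch; it assumes nothing beyond the two standard-library module names mentioned above.

```python
# Import HTMLParser from whichever module the running interpreter provides.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.x
```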

An example:

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):

    def __init__(self, min_span):
        HTMLParser.__init__(self)

        # Keep track of our maximum entry thus far
        self.max_span = min_span  # set a minimum font size if you like, or just use 0
        self.max_text = []  # to keep track of many entries

        # This flags to the object to get data if we found a span tag
        # with a new highest (or equally high) font-size
        self.recording = False

    def handle_starttag(self, tag, attrs):
        # Ignore all other tags
        if tag != 'span':
            return

        for name, value in attrs:
            # Skip other attributes, and a bare `style` with no value
            if name != 'style' or value is None:
                continue

            for css_style in value.split(";"):
                sub_attrib = css_style.split(":")
                if len(sub_attrib) != 2 or sub_attrib[0].strip() != 'font-size':
                    continue

                # Assumes an integer size in px, e.g. "20px";
                # strip() tolerates spaces around the value
                this_size = int(sub_attrib[1].strip()[:-2])
                if this_size > self.max_span:
                    self.max_text = []  # 'reset' the list for new maximum font-size
                    self.max_span = this_size
                    self.recording = True
                elif this_size == self.max_span:  # For equally large span font-size tags
                    self.recording = True

    def handle_endtag(self, tag):
        """
        Turns off recording flag
        """
        if tag == 'span' and self.recording:
            self.recording = False

    def handle_data(self, data):
        if self.recording:
            self.max_text.append(data)

I'm not very good with HTML (as was evident from my previous answer), so you may need more control flow to handle edge cases.
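One such edge case is the size parsing itself: the slice-and-int conversion in the parser above assumes an integer value immediately followed by px. Here is a hedged sketch of a more tolerant helper. The names `FONT_SIZE_PX` and `parse_font_size_px` are made up for illustration, and it deliberately handles px only (em, pt and % are ignored) and returns the first font-size declaration it finds.

```python
import re

# Matches a px font-size declaration at the start of the style string or
# right after a ';', e.g. "font-size: 12.5px". Other units are ignored.
FONT_SIZE_PX = re.compile(
    r"(?:^|;)\s*font-size\s*:\s*(\d+(?:\.\d+)?)px\b",
    re.IGNORECASE,
)


def parse_font_size_px(style):
    """Return the font size in px as a float, or None if none is found."""
    if not style:
        return None
    match = FONT_SIZE_PX.search(style)
    return float(match.group(1)) if match else None
```

Inside handle_starttag you could call this on the style value and skip the span when it returns None, instead of splitting on ";" and ":" by hand.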

Usage:

parser = MyHTMLParser(0) 
parser.feed(""" 
<!DOCTYPE html> 
<html> 
<body> 

    <h1>My First Heading</h1> 

    <p>My first paragraph.</p> 

    <span style="font-size:10px;font-family:test">Not this one</span> 
    <span style="font-size:20px">Not this one either</span> 
    <span style="font-size:60px;font-family:hello">Yay!</span> 
    <span style="font-size:10px">Nope</span> 
    <span style="font-size:60px">Also this one</span> 

</body> 
</html> 
""") 

print(parser.max_text) #prints out ['Yay!', 'Also this one'] 

#to get individual entry 
list_of_text = parser.max_text 
first_maximum_text = list_of_text[0]
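Another edge case the flag-based parser does not cover is nested spans: the inner closing tag turns recording off, so any text after it in the outer span is lost. The following is only a sketch of one possible variant, not part of the original answer. `NestedSpanParser` is a hypothetical name, and it keeps a stack with one font size per open span and decides at data time whether to record.

```python
from html.parser import HTMLParser


class NestedSpanParser(HTMLParser):
    """Hypothetical variant: tracks the font size in effect for each open span."""

    def __init__(self):
        super().__init__()
        self.size_stack = []  # one entry per currently open <span>
        self.max_size = 0
        self.max_text = []

    @staticmethod
    def _font_size(attrs):
        """Return the integer px font-size from a style attribute, or None."""
        for name, value in attrs:
            if name != 'style' or not value:
                continue
            for declaration in value.split(';'):
                prop, _, val = declaration.partition(':')
                val = val.strip()
                if prop.strip() == 'font-size' and val.endswith('px'):
                    try:
                        return int(val[:-2])
                    except ValueError:
                        return None
        return None

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        size = self._font_size(attrs)
        if size is None:
            # No usable size on this span: inherit from the enclosing span
            size = self.size_stack[-1] if self.size_stack else None
        self.size_stack.append(size)

    def handle_endtag(self, tag):
        if tag == 'span' and self.size_stack:
            self.size_stack.pop()

    def handle_data(self, data):
        size = self.size_stack[-1] if self.size_stack else None
        if size is None or size < self.max_size:
            return
        if size > self.max_size:
            self.max_size = size
            self.max_text = []  # new maximum: discard smaller entries
        self.max_text.append(data)


nested = NestedSpanParser()
nested.feed('<span style="font-size:60px">Big '
            '<span style="font-size:10px">small</span> still big</span>')
print(nested.max_text)  # prints ['Big ', ' still big']
```

Note that this variant records whitespace inside the largest span as-is, so you may want to strip or join the pieces afterwards.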

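Edit: To go through all the HTML files in a directory (for example, the current directory), see below. This implementation finds the maximum across all of the HTML files. If you want to analyze each HTML file separately instead, create a new MyHTMLParser on each iteration and process its result before moving on to the next file.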

import os


def main():
    # One parser for all files, so the maximum is taken across the whole directory
    parser = MyHTMLParser(0)

    for filename in os.listdir("./"):
        if filename.endswith(".html"):
            with open(filename, 'r') as fd:
                parser.feed(fd.read())

    print(parser.max_text)


if __name__ == '__main__':
    main()
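For the per-file case, a rough sketch could look like the following. The function name `max_text_per_file` and its `make_parser` argument are assumptions made for this illustration: `make_parser` is any zero-argument callable returning a fresh parser that exposes feed(), max_span and max_text. The explicit UTF-8 decoding is also an assumption about your files.

```python
import os


def max_text_per_file(directory, make_parser):
    """Return {filename: (max_span, max_text)}, using a fresh parser per file.

    make_parser is a hypothetical zero-argument callable that returns a new
    parser object exposing feed(), max_span and max_text.
    """
    results = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".html"):
            continue
        parser = make_parser()  # new state for every file
        path = os.path.join(directory, filename)
        # Assumes UTF-8 input; undecodable bytes are replaced, not fatal
        with open(path, "r", encoding="utf-8", errors="replace") as fd:
            parser.feed(fd.read())
        results[filename] = (parser.max_span, parser.max_text)
    return results
```

With the parser from this answer you would call it as max_text_per_file("./", lambda: MyHTMLParser(0)) and then loop over the returned dictionary.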

Source

2017-01-23 23:46:33

Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/134080/discussion-on-answer-by-kj-phan-parsing-html-attribute-using-r-or-python).
