2017-01-23

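Parsing HTML attributes using R or Python

I have a bunch of HTML files containing <span> tags with style="font-size:...px". I want to automatically find the <span> with the largest font size and get the text between its tags. Preferably in R or Python, but other approaches are welcome too. Any ideas?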

Answers

1

For Python 3, you can use html.parser. (For Python 2.x, look at the HTMLParser module instead.)
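html.parser is event-driven: you subclass HTMLParser and it calls your handler methods as it meets start tags, text, and end tags. A minimal sketch of that flow, where `EventLogger` is a hypothetical name used only for illustration:

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records every parser callback so the order of events is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed('<span style="font-size:60px">Yay!</span>')
for event in logger.events:
    print(event)
# ('start', 'span', [('style', 'font-size:60px')])
# ('data', 'Yay!')
# ('end', 'span')
```

Finding the largest span then comes down to inspecting the style attribute in the start-tag handler and collecting text in the data handler.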

An example:

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):

    def __init__(self, min_span):
        super().__init__()

        # Largest font size seen so far; pass a minimum size, or just use 0
        self.max_span = min_span
        # Text of every span that uses the largest font size
        self.max_text = []

        # Set while we are inside a span whose font size equals the maximum,
        # so that handle_data knows to collect its text
        self.recording = False

    def handle_starttag(self, tag, attrs):
        # Ignore all other tags
        if tag != 'span':
            return

        for name, value in attrs:
            # Skip other attributes, and a style attribute with no value
            if name != 'style' or not value:
                continue

            for css_style in value.split(";"):
                prop, _, size = css_style.partition(":")
                if prop.strip().lower() != 'font-size':
                    continue

                # Only px sizes are compared; skip em, %, and malformed values
                size = size.strip().lower()
                if not size.endswith('px'):
                    continue
                try:
                    this_size = float(size[:-2])
                except ValueError:
                    continue

                if this_size > self.max_span:
                    # New maximum: reset the list and start collecting again
                    self.max_text = []
                    self.max_span = this_size
                    self.recording = True
                elif this_size == self.max_span:
                    # Another span with an equally large font size
                    self.recording = True

    def handle_endtag(self, tag):
        """Turn off the recording flag when a span closes."""
        if tag == 'span' and self.recording:
            self.recording = False

    def handle_data(self, data):
        if self.recording:
            self.max_text.append(data)

I'm not very good with HTML (as my previous answer makes obvious), so you may need more control flow to handle edge cases.
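One edge case worth handling is the style value itself: extra whitespace, uppercase property names, decimal sizes, or units other than px can all trip up a parse that simply slices off the last two characters. Here is a minimal sketch of a more tolerant extraction, assuming only px sizes should be compared. `font_size_px` is a hypothetical helper name, not part of html.parser, and it returns the first font-size declaration it finds:

```python
import re

# Matches a font-size declaration such as "font-size: 12.5PX" in a style string.
# The lookbehind stops it from matching inside a longer property name.
_FONT_SIZE_RE = re.compile(
    r"(?<![\w-])font-size\s*:\s*(\d+(?:\.\d+)?)px\b",
    re.IGNORECASE,
)


def font_size_px(style):
    """Return the font size in px as a float, or None if there is none."""
    if not style:
        return None
    match = _FONT_SIZE_RE.search(style)
    return float(match.group(1)) if match else None


print(font_size_px("font-size:60px;font-family:hello"))  # 60.0
print(font_size_px("color:red; Font-Size : 12.5px"))     # 12.5
print(font_size_px("font-size:2em"))                     # None
```

A start-tag handler could call a helper like this on the style attribute and skip the span whenever it returns None.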

Usage:

parser = MyHTMLParser(0) 
parser.feed(""" 
<!DOCTYPE html> 
<html> 
<body> 

    <h1>My First Heading</h1> 

    <p>My first paragraph.</p> 

    <span style="font-size:10px;font-family:test">Not this one</span> 
    <span style="font-size:20px">Not this one either</span> 
    <span style="font-size:60px;font-family:hello">Yay!</span> 
    <span style="font-size:10px">Nope</span> 
    <span style="font-size:60px">Also this one</span> 

</body> 
</html> 
""") 

print(parser.max_text) #prints out ['Yay!', 'Also this one'] 

#to get individual entry 
list_of_text = parser.max_text 
first_maximum_text = list_of_text[0] 

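Edit: To go through all the HTML files in a directory (here, the current directory). This implementation finds the maximum across all of the HTML files. If you want to analyse each HTML file separately, create a new MyHTMLParser on each iteration and process its result before moving on to the next file.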

import os


def main():
    # One parser for every file, so the maximum is taken across all of them
    parser = MyHTMLParser(0)

    for file in os.listdir("./"):
        if file.endswith(".html"):
            with open(file, 'r') as fd:
                parser.feed(fd.read())

    print(parser.max_text)


if __name__ == '__main__':
    main()
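To analyse each file on its own instead, a fresh parser can be created per file. A minimal sketch under a few assumptions: `LargestSpanParser` is a compact stand-in for the MyHTMLParser class above so that the block runs by itself, while `largest_spans_per_file`, the `*.html` pattern, and the UTF-8 encoding are choices made for illustration.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# Integer px sizes only, which is what the question's files are assumed to use
FONT_SIZE_RE = re.compile(r"font-size\s*:\s*(\d+)px", re.IGNORECASE)


class LargestSpanParser(HTMLParser):
    """Compact stand-in for MyHTMLParser: collects text of the largest spans."""

    def __init__(self):
        super().__init__()
        self.max_size = 0
        self.max_text = []
        self.recording = False

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        match = FONT_SIZE_RE.search(dict(attrs).get("style") or "")
        if not match:
            return
        size = int(match.group(1))
        if size > self.max_size:
            # New maximum: forget text collected for smaller sizes
            self.max_size, self.max_text = size, []
        # Record only while inside a span at the current maximum size
        self.recording = size == self.max_size

    def handle_endtag(self, tag):
        if tag == "span":
            self.recording = False

    def handle_data(self, data):
        if self.recording:
            self.max_text.append(data)


def largest_spans_per_file(directory="."):
    """Map each .html file name to (largest font size, texts at that size)."""
    results = {}
    for path in sorted(Path(directory).glob("*.html")):
        parser = LargestSpanParser()  # fresh parser, so maxima are per file
        parser.feed(path.read_text(encoding="utf-8"))  # encoding is an assumption
        parser.close()
        results[path.name] = (parser.max_size, parser.max_text)
    return results
```

Calling `largest_spans_per_file(".")` returns a dictionary keyed by file name, so the result for each file can be processed separately.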

Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/134080/discussion-on-answer-by-kj-phan-parsing-html-attribute-using-r-or-python).