
grep/sed/awk: extract a value from HTML code

I want to extract a value from an HTML string like this:

<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind: 

As the result, I only need the value: "53"

How can this be done with Linux command-line tools like grep, awk or sed? I want to use it on a Raspberry Pi...

I tried this, but it does not work:

root@raspberrypi:/home/pi# echo "<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:" >> test.txt
root@raspberrypi:/home/pi# grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)' test.txt
root@raspberrypi:/home/pi#

Would you be open to a solution that uses a proper HTML parser? This can be done with regular expressions, but you would be much better served by learning to use something like perl/python for problems like this. – 2015-04-04 17:45:58


Obligatory [do not parse (x)html with regex](http://stackoverflow.com/a/1732454/7552) link. – 2015-04-04 17:57:14

Answer


Because HTML is not a flat text format, handling it with flat-text tools such as grep, sed and awk is inadvisable. If the format of the HTML changes even slightly (for example: if the span node gets another attribute, or a line break is inserted somewhere), anything built this way is liable to break.
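For example, the grep pattern from the question relies on the exact byte sequence <span id="wob_hm"> appearing in the page; a hypothetical variation of the same markup in which the tag gains one more attribute is enough to make it stop matching:

# hypothetical variant of the markup; the literal <span id="wob_hm"> is gone,
# so the pattern from the question finds nothing and grep exits with status 1
echo '<div>Luftfeuchte: <span class="wob" id="wob_hm">53%</span></div>' | \
    grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)'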

It is more robust (if more laborious) to use something that is built to parse HTML. In this case, I would consider using Python, because it has a (rudimentary) HTML parser in its standard library. It could look roughly like this:

#!/usr/bin/python3

import html.parser
import re
import sys

# html.parser.HTMLParser provides the parsing functionality. It tokenizes
# the HTML into tags and what comes between them, and we handle them in the
# order they appear. With XML we would have nicer facilities, but HTML is not
# a very good format, so we're stuck with this.
class my_parser(html.parser.HTMLParser):
    def __init__(self):
        super(my_parser, self).__init__()
        self.data = ''
        self.depth = 0

    # handle opening tags. Start counting, assembling content when a
    # span tag begins whose id is "wob_hm". A depth counter is maintained
    # largely to handle nested span tags, which is not strictly necessary
    # in your case (but will make this easier to adapt for other things and
    # is not more complicated to implement than a flag)
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 1
            elif self.depth > 0:
                self.depth += 1

    # handle end tags. Make sure the depth counter is only positive
    # as long as we're in the span tag we want
    def handle_endtag(self, tag):
        if tag == 'span' and self.depth > 0:
            self.depth -= 1

    # when data comes, assemble it in a string. Note that nested tags would
    # not be recorded by this if they existed. It would be more work to
    # implement that, and you don't need it for this.
    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

# open the file whose name is the first command line argument. Do so as
# binary to get bytes from f.read() instead of a string (which requires
# the data to be UTF-8-encoded)
with open(sys.argv[1], "rb") as f:
    # instantiate our parser
    p = my_parser()

    # then feed it the file. If the file is not UTF-8, it is necessary to
    # convert the file contents to UTF-8. I'm assuming latin1-encoded
    # data here; since the example looks German, "latin9" might also be
    # appropriate. Use the encoding in which your data is encoded.
    p.feed(f.read().decode("latin1"))

    # trim (in case of newlines/spaces around the data), remove % at the end,
    # then print
    print(re.compile('%$').sub('', p.data.strip()))
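Saved as, say, wob_hm.py (the file name is just a placeholder) and pointed at the saved page, it would be invoked roughly like this:

root@raspberrypi:/home/pi# python3 wob_hm.py test.html
53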

Addendum: here is a backport to Python 2 that just bulldozes right over the encoding problems. For this case that is arguably better, because the encoding does not matter for the data we want to extract, and you don't have to know the input file's encoding in advance. The changes are trivial, and it works exactly the same way:

#!/usr/bin/python

from HTMLParser import HTMLParser
import re
import sys

class my_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 1
            elif self.depth > 0:
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

with open(sys.argv[1], "r") as f:
    p = my_parser()
    p.feed(f.read())
    print(re.compile('%$').sub('', p.data.strip()))
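As an aside, the grep attempt from the question most likely fails only because of shell quoting, not because of the pattern: the inner double quotes in the echo argument are eaten by the shell, so test.txt ends up containing <span id=wob_hm> without quotes and the pattern cannot match. Quoting the test string in single quotes keeps them intact:

root@raspberrypi:/home/pi# echo '<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:' > test.txt
root@raspberrypi:/home/pi# grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)' test.txt
53

For a complete, saved Google page the parser approach above is still the more robust choice.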

Thx for the answer, but trying this I get: root@raspberrypi:/home/pi/grep google weather# ./test.py test.html Traceback (most recent call last): File "./test.py", line 46, in <module> p.feed(f.read()) File "/usr/lib/python3.2/codecs.py", line 300, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 16022: invalid start byte – fhammer 2015-04-05 01:04:28


iso8859-encoded data, eh? See the edit. It's a small change that converts the file contents before they are passed to html.parser.HTMLParser, which apparently insists on UTF-8 in Python 3. I may come back later and port this to Python 2, which I believe handles this more gracefully, but I need sleep before that. – Wintermute 2015-04-05 01:25:19


Eh, I went and did the Python 2 backport right away after all. It turned out to require almost no changes, and Python 2's 'HTMLParser' has the (for this case) nice property of not caring about encodings. Honestly, I'm a little annoyed that it was removed in Python 3 without a replacement. – Wintermute 2015-04-05 01:35:14