Because HTML is not a flat text format, processing it with flat-text tools such as grep, sed, or awk is inadvisable. If the format of the HTML changes even slightly (say, the span node gains another attribute, or a line break is inserted somewhere), anything built that way is liable to break.

It is more robust (if more laborious) to use something built to parse HTML. In this case, I would consider using Python, since its standard library ships with a (basic) HTML parser. It might look roughly like this:
```python
#!/usr/bin/python3

import html.parser
import re
import sys

# html.parser.HTMLParser provides the parsing functionality. It tokenizes
# the HTML into tags and what comes between them, and we handle them in the
# order they appear. With XML we would have nicer facilities, but HTML is not
# a very good format, so we're stuck with this.
class my_parser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = ''
        self.depth = 0

    # handle opening tags. Start counting, assembling content when a
    # span tag begins whose id is "wob_hm". A depth counter is maintained
    # largely to handle nested span tags, which is not strictly necessary
    # in your case (but will make this easier to adapt for other things and
    # is not more complicated to implement than a flag)
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    # handle end tags. Make sure the depth counter is only positive
    # as long as we're in the span tag we want
    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    # when data comes, assemble it in a string. Note that nested tags would
    # not be recorded by this if they existed. It would be more work to
    # implement that, and you don't need it for this.
    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

# open the file whose name is the first command line argument. Do so as
# binary to get bytes from f.read() instead of a string (which requires
# the data to be UTF-8-encoded)
with open(sys.argv[1], "rb") as f:
    # instantiate our parser
    p = my_parser()
    # then feed it the file. If the file is not UTF-8, it is necessary to
    # convert the file contents to UTF-8. I'm assuming latin1-encoded
    # data here; since the example looks German, "latin9" might also be
    # appropriate. Use the encoding in which your data is encoded.
    p.feed(f.read().decode("latin1"))
    # trim (in case of newlines/spaces around the data), remove % at the end,
    # then print
    print(re.compile('%$').sub('', p.data.strip()))
```
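To see the depth-counting approach in action without a file on disk, you can feed the parser an inline snippet instead; the sample HTML and the `SpanExtractor`/`span_id` names below are made up for demonstration, not taken from the question:

```python
import html.parser

# Same technique as above, parameterized on the span id and fed a string.
class SpanExtractor(html.parser.HTMLParser):
    def __init__(self, span_id):
        super().__init__()
        self.span_id = span_id
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            # attrs arrives as a list of (name, value) tuples
            if ('id', self.span_id) in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        # only collect text while inside the span we want
        if self.depth > 0:
            self.data += data

p = SpanExtractor('wob_hm')
p.feed('<div><span id="wob_hm">58%</span></div>')
print(p.data.strip().rstrip('%'))  # -> 58
```

Note that a slight markup change (extra attributes on the span, whitespace, a wrapped child tag) does not bother this, which is exactly the robustness the grep/sed approach lacks.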
Addendum: here is a backport to Python 2 that simply bulldozes right over the encoding issue. For this case that is arguably better, since the encoding does not matter for the data we want to extract, and you don't have to know the input file's encoding in advance. The changes are trivial, and it works exactly the same way:
```python
#!/usr/bin/python

from HTMLParser import HTMLParser
import re
import sys

class my_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

with open(sys.argv[1], "r") as f:
    p = my_parser()
    p.feed(f.read())
    print(re.compile('%$').sub('', p.data.strip()))
```
Would you be open to a solution that uses a proper HTML parser? This could be done with regular expressions, but you would be much better off learning to solve such problems with something like perl/python. – 2015-04-04 17:45:58
Obligatory [don't parse (x)html with regex](http://stackoverflow.com/a/1732454/7552) link. – 2015-04-04 17:57:14