Because HTML is not a flat text format, processing it with flat-text tools such as grep, sed, or awk is inadvisable. If the format of the HTML changes even slightly (say, the span node gains another attribute, or a line break is inserted somewhere), anything built that way is liable to break.

It is more robust (if more laborious) to use something built to parse HTML. In this case, I would consider using Python, since its standard library ships with a (basic) HTML parser. It might look roughly like this:
```python
#!/usr/bin/python3

import html.parser
import re
import sys

# html.parser.HTMLParser provides the parsing functionality. It tokenizes
# the HTML into tags and what comes between them, and we handle them in the
# order they appear. With XML we would have nicer facilities, but HTML is not
# a very good format, so we're stuck with this.
class my_parser(html.parser.HTMLParser):
    def __init__(self):
        super().__init__()
        self.data = ''
        self.depth = 0

    # handle opening tags. Start counting, assembling content when a
    # span tag begins whose id is "wob_hm". A depth counter is maintained
    # largely to handle nested span tags, which is not strictly necessary
    # in your case (but will make this easier to adapt for other things and
    # is not more complicated to implement than a flag)
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    # handle end tags. Make sure the depth counter is only positive
    # as long as we're in the span tag we want
    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    # when data comes, assemble it in a string. Note that nested tags would
    # not be recorded by this if they existed. It would be more work to
    # implement that, and you don't need it for this.
    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

# open the file whose name is the first command line argument. Do so as
# binary to get bytes from f.read() instead of a string (which requires
# the data to be UTF-8-encoded)
with open(sys.argv[1], "rb") as f:
    # instantiate our parser
    p = my_parser()
    # then feed it the file. If the file is not UTF-8, it is necessary to
    # convert the file contents to UTF-8. I'm assuming latin1-encoded
    # data here; since the example looks German, "latin9" might also be
    # appropriate. Use the encoding in which your data is encoded.
    p.feed(f.read().decode("latin1"))
    # trim (in case of newlines/spaces around the data), remove % at the end,
    # then print
    print(re.compile('%$').sub('', p.data.strip()))
```
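To see the depth-counting approach in action without a file on disk, you can feed the parser an inline snippet instead; the sample HTML and the `SpanExtractor`/`span_id` names below are made up for demonstration, not taken from the question:

```python
import html.parser

# Same technique as above, parameterized on the span id and fed a string.
class SpanExtractor(html.parser.HTMLParser):
    def __init__(self, span_id):
        super().__init__()
        self.span_id = span_id
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            # attrs arrives as a list of (name, value) tuples
            if ('id', self.span_id) in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        # only collect text while inside the span we want
        if self.depth > 0:
            self.data += data

p = SpanExtractor('wob_hm')
p.feed('<div><span id="wob_hm">58%</span></div>')
print(p.data.strip().rstrip('%'))  # -> 58
```

Note that a slight markup change (extra attributes on the span, whitespace, a wrapped child tag) does not bother this, which is exactly the robustness the grep/sed approach lacks.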
Addendum: here is a backport to Python 2 that simply bulldozes right over the encoding issue. For this case that is arguably better, since the encoding does not matter for the data we want to extract, and you don't have to know the input file's encoding in advance. The changes are trivial, and it works exactly the same way:
```python
#!/usr/bin/python

from HTMLParser import HTMLParser
import re
import sys

class my_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

with open(sys.argv[1], "r") as f:
    p = my_parser()
    p.feed(f.read())
    print(re.compile('%$').sub('', p.data.strip()))
```
Would you be open to a solution that uses a proper HTML parser? This could be done with regular expressions, but you would be much better off learning to solve such problems with something like perl/python. – 2015-04-04 17:45:58
Obligatory [don't parse (x)html with regex](http://stackoverflow.com/a/1732454/7552) link. – 2015-04-04 17:57:14