SGML parser in Python

I am completely new to Python. I have the following code:

import sgmllib
import string


class FoundTitle(Exception):
    """Raised to stop parsing once a title has been read."""


class ExtractTitle(sgmllib.SGMLParser):

    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.title = self.data = None

    def handle_data(self, data):
        # Collect text only while inside a <title> element.
        if self.data is not None:
            self.data.append(data)

    def start_title(self, attrs):
        self.data = []

    def end_title(self):
        self.title = string.join(self.data, "")
        raise FoundTitle  # abort parsing!

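It extracts the title element from SGML, but it only works for a single title. I know I have to override unknown_starttag and unknown_endtag to get all the titles, but I keep getting it wrong. Please help!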


2011-01-08 afg102

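What are you trying to do? Parse an HTML file? – virhilo 2011-01-08 09:09:02

I have a large text file with SGML in it, containing markup with a title ("new title") and a text body ("new text"). I would like my code to write this result to another file: new text – afg102 2011-01-08 09:14:13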

Use lxml instead of SGMLParser, like this:

>>> posts = """ 
... <post id='100'> <title> xxxx </title> <text> <p> yyyyy </p> </text> </post> 
... <post id='101'> <title> new title1 </title> <text> <p> new text1 </p> </text> </post> 
... <post id='102'> <title> new title2 </title> <text> <p> new text2 </p> </text> </post> 
... """ 
>>> from lxml import html 
>>> parsed = html.fromstring(posts) 
>>> new_file = html.Element('div') 
>>> for post in parsed: 
...  post_id = post.attrib['id'] 
...  post_text = post.find('text').text_content() 
...  new_post = html.Element('post', id=post_id) 
...  new_post.text = post_text 
...  new_file.append(new_post) 
... 
>>> html.tostring(new_file) 
'<div><post id="100"> yyyyy </post><post id="101"> new text1 </post><post id="102"> new text2 </post></div>' 
>>>
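
If lxml is not installed, roughly the same extraction can be sketched with the standard library's html.parser (available in Python 3). This is only an illustration, not the approach from the answer above: the names PostTextExtractor and extract_posts are made up for the example, and it assumes every post looks like the sample, with the body wrapped in a <text> element.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collect the text inside each <text> element, paired with the post id."""

    def __init__(self):
        super().__init__()
        self.posts = []        # list of (post_id, text) pairs
        self._post_id = None   # id of the <post> currently being read
        self._chunks = None    # text pieces while inside <text>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "post":
            self._post_id = dict(attrs).get("id")
        elif tag == "text":
            self._chunks = []

    def handle_data(self, data):
        # Ignore text that is not inside a <text> element (e.g. titles).
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "text" and self._chunks is not None:
            self.posts.append((self._post_id, "".join(self._chunks).strip()))
            self._chunks = None


def extract_posts(markup):
    """Return a list of (post_id, text) pairs found in the markup string."""
    parser = PostTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.posts


sample = """
<post id='100'> <title> xxxx </title> <text> <p> yyyyy </p> </text> </post>
<post id='101'> <title> new title1 </title> <text> <p> new text1 </p> </text> </post>
"""
posts = extract_posts(sample)
for post_id, text in posts:
    print(post_id, text)
```

To work from a file, read its whole contents into a string, pass that string to extract_posts, and write the resulting pairs to the output file in whatever layout you need.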


2011-01-08 09:35:25 virhilo

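Thanks for your reply. I am trying to extract from a file, so I did: filexy = open(fileurl) and posts = filexy.read(), then your code. However, for some reason it only shows the same text (i.e. it does not loop over all the tags). Do you have any idea why? Thanks – afg102 2011-01-08 10:17:14

Could you post some example of the file? – virhilo 2011-01-08 10:20:40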

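Beautiful Soup is one way you could parse this nicely (and it is the way I would always do it myself, unless there were some very good reason not to). It is simpler and more readable than using SGMLParser.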

>>> from BeautifulSoup import BeautifulSoup 
>>> soup = BeautifulSoup('''<post id='100'> <title> new title </title> <text> <p> new text </p> </text> </post>''') 
>>> soup('post') # soup.findAll('post') is equivalent 
[<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post>] 
>>> for post in soup('post'): 
...  print post.findChild('text') 
... 
<text> <p> new text </p> </text>

Once you have it at this stage, you can do all sorts of things with it, depending on what you need.

>>> post = soup.find('post') 
>>> post 
<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post> 
>>> post_text = post.findChild('text') 
>>> post_text 
<text> <p> new text </p> </text>

You might want to strip the HTML.

>>> post_text.text 
u'new text'

Or perhaps look at the contents...

>>> post_text.renderContents() 
' <p> new text </p> '
>>> post_text.contents 
[u' ', <p> new text </p>, u' ']

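There are all sorts of things you might want to do. If you are more specific, and especially if you provide real data, it will help.

When it comes to manipulating the tree, you can do that too.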

>>> post 
<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post> 
>>> post.title # Just as good as post.findChild('title') 
<title> new title </title> 
>>> post.title.extract() # Throws it out of the tree and returns it but we have no need for it 
<title> new title </title> 
>>> post # title is gone! 
<post id="100"> <text> <p> new text </p> </text> </post> 
>>> post.findChild('text').replaceWithChildren() # Thrown away the <text> wrapping 
>>> post 
<post id="100"> <p> new text </p> </post>
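
For comparison, the same two tree edits (drop the title, unwrap the <text> element) can be sketched with the standard library's xml.etree.ElementTree. This is a different tool from Beautiful Soup and, unlike it, requires well-formed XML, so it only applies if your data is that clean. The helper name strip_title_and_unwrap_text is made up for this example.

```python
import xml.etree.ElementTree as ET


def strip_title_and_unwrap_text(post):
    """Remove <title> and replace <text> with its children, in place."""
    title = post.find("title")
    if title is not None:
        post.remove(title)

    text = post.find("text")
    if text is not None:
        # Remember where <text> sat so its children take its place.
        index = list(post).index(text)
        post.remove(text)
        for offset, child in enumerate(list(text)):
            post.insert(index + offset, child)
    return post


post = ET.fromstring(
    "<post id='100'><title>new title</title><text><p>new text</p></text></post>"
)
strip_title_and_unwrap_text(post)
print(ET.tostring(post, encoding="unicode"))
```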

So, in the end, you would have something like this:

>>> from BeautifulSoup import BeautifulSoup 
>>> soup = BeautifulSoup(''' 
... <post id='100'> <title> new title 100 </title> <text> <p> new text 100 </p> </text> </post> 
... <post id='101'> <title> new title 101 </title> <text> <p> new text 101 </p> </text> </post> 
... <post id='102'> <title> new title 102 </title> <text> <p> new text 102 </p> </text> </post> 
... ''') 
>>> for post in soup('post'): 
...  post.title.extract() 
...  post.findChild('text').replaceWithChildren() 
... 
<title> new title 100 </title> 
<title> new title 101 </title> 
<title> new title 102 </title> 
>>> soup 

<post id="100"> <p> new text 100 </p> </post> 
<post id="101"> <p> new text 101 </p> </post> 
<post id="102"> <p> new text 102 </p> </post>


2011-01-08 09:37:30

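Your code resets the "title" attribute every time end_title() is called. The title you end up with is therefore the last title in the document.

What you need to do is store a list of all the titles you find. In the code below I also reset data to None (so that you do not collect text data outside of title elements), and I use "".join instead of string.join, because the latter is considered outdated.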

import sgmllib


class ExtractTitle(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.titles = []
        self.data = None

    def handle_data(self, data):
        # Collect text only while inside a <title> element.
        if self.data is not None:
            self.data.append(data)

    def start_title(self, attrs):
        self.data = []

    def end_title(self):
        # Store the finished title, then stop collecting until the next one.
        self.titles.append("".join(self.data))
        self.data = None

Here it is in use:

>>> parser = ExtractTitle() 
>>> parser.feed("<doc><rec><title>Spam and Eggs</title></rec>" + 
...    "<rec><title>Return of Spam and Eggs</title></rec></doc>") 
>>> parser.close() 
>>> parser.titles 
['Spam and Eggs', 'Return of Spam and Eggs'] 
>>>
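
Note that sgmllib only exists in Python 2; it was removed in Python 3. A minimal sketch of the same collect-every-title idea on Python 3, assuming the standard library's html.parser is an acceptable substitute for your data, could look like this (TitleCollector is a name made up for the example):

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <title> element in the input."""

    def __init__(self):
        super().__init__()
        self.titles = []   # one entry per <title> found
        self._data = None  # text pieces while inside <title>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._data = []

    def handle_data(self, data):
        if self._data is not None:
            self._data.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._data is not None:
            self.titles.append("".join(self._data))
            self._data = None  # ignore text until the next <title>


parser = TitleCollector()
parser.feed("<doc><rec><title>Spam and Eggs</title></rec>"
            "<rec><title>Return of Spam and Eggs</title></rec></doc>")
parser.close()
print(parser.titles)
```

The structure mirrors the sgmllib version: a generic start-tag and end-tag handler that checks the tag name takes the place of start_title and end_title.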


2011-01-08 13:48:16
