Scraping with BeautifulSoup and multiple paragraphs

I'm trying to scrape a speech from a website using BeautifulSoup. However, I'm running into problems because the speech is split into many different paragraphs. I'm very new to programming and can't work out how to handle this. The HTML of the page looks like this:

<span class="displaytext">Thank you very much. Mr. Speaker, Vice President Cheney, 
Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is  
at war; our economy is in recession; and the civilized world faces unprecedented dangers. 
Yet, the state of our Union has never been stronger. 
<p>We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims, 
begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and 
rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps, 
saved a people from starvation, and freed a country from brutal oppression. 
<p>The American flag flies again over our Embassy in Kabul. Terrorists who once occupied 
Afghanistan now occupy cells at Guantanamo Bay. And terrorist leaders who urged followers to 
sacrifice their lives are running for their own. 

It continues like this for a while, with multiple paragraph tags. I'm trying to extract all of the text within the span.

I've tried a few different ways of getting the text, but none of them gives me what I want.

The first thing I tried was:

import urllib2,sys 
from BeautifulSoup import BeautifulSoup, NavigableString 

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW' 
html = urllib2.urlopen(address).read() 

soup = BeautifulSoup(html) 
thespan = soup.find('span', attrs={'class': 'displaytext'}) 
print thespan.string 

Which gives me:

Thank you very much. Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.

That is the part of the text up to the first paragraph tag. I then tried:

import urllib2,sys 
from BeautifulSoup import BeautifulSoup, NavigableString 

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW' 
html = urllib2.urlopen(address).read() 

soup = BeautifulSoup(html) 
thespan = soup.find('span', attrs={'class': 'displaytext'}) 
for section in thespan: 
    paragraph = section.findNext('p') 
    if paragraph and paragraph.string:
        print '>', paragraph.string
    else:
        print '>', section.parent.next.next.strip()

This gives me the text between the first paragraph tag and the second paragraph tag. So what I'm looking for is a way to get the whole text, not just pieces of it.

Answers

import urllib2,sys 
from BeautifulSoup import BeautifulSoup 

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW' 
soup = BeautifulSoup(urllib2.urlopen(address).read()) 

span = soup.find("span", {"class":"displaytext"}) # span.string gives you the first bit 
paras = [x.contents[0] for x in span.findAllNext("p")] # this gives you the rest 
# use .contents[0] instead of .string to deal with last para that's not well formed 

print "%s\n\n%s" % (span.string, "\n\n".join(paras)) 

As pointed out in the comments, the above does not work so well if the <p> tags contain further nested tags. That can be handled using:

paras = ["".join(x.findAll(text=True)) for x in span.findAllNext("p")] 

However, that still does not work well with the last <p>, which has no closing tag. A hacky workaround would be to treat that one differently. For example:

import urllib2,sys 
from BeautifulSoup import BeautifulSoup 

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW' 
soup = BeautifulSoup(urllib2.urlopen(address).read()) 
span = soup.find("span", {"class":"displaytext"}) 
paras = [x for x in span.findAllNext("p")] 

start = span.string 
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]]) 
last = paras[-1].contents[0] 
print "%s\n\n%s\n\n%s" % (start, middle, last) 

This doesn't work with the page linked to in the question (i.e. it only prints the first paragraph, not the whole speech). – ekhumoro


@ekhumoro Fixed –


@ShawnChin Thanks so much! That works perfectly. – user1074057
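
For anyone with the newer bs4 package installed (not what the question or the answer above use), the whole extraction can be sketched with get_text(). This is only a minimal sketch: it assumes bs4 is available and that the chosen parser keeps the unclosed <p> tags nested inside the span.

import urllib2
from bs4 import BeautifulSoup  # newer package; an assumption, not used elsewhere on this page

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read(), 'html.parser')
span = soup.find('span', {'class': 'displaytext'})

# get_text() concatenates every text node below the span, inserting the
# separator between them, so the paragraph breaks no longer matter.
print span.get_text(separator='\n\n')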


Here's how it could be done with lxml:

import lxml.html as lh 

tree = lh.parse('http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW') 

text = tree.xpath("//span[@class='displaytext']")[0].text_content() 
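
As a small illustration of what text_content() does (a hypothetical fragment, not part of the answer above): it returns the concatenated text of the element and all of its descendants, with the markup removed.

import lxml.html as lh

# Hypothetical fragment, for illustration only.
demo = lh.fromstring('<span class="displaytext">We last met in an hour of '
                     '<b>shock</b> and <i>suffering</i>.</span>')
print demo.text_content()
# -> We last met in an hour of shock and suffering.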

Also, the answer to this question describes how to achieve the same thing with BeautifulSoup: BeautifulSoup - easy way to obtain HTML-free contents

The helper function from the accepted answer:

def textOf(soup): 
    return u''.join(soup.findAll(text=True)) 
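
To illustrate what the helper does (using a hypothetical fragment, and reusing the textOf() defined just above): it joins every text node beneath the element it is given, so any nested tags simply disappear from the output.

from BeautifulSoup import BeautifulSoup

demo = BeautifulSoup('<p>Terrorists who once occupied <b>Afghanistan</b> now '
                     'occupy cells at <i>Guantanamo Bay</i>.</p>')
# textOf() as defined above; the nested <b> and <i> tags are flattened away.
print textOf(demo.p)
# -> Terrorists who once occupied Afghanistan now occupy cells at Guantanamo Bay.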

Maybe let the OP know why lxml is a good alternative to BeautifulSoup :) –


Neither of these suggestions will produce the output requested in the question. – ekhumoro


@ekhumoro, can you explain how my solution fails to produce the desired output? The OP wants to "...extract all the text within the span", which is exactly what the code above does. – Acorn


You should try:

soup.span.renderContents() 

'.renderContents()' won't do what the OP wants. It doesn't strip out the paragraph tags. – Acorn
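
For what it's worth, a quick illustration of that point (a hypothetical fragment, not taken from the answer): renderContents() returns the tag's inner HTML as a string, markup included, so any tags inside it would still have to be stripped separately.

from BeautifulSoup import BeautifulSoup

demo = BeautifulSoup('<span class="displaytext">Yet, the state of our Union '
                     '<b>has never been stronger</b>.</span>')
# renderContents() keeps the nested <b> tag in its output.
print demo.span.renderContents()
# -> Yet, the state of our Union <b>has never been stronger</b>.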