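Scraping with BeautifulSoup and multiple paragraphs
I am trying to scrape a speech from a website using BeautifulSoup. However, I am running into problems because the speech is split into many separate paragraphs. I am very new to programming and can't work out how to deal with this. The page's HTML looks like this: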
<span class="displaytext">Thank you very much. Mr. Speaker, Vice President Cheney,
Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is
at war; our economy is in recession; and the civilized world faces unprecedented dangers.
Yet, the state of our Union has never been stronger.
<p>We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims,
begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and
rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps,
saved a people from starvation, and freed a country from brutal oppression.
<p>The American flag flies again over our Embassy in Kabul. Terrorists who once occupied
Afghanistan now occupy cells at Guantanamo Bay. And terrorist leaders who urged followers to
sacrifice their lives are running for their own.
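It continues like this for a while, with multiple paragraph tags. I am trying to extract all of the text inside the span.
I have tried a few different ways of getting the text, but none of them returned the text I want.
The first thing I tried was: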
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
print thespan.string
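This gave me:
Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.
That is only the part of the text up to the first paragraph tag. Then I tried: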
import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString
address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
for section in thespan:
    paragraph = section.findNext('p')
    if paragraph and paragraph.string:
        print '>', paragraph.string
    else:
        print '>', section.parent.next.next.strip()
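This gives me the text between the first paragraph tag and the second paragraph tag. So I am looking for a way to get the entire text, rather than just parts of it.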
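This does not work with the web page linked in the question (i.e. it only prints the first paragraph, not the whole speech). – ekhumoro
@ekhumoro Fixed. –
@ShawnChin Thank you so much! This works perfectly. – user1074057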
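For reference, the underlying idea is to collect every text node nested anywhere inside the span, rather than reading only its first string. With BeautifulSoup 3 as used in the question, something like `thespan.findAll(text=True)` should return those nodes. The sketch below shows the same idea in Python 3 using only the standard-library `html.parser` module instead of BeautifulSoup, so it runs without third-party packages. The names `SpanTextExtractor`, `span_text`, and `sample` are hypothetical, and the sample markup is a shortened stand-in for the real page, not the page itself.

```python
from html.parser import HTMLParser


class SpanTextExtractor(HTMLParser):
    """Collect every text node found inside <span class="displaytext">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while the parser is inside the target span
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return  # <p> and other tags do not change whether we are inside
        if self.depth > 0:
            self.depth += 1  # a nested span inside the target span
        elif ("class", "displaytext") in attrs:
            self.depth = 1   # entering the target span

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def span_text(html):
    """Return all text inside the displaytext span as one string."""
    parser = SpanTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse whitespace so line breaks in the markup do not matter.
    return " ".join(" ".join(parser.chunks).split())


# Shortened stand-in for the page: unclosed <p> tags, as in the question.
sample = """<span class="displaytext">Thank you very much.
<p>We last met in an hour of shock and suffering.
<p>The American flag flies again over our Embassy in Kabul.</span>"""

print(span_text(sample))
```

Because the fragments are joined with a space, a word split across inline tags would gain a stray space; that simplification is acceptable for paragraph-level text like this speech, but may need adjusting for other pages.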