How do I use Beautiful Soup to get plain text and URLs from an HTML document?

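I am using Python and regular expressions to find information in an HTML document and, contrary to what most people say, it works perfectly, even though things could go wrong. Still, I decided Beautiful Soup would be faster and easier, but I don't really know how to make it do what I did with regex, which was easy but messy. How do I use Beautiful Soup to get plain text and URLs from an HTML document?

I am using the HTML from this page: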

http://www.locationary.com/places/duplicates.jsp?inPID=1000000001

Edit:

Here is the HTML for the main place:

<tr> 
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td> 
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td> 
<td class="Large Bold" nowrap="nowrap" width="55">&nbsp;<input name="selectCheckBox" type="checkbox" checked="checked" disabled="disabled" />Yes 
</td> 
</tr>

And for the first closest alternative:

<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td> 
<td width="100%">&nbsp;54 Riverside Dr, New York, New York, United States</td> 
<td nowrap="nowrap" width="55">

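When my program fetches the page and I use Beautiful Soup to make it more readable, the HTML comes out a little different from Firefox's "View Source"... I don't know why.

These are my regular expressions: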

PlaceName = re.findall(r'"nowrap">(.*)&nbsp;</td>', main) 

PlaceAddress = re.findall(r'width="100%">(.*)</td>\n<td class="Large Bold"', main) 

cNames = re.findall(r'target="_blank">(.*)</a></td>\n<td width="100%">&nbsp;', main) 

cAddresses = re.findall(r'<td width="100%">&nbsp;(.*)</td>\n<td nowrap="nowrap" width="55">', main) 

cURLs = re.findall(r'<td class="" nowrap="nowrap"><a href="(.*)" target="_blank">', main)

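The first two are for the main place and its address. The rest are for the other places' information. After running these, I decided I only want the first 5 results for cNames, cAddresses, and cURLs, because I don't need 91 or however many there are.

I don't know how to find this kind of information with BS. All I can do with BS is find specific tags and do things with them. This HTML is a bit complicated because all the information I want is in tables, and the table tags are kind of a mess too...

How do you get that information and limit it to the first 5 results?

Thanks.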

2012-08-10 Marcus Johnson

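Please include the relevant parts of the HTML in your question so that it is useful to future readers. – 2012-08-10 13:44:23

There is no royal road to HTML parsing. That means you will have to spend some time learning a parser, and BeautifulSoup is one of the easier ones. You really can't cheat on this task with regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Really. – msw 2012-08-10 14:29:18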

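People say you can't parse HTML with regular expressions for good reasons, but here is one simple reason that applies to your regexes: they contain \n and &nbsp;, and those can and will vary randomly across the pages you are trying to parse. When that happens, your regex will not match and your code will stop working.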
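To make that failure mode concrete, here is a minimal sketch using only the standard `re` module. The pattern is the `PlaceAddress` regex from the question; the two input strings are assumptions made up for illustration (the same markup once with `\n` line endings and once with `\r\n`), not output captured from the real page.

```python
import re

# The asker's pattern for the main place's address, copied from the question.
ADDRESS_RE = r'width="100%">(.*)</td>\n<td class="Large Bold"'

# Illustrative input: two cells separated by a bare "\n".
html_lf = (
    '<td class="Large Bold" width="100%">80 Riverside Drive, New York, '
    'New York, United States</td>\n'
    '<td class="Large Bold" nowrap="nowrap" width="55">&nbsp;</td>'
)

# Hypothetical variant: identical markup, but with Windows-style line endings.
html_crlf = html_lf.replace('\n', '\r\n')

matches_lf = re.findall(ADDRESS_RE, html_lf)
matches_crlf = re.findall(ADDRESS_RE, html_crlf)

print(matches_lf)    # the address is captured
print(matches_crlf)  # empty list: "</td>" is now followed by "\r", not "\n"
```

Nothing a browser would render has changed between the two inputs, yet the second one silently returns no results. An HTML parser treats both inputs as the same pair of table cells.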

However, the task you are trying to do is really simple:

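# Python 2 with BeautifulSoup 3 (the old `BeautifulSoup` package)
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('this-stackoverflow-page.html'))

# Calling soup('a') is shorthand for soup.findAll('a')
for anchor in soup('a'):
    print anchor.contents, anchor.get('href')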

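This gets every anchor tag, no matter how deeply it is nested in the page's structure. Here is an excerpt of the output from that three-line script: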

[u'Stack Exchange'] http://stackexchange.com 
[u'msw'] /users/282912/msw 
[u'faq'] /faq 
[u'Stack Overflow'] /
[u'Questions'] /questions 
[u'How to use Beautiful Soup to get plaintext and URLs from an HTML document?'] /questions/11902974/how-to-use-beautiful-soup-to-get-plaintext-and-urls-from-an-html-document 
[u'http://www.locationary.com/places/duplicates.jsp?inPID=1000000001'] http://www.locationary.com/places/duplicates.jsp?inPID=1000000001 
[u'python'] /questions/tagged/python 
[u'beautifulsoup'] /questions/tagged/beautifulsoup 
[u'Marcus Johnson'] /users/1587751/marcus-johnson

It is hard to imagine less code doing that much work for you.
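For the specific fields asked about in the question, the following is a rough sketch rather than a tested solution for the live page. It assumes Python 3 and the newer `bs4` package (Beautiful Soup 4) instead of the BeautifulSoup 3 import used above. The sample markup is the two fragments from the question; the wrapping `<table>` and `<tr>` tags and the closing of the last, truncated cell are assumptions added to make the fragment well-formed. The helper name `cell_text` is made up for this example.

```python
from bs4 import BeautifulSoup

# Fragments from the question, wrapped in a table (an assumption).
html = """
<table>
<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
<td class="Large Bold" nowrap="nowrap" width="55">&nbsp;<input name="selectCheckBox" type="checkbox" checked="checked" disabled="disabled" />Yes</td>
</tr>
<tr>
<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%">&nbsp;54 Riverside Dr, New York, New York, United States</td>
<td nowrap="nowrap" width="55">&nbsp;</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_text(cell):
    """Return a cell's text with &nbsp; (U+00A0) and outer whitespace removed."""
    return cell.get_text().replace("\xa0", " ").strip()


# Main place: the row containing the first cell whose class list includes "Large".
main_cells = soup.find("td", class_="Large").find_parent("tr").find_all("td")
place_name = cell_text(main_cells[0])
place_address = cell_text(main_cells[1])

# Alternatives: rows whose link opens in a new tab; limit=5 stops after five hits.
candidates = []
for anchor in soup.find_all("a", target="_blank", limit=5):
    cells = anchor.find_parent("tr").find_all("td")
    candidates.append((cell_text(cells[0]), cell_text(cells[1]), anchor["href"]))

print(place_name, "|", place_address)
for name, address, url in candidates:
    print(name, "|", address, "|", url)
```

Because the lookups follow tags and attributes rather than literal text between them, changes in line endings or spacing between cells do not affect the result. On the real page, other links may also carry `target="_blank"`, so the `find_all` filter would likely need to be narrowed to the results table.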

2012-08-10 15:05:03 msw
