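Extracting data from HTML files with BeautifulSoup and Python

I need to extract data from HTML files. The files in question are most likely automatically generated. I have uploaded the code of one of the files to Pastebin: http://pastebin.com/9Nj2Edfv. This is the link to the actual page: http://eur-lex.europa.eu/Notice.do?checktexts=checkbox&val=60504%3Acs&pos=1&page=1&lang=en&pgs=10&nbl=1&list=60504%3Acs%2C&hwords=&action=GO&visu=%23texte

The data I need to extract can be found under different headings.

This is what I have so far: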

# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup

# Raw string so the backslash in the Windows path is taken literally
ecj_data = open(r"data\ecj_1.html", 'r').read()

soup = BeautifulSoup(ecj_data)

celex = soup.find('h1')
auth_lang = soup('ul', limit=14)[13].li   # first <li> of the 14th <ul>
procedure = soup('ul', limit=20)[17].li   # first <li> of the 18th <ul>

print "Celex number:", celex.renderContents(),
print "Authentic language:", auth_lang
print "Type of procedure:", procedure

I store all the data locally, which is why the script opens the file ecj_1.html.

The Celex number and the authentic language more or less work.

celex returns

"Celex number: 
61977J0059"

auth_lang returns "Authentic language: <li>French</li>"

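I need just the contents of the h1 tag (without the line break).

[In addition, I need auth_lang to return just "French", without the <li> tags.] This is no longer a problem. I realized I could add ".text" to the end of "auth_lang".

procedure, on the other hand, returns this: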

Type of procedure: <li> 
    <strong>Type of procedure:</strong> 
    <br /> 
    Reference for a preliminary ruling 
    </li>

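This is quite wrong, since I only need it to return "Reference for a preliminary ruling".

Is there any way to achieve this?

Second edit: I replaced celex = soup.find('h1') with celex = soup('h1', limit=2)[0], and added .text when printing celex.

Source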

2012-03-20 A2D2

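The contents of each of the elements you found is a list; only the first two have length 1. procedure, however, is 5 elements long, and the entry you are after (in this case) is the one at index 4. I used splitlines() to get rid of the newlines.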

print "Celex number:", celex.contents[0].splitlines()[1] 
print "Authentic language:", auth_lang.contents[0].splitlines()[0] 
print "Type of procedure:", procedure.contents[4].splitlines()[1]

Output:

Celex number: 61977J0059 
Authentic language: French 
Type of procedure: Reference for a preliminary ruling
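
The index-based picks above depend on exactly where the line breaks fall inside each text node. The following is a minimal sketch of that idea using plain strings as stand-ins for the five children of the `<li>` element. The whitespace in these strings is an assumption based on the output printed in the question, and `first_text_line` is a hypothetical helper, not part of BeautifulSoup.

```python
# Stand-ins for the children of the <li> element, written as plain strings.
# The exact whitespace is an assumption based on the question's printed output.
procedure_contents = [
    "\n    ",                                  # whitespace before <strong>
    "<strong>Type of procedure:</strong>",     # the label tag
    "\n    ",                                  # whitespace before <br />
    "<br />",                                  # the line-break tag
    "\n    Reference for a preliminary ruling\n    ",  # the text we want
]


def first_text_line(text):
    """Return the first non-blank line of text, stripped of whitespace.

    Hypothetical helper: unlike a fixed splitlines() index, it does not
    depend on how many blank lines precede the wanted text.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return ""


# Index-based pick: line 0 is empty, line 1 holds the text
# (still carrying its leading indentation in this stand-in).
by_index = procedure_contents[4].splitlines()[1]

# Whitespace-tolerant pick of the same value.
by_helper = first_text_line(procedure_contents[4])

print(repr(by_index))
print(repr(by_helper))
```

If the generated pages vary in how much whitespace surrounds a value, a helper along these lines may be less fragile than hard-coded line indices, at the cost of one extra function.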

Source

2012-03-20 14:42:31 fraxel

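@fraxel: Thank you very much! It works like a charm. The idea is to somehow transfer the output of this file into a database. I believe that by showing me how to get rid of the newlines you have probably also solved a future problem, since they would likely have caused trouble later on. Thanks again! – A2D2 2012-03-20 14:45:44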
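
For the database step mentioned in the comment, one possible sketch uses Python's built-in sqlite3 module. The table name `ecj_cases`, its column names, and the `save_record` helper are all hypothetical, and the in-memory database is used only so the example is self-contained. The values are stripped before insertion so stray newlines do not end up in the stored rows.

```python
import sqlite3


def save_record(conn, celex, auth_lang, procedure_type):
    """Insert one record, stripping stray whitespace and newlines first.

    Hypothetical helper; the table and column names are assumptions.
    """
    conn.execute(
        "INSERT INTO ecj_cases (celex, auth_lang, procedure_type) "
        "VALUES (?, ?, ?)",
        (celex.strip(), auth_lang.strip(), procedure_type.strip()),
    )
    conn.commit()


# In-memory database for illustration; a real script would pass a file path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ecj_cases ("
    "celex TEXT PRIMARY KEY, auth_lang TEXT, procedure_type TEXT)"
)

# Values as they might come out of the parser, newlines included.
save_record(conn, "\n61977J0059", "French",
            " Reference for a preliminary ruling \n")

rows = conn.execute(
    "SELECT celex, auth_lang, procedure_type FROM ecj_cases"
).fetchall()
print(rows)
```

Passing the values as query parameters (the `?` placeholders) rather than formatting them into the SQL string avoids quoting problems if a scraped value happens to contain a quote character.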
