Web scraping data using Python?

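I have just started learning web scraping with Python, and I have already run into some problems.

My goal is to scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon).

The problem: I am unable to extract all of the species names.

This is what I have so far: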

import urllib2 
from bs4 import BeautifulSoup 

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna' 
page = urllib2.urlopen(fish_url) 

soup = BeautifulSoup(html_doc) 

spans = soup.find_all(

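From here, I am not sure how to go about extracting the species names. I was thinking of using a regex (i.e. soup.find_all("a", text=re.compile("\d+\s+\d+"))) to capture the text inside the tags...

Any input would be much appreciated!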

2012-03-05 user1248092

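Looking at the web page, I'm not sure exactly what information you want to extract. However, note that you can easily get the text inside a tag using the text attribute: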

>>> from bs4 import BeautifulSoup 
>>> html = '<a>some text</a>' 
>>> soup = BeautifulSoup(html) 
>>> [tag.text for tag in soup.find_all('a')] 
[u'some text']

2012-03-05 07:25:47 jcollado

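If you want a long-term solution, try scrapy. It is quite simple and does a lot of the work for you. It is very customizable and extensible. You would extract all the URLs you need using xpath, which is more pleasant and reliable. And if you need it, scrapy still lets you use re.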
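
As a rough illustration of the XPath style mentioned above (this is not scrapy itself): the standard library's xml.etree.ElementTree understands a limited XPath subset, which is enough to show the idea on a small, well-formed fragment. The sample markup and its href values are made up for the example and are not the real FishBase page. Real pages are rarely well-formed XML, which is where scrapy's selectors or a library such as lxml would come in.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a results table; the real
# FishBase markup is not reproduced here.
SAMPLE = """
<table>
  <tr><td>Tuna</td><td><a href="/summary/1"><i>Thunnus thynnus</i></a></td></tr>
  <tr><td>Skipjack tuna</td><td><a href="/summary/2"><i>Katsuwonus pelamis</i></a></td></tr>
</table>
"""

root = ET.fromstring(SAMPLE)

# ".//a" selects every <a> element at any depth below the root.
links = [a.get("href") for a in root.findall(".//a")]

# ".//td/a/i" narrows the match to <i> inside <a> inside <td>.
names = [i.text for i in root.findall(".//td/a/i")]

print(links)
print(names)
```

The path expression states where the data lives in the document structure, so there is no need to pattern-match on raw text.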

2012-03-05 07:56:21 warvariuc

You might want to take advantage of the fact that all the scientific names (and only the scientific names) are in <i/> tags:

scientific_names = [it.text for it in soup.table.find_all('i')]

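Using BS and using regular expressions are two different approaches to parsing a web page. The former exists so that you don't have to bother much with the latter.

You should read up on what BS actually does; it seems you are underestimating its utility.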
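
To make the difference concrete, here is a small sketch comparing a naive regex with the parser on the same input. It assumes Python 3 with bs4 installed, and the markup is a made-up sample (one tag carries an attribute, another wraps a line break), not the real page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: one <i> tag has an attribute, another spans two lines.
SAMPLE = """
<table>
  <tr><td><i>Thunnus albacares</i></td></tr>
  <tr><td><i class="sci">Thunnus obesus</i></td></tr>
  <tr><td><i>Katsuwonus
pelamis</i></td></tr>
</table>
"""

# Naive regex: only matches a bare <i>...</i> that sits on a single line.
regex_names = re.findall(r"<i>(.*?)</i>", SAMPLE)

# Parser: every <i> inside the first <table>, whatever its attributes.
soup = BeautifulSoup(SAMPLE, "html.parser")
parsed_names = [
    " ".join(tag.get_text().split())  # collapse internal whitespace
    for tag in soup.table.find_all("i")
]

print(regex_names)   # the regex misses two of the three names
print(parsed_names)  # the parser finds all three
```

The regex could be patched for each of these cases, but every patch is another assumption about the markup; the parser handles them without any extra work.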

2012-03-05 08:20:49 joe

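What jozek suggests is the correct approach, but I couldn't get his snippet to work (though that may be because I am not running the BeautifulSoup 4 beta). What worked for me was: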

import urllib2 
from BeautifulSoup import BeautifulSoup 

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna' 
page = urllib2.urlopen(fish_url) 

soup = BeautifulSoup(page) 

scientific_names = [it.text for it in soup.table.findAll('i')] 

print scientific_names

2012-03-05 09:09:12 BioGeek

In fact, 'findAll' was renamed to 'find_all' to comply with PEP 8. More information [here](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#method-names). – jcollado 2012-03-05 09:13:48

Thanks everyone... I was able to solve the problem I was having with this code:

import urllib2 
from bs4 import BeautifulSoup 

fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon' 
page = urllib2.urlopen(fish_url) 
html_doc = page.read() 
soup = BeautifulSoup(html_doc) 

scientific_names = [it.text for it in soup.table.find_all('i')] 

for item in scientific_names:
    print item
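
The code above targets Python 2 (urllib2 and the print statement). For anyone on Python 3, a rough equivalent might look like the sketch below. It assumes bs4 is installed; the function names scientific_names and fetch_scientific_names are hypothetical helpers introduced here, and the explicit "html.parser" argument is my choice rather than something from the original code.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def scientific_names(html_doc):
    """Return the text of every <i> tag inside the first <table> of html_doc."""
    soup = BeautifulSoup(html_doc, "html.parser")
    if soup.table is None:  # no table found, e.g. an error page
        return []
    return [it.get_text(strip=True) for it in soup.table.find_all("i")]


def fetch_scientific_names(url):
    """Download url and extract the names (hypothetical helper; makes a network request)."""
    with urlopen(url) as page:
        return scientific_names(page.read())
```

Calling fetch_scientific_names(fish_url) and printing each item should mirror the loop above, assuming the page layout still puts the names in <i> tags inside the first table. Keeping the parsing separate from the download makes it possible to try the extraction on a saved copy of the page.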

2012-03-05 19:02:41 user1248092

Don't forget to accept the answer that helped you most as the correct answer. – BioGeek 2012-03-09 13:48:47

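...so it would be appropriate to mark joe's answer as the correct answer... this helps keep people from jumping into the answers and wondering which one worked, when you have not made that decision. – CLaFarge 2015-06-23 21:59:39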
