2015-11-03

How to scrape latitude/longitude data from an HTML page using BeautifulSoup

I am trying to scrape latitude and longitude data from this website:

http://www.healthgrades.com/provider-search-directory/search?q=Dentistry&prof.type=provider&search.type=&method=&loc=New+York+City%2C+NY+&pt=40.71455%2C-74.007118&isNeighborhood=&locType=%7Cstate%7Ccity&locIsSolrCity=false 

For each provider, if you inspect the element, it looks like this:

div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22" 

How can I use BeautifulSoup to get the latitude and longitude values here?

I tried using a regular expression in my script. Here is the script:

Geo = soup.find("div", class_="providerSearchResults") 
print Geo.findAll("div", data-lat_= re.compile('[0-9.]')) 

But I get this error message: "SyntaxError: keyword can't be an expression"

Also, the class on each provider's div varies. It can be:

div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22" 

div class="listingfirst" data-lat="40.66862" data-lng="-73.98574" data-listing="22" 

or even

div class="listing enhancedlisting" data-lat="40.66862" data-lng="-73.98574" data-listing="22" 

The Python regular expression package (['re'](https://docs.python.org/3.5/library/re.html)) has no attribute or method '.find', which is why you're getting that error. – Rejected

Answers


First, a few requirements:

pip install requests 
pip install beautifulsoup4 
pip install lxml 

latlongbs4.py:

import requests 
from bs4 import BeautifulSoup 

r = requests.get('http://www.healthgrades.com/provider-search-directory/search?q=Dentistry&prof.type=provider&search.type=&method=&loc=New+York+City%2C+NY+&pt=40.71455%2C-74.007118&isNeighborhood=&locType=%7Cstate%7Ccity&locIsSolrCity=false') 
soup = BeautifulSoup(r.text, 'lxml') 
# Match any tag that carries both data-* attributes, whatever its class is 
latlonglist = soup.find_all(attrs={"data-lat": True, "data-lng": True}) 
for latlong in latlonglist: 
    print(latlong['data-lat'], latlong['data-lng']) 

Edit: removed class from the attrs dictionary.

Output:

(latlongbs4)macbook:latlongbs4 joeyoung$ python latlongbs4.py 
40.71851 -74.00984 
40.77536 -73.97707 
40.71961 -74.00347 
40.71395 -74.008 
40.711614 -74.015901 
40.724576 -74.001771 
40.7175 -74.00087 
40.71961 -74.00347 
40.71766 -73.99293 
40.71961 -74.00347 
40.71848 -73.99648 
40.709917 -74.009884 
40.71553 -74.00977 
40.71702 -73.996 
40.71254 -73.99994 
40.70869 -74.01164 
40.70994 -74.00764 
40.707325 -74.003982 
40.7184 -74.00098 
40.71373 -74.00812 
40.710474 -74.009844 
40.7175 -74.00087 
40.727582 -73.894632 
40.763469 -73.963106 
40.724853 -73.841097 

A few notes:

I used the attrs keyword with a dictionary because:

Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:

You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
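
To make the point above concrete, here is a minimal sketch, not the code from the question. It assumes the error arises because Python reads `data-lat_=...` as the subtraction `data - lat_`, which cannot be a keyword name, and it passes compiled patterns through the attrs dictionary instead. The sample markup, the second `data-listing` value, and the helper names `keyword_form_is_valid_python` and `extract_coordinates` are made up for illustration. It uses the built-in `html.parser` so lxml is not needed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the snippets in the question.
SAMPLE_HTML = """
<div class="providerSearchResults">
  <div class="listing" data-lat="40.66862" data-lng="-73.98574" data-listing="22"></div>
  <div class="listingfirst" data-lat="40.71851" data-lng="-74.00984" data-listing="23"></div>
  <div class="listingInner">no coordinates here</div>
</div>
"""


def keyword_form_is_valid_python():
    """Return False if `data-lat_=...` cannot be parsed as a keyword argument."""
    try:
        # `data-lat_` parses as the expression `data - lat_`, not as a name.
        compile("find_all('div', data-lat_=1)", "<demo>", "eval")
    except SyntaxError:
        return False
    return True


def extract_coordinates(html):
    """Return (lat, lng) float pairs for tags whose data-lat/data-lng look numeric."""
    soup = BeautifulSoup(html, "html.parser")
    number = re.compile(r"^-?\d+(\.\d+)?$")
    # Attribute names with hyphens go into the attrs dictionary as strings.
    tags = soup.find_all(attrs={"data-lat": number, "data-lng": number})
    return [(float(tag["data-lat"]), float(tag["data-lng"])) for tag in tags]


if __name__ == "__main__":
    print(keyword_form_is_valid_python())
    for lat, lng in extract_coordinates(SAMPLE_HTML):
        print(lat, lng)
```

Using `True` as the dictionary value, as in the answer, only checks that the attribute is present. The pattern here is stricter and skips tags whose values are not plain decimal numbers.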


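I just realized there is a problem with using this code. As I said, the keyword after div changes from provider to provider. So if I only use div class="listing", I will miss some providers. – backpackerice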


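As long as the div still contains the 'data-lat' and 'data-lng' attributes, you can take "class": "listing" out of the dictionary. When I tried it on the URL, I did not see any cases like that. –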


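You can find more details in my original question. I also tried using a regular expression for "listing", for example "^listing.*". However, that gives me some useless data, such as div class=listingInner or div class=listingBody. – backpackerice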
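
One possible way around the over-matching described in the comment above is sketched here, assuming the only provider classes are the three shown in the question. BeautifulSoup tests a `class_` pattern against each class token separately, so an anchored pattern can accept `listing` and `listingfirst` while rejecting `listingInner` and `listingBody`. The alternative is to ignore the class and select on the data attributes with a CSS selector. The sample markup and the function name `provider_divs` are hypothetical. The coordinates are borrowed from the thread, but their pairing with these classes is invented.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: three provider divs plus two unrelated "listing*" divs.
SAMPLE_HTML = """
<div class="listing" data-lat="40.66862" data-lng="-73.98574"></div>
<div class="listingfirst" data-lat="40.71851" data-lng="-74.00984"></div>
<div class="listing enhancedlisting" data-lat="40.77536" data-lng="-73.97707"></div>
<div class="listingInner">inner</div>
<div class="listingBody">body</div>
"""


def provider_divs(html):
    """Return the provider divs found by class pattern and by attribute selector."""
    soup = BeautifulSoup(html, "html.parser")

    # Option 1: anchored regex, checked against each individual class token.
    wanted = re.compile(r"^listing(first)?$")
    by_class = soup.find_all("div", class_=wanted)

    # Option 2: ignore the class and require both data attributes.
    by_attr = soup.select("div[data-lat][data-lng]")
    return by_class, by_attr


if __name__ == "__main__":
    by_class, by_attr = provider_divs(SAMPLE_HTML)
    for tag in by_attr:
        print(tag["class"], tag["data-lat"], tag["data-lng"])
```

The attribute selector is probably the more robust of the two, since it keeps working if the site introduces another class name for provider rows.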