2015-04-28

I'm very new to programming and Python. I tried to write this simple scraper to extract all of the therapist profile URLs from this page, but it doesn't pick up the link class:

http://www.therapy-directory.org.uk/search.php?search=Sheffield&services[23]=1&business_type[individual]=1&distance=40&uqs=626693

import requests 
from bs4 import BeautifulSoup 

def tru_crawler(max_pages): 
    p = '&page=' 
    page = 1 
    while page <= max_pages: 
        url = 'http://www.therapy-directory.org.uk/search.php?search=Sheffield&distance=40&services[23]=on&services=23&business_type[individual]=on&uqs=626693' + p + str(page) 
        code = requests.get(url) 
        text = code.text 
        soup = BeautifulSoup(text) 
        for link in soup.findAll('a', {'member-summary': 'h2'}): 
            href = 'http://www.therapy-directory.org.uk' + link.get('href') 
            yield href + '\n' 
            print(href) 
        page += 1 

Now when I run this code I get nothing back, mainly because soup.findAll comes up empty.

The HTML of a profile link looks like this:

<div class="member-summary"> 
<h2 class=""> 
<a href="/therapists/julia-church?uqs=626693">Julia Church</a> 
</h2> 

so I'm not sure what to pass to soup.findAll('a', ...) to get the profile URLs.

What extra arguments do I need? Please help.

Thanks

UPDATE -

I ran the modified code, and this time it scraped page 1 and then returned a bunch of errors:

Traceback (most recent call last): 
File "C:/Users/PB/PycharmProjects/crawler/crawler-revised.py", line 19, in <module> 
tru_crawler(3) 
File "C:/Users/PB/PycharmProjects/crawler/crawler-revised.py", line 9, in tru_crawler 
code = requests.get(url) 
File "C:\Python27\lib\requests\api.py", line 68, in get 
return request('get', url, **kwargs) 
File "C:\Python27\lib\requests\api.py", line 50, in request 
response = session.request(method=method, url=url, **kwargs) 
File "C:\Python27\lib\requests\sessions.py", line 464, in request 
resp = self.send(prep, **send_kwargs) 
File "C:\Python27\lib\requests\sessions.py", line 602, in send 
history = [resp for resp in gen] if allow_redirects else [] 
File "C:\Python27\lib\requests\sessions.py", line 195, in resolve_redirects 
allow_redirects=False, 
File "C:\Python27\lib\requests\sessions.py", line 576, in send 
r = adapter.send(request, **kwargs) 
File "C:\Python27\lib\requests\adapters.py", line 415, in send 
raise ConnectionError(err, request=request) 
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",)) 

What is going wrong here that it returns this string of errors?

Answer


Currently, the arguments you pass to findAll() make no sense. They read: find all <a> tags that have a member-summary attribute equal to "h2".

One possible way is to use the select() method, passing a CSS selector as the argument:

for link in soup.select('div.member-summary h2 a'): 
    href = 'http://www.therapy-directory.org.uk' + link.get('href') 
    yield href + '\n' 
    print(href) 

The CSS selector above reads: find <div> tags whose class equals "member-summary", then inside that <div> find <h2> tags, then inside that <h2> find <a> tags.
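For comparison, the same lookup can also be written with find_all() by filtering on the class attribute instead of a CSS selector. A minimal sketch, using the HTML snippet from the question as input:

```python
from bs4 import BeautifulSoup

html = """
<div class="member-summary">
<h2 class="">
<a href="/therapists/julia-church?uqs=626693">Julia Church</a>
</h2>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# find_all() equivalent of the selector 'div.member-summary h2 a':
# first the <div>s with the class, then the <a>s inside their <h2>s
for div in soup.find_all('div', class_='member-summary'):
    for h2 in div.find_all('h2'):
        for a in h2.find_all('a'):
            print(a.get('href'))  # prints /therapists/julia-church?uqs=626693
```

Note that class_ (with the trailing underscore) is how BeautifulSoup spells the class filter, since class is a Python keyword.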

Working example:

import requests 
from bs4 import BeautifulSoup 

p = '&page=' 
page = 1 
url = 'http://www.therapy-directory.org.uk/search.php?search=Sheffield&distance=40&services[23]=on&services=23&business_type[individual]=on&uqs=626693' + p + str(page) 
code = requests.get(url) 
text = code.text 
soup = BeautifulSoup(text) 
for link in soup.select('div.member-summary h2 a'): 
    href = 'http://www.therapy-directory.org.uk' + link.get('href') 
    print(href) 

Output (trimmed; 26 links in total):

http://www.therapy-directory.org.uk/therapists/lesley-lister?uqs=626693 
http://www.therapy-directory.org.uk/therapists/fiona-jeffrey?uqs=626693 
http://www.therapy-directory.org.uk/therapists/ann-grant?uqs=626693 
..... 
..... 
http://www.therapy-directory.org.uk/therapists/jan-garbutt?uqs=626693 
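To scrape all pages rather than just the first, the select() call can be dropped back into your original generator. A sketch, assuming the site keeps accepting the same query parameters; note that a generator function produces nothing until the caller actually iterates it:

```python
import requests
from bs4 import BeautifulSoup

def tru_crawler(max_pages):
    base = ('http://www.therapy-directory.org.uk/search.php?search=Sheffield'
            '&distance=40&services[23]=on&services=23'
            '&business_type[individual]=on&uqs=626693&page=')
    for page in range(1, max_pages + 1):
        soup = BeautifulSoup(requests.get(base + str(page)).text, 'html.parser')
        for link in soup.select('div.member-summary h2 a'):
            yield 'http://www.therapy-directory.org.uk' + link.get('href')

# calling tru_crawler(3) alone does nothing visible --
# the generator must be iterated to run:
for href in tru_crawler(3):
    print(href)
```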

Thanks for this, but it still doesn't return anything :( –


@pb_ng Hmm.. it works for me (a bunch of links gets printed). See the updated answer for how I tried it – har07


Thank you — removing "yield href + '\n'" made it work. If you don't mind me asking, why does it return nothing when yield is used? –
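On the last comment: calling a function that contains yield only creates a generator object; none of its body runs (so nothing is printed and no request is sent) until the caller iterates it. A minimal illustration:

```python
def gen():
    # this body does not execute when gen() is called --
    # only once the returned generator is iterated
    print('running')
    yield 'href-1'
    yield 'href-2'

g = gen()        # nothing printed yet; g is just a generator object
links = list(g)  # now 'running' is printed and the yielded values are collected
print(links)     # prints ['href-1', 'href-2']
```

That is why tru_crawler(3) on its own appeared to do nothing: the call has to appear in a for loop (or list(), next(), etc.) for the scraping to happen.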