2016-07-16 53 views
-1

How can I make a web scraper iterate through multiple pages of search results with Beautiful Soup?

I have a script I wrote that uses Beautiful Soup to scrape a website's search results. I managed to isolate the data I want by its class name.

However, the search results are not on a single page. Instead, they are spread across multiple pages, and I want to get them all. I want my script to be able to check whether there is a next page of results and run on that page too. Since the number of results varies, I don't know how many pages of results exist, so I can't predefine a range to iterate over. I also tried an 'if_page_exists'-style check, but if I request a page number beyond the range of results, the page always exists; it just has no results, only a message saying there are no results to display.

However, I noticed that every results page has a 'Next' link with the id 'NextLink1', and the last results page does not. So I think that may be the trick. But I don't know how or where to implement that check; I keep getting infinite loops and such.
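The termination logic described above can be sketched offline. This is a minimal illustration, not the poster's script: `FAKE_PAGES` and `fetch` are hypothetical stand-ins for the real `urlopen` calls, canned so the loop can be shown without network access. The key idea is that the absence of the `NextLink1` anchor ends the loop.

```python
from bs4 import BeautifulSoup

# Canned pages standing in for real fetches (hypothetical helper; the real
# script would build the search URL with the page number and call urlopen).
FAKE_PAGES = {
    1: '<td class="party-name">Alice</td><a id="NextLink1" href="?page=2">Next</a>',
    2: '<td class="party-name">Bob</td><a id="NextLink1" href="?page=3">Next</a>',
    3: '<td class="party-name">Carol</td>',  # last page: no 'Next' link
}

def fetch(page_number):
    return FAKE_PAGES[page_number]

def scrape_all(start=1):
    names = []
    page_number = start
    while True:
        soup = BeautifulSoup(fetch(page_number), "html.parser")
        names.extend(td.get_text()
                     for td in soup.find_all("td", {"class": "party-name"}))
        # The last results page has no link with id 'NextLink1',
        # so its absence terminates the loop.
        if soup.find("a", id="NextLink1") is None:
            break
        page_number += 1
    return names
```

Checking for the link after parsing each page, rather than guessing how many pages exist, avoids both the unknown page count and the infinite loops.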

The script below searches for the term 'x'. Any help would be appreciated.

from urllib.request import urlopen 
from bs4 import BeautifulSoup 

#all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o","p","q","r","s","t","u","v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"] 
all_letters = ['x']
for letter in all_letters:

    page_number = 1
    url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")  # name the parser explicitly
    nameList = bsObj.findAll("td", {"class": "party-name"})

    for name in nameList:
        print(name.get_text())

Also, does anyone know a shorter way to instantiate a list of alphanumeric characters than the one commented out in my script above?
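For the side question about the alphanumeric list, the standard library's `string` module already provides the letters and digits as constants, so the hand-typed list can be replaced with:

```python
import string

# Shorter equivalent of the commented-out list: lowercase a-z followed by 0-9.
all_letters = list(string.ascii_lowercase + string.digits)
```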

+0

So basically you want `if bsObj.find('a', id='NextLink1'): page_number += 1`? –

+0

The problem is that my url is created and parsed before bsObj is instantiated, so I don't know how I can change the url after I've done that check. –

+0

Please **do not repost questions**: [How can I make a web scraper traverse multiple pages of search results using Beautiful Soup?](http://stackoverflow.com/questions/38364642/how-can-i-make-a-web-scraper-traverse-multiple-pages-of-search-results-using-bea) –

Answer

0

Try this:

from urllib.request import urlopen 
from bs4 import BeautifulSoup 


#all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o","p","q","r","s","t","u","v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"] 
all_letters = ['x']

def get_url(letter, page_number):
    return "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)

def list_names(soup):
    nameList = soup.findAll("td", {"class": "party-name"})
    for name in nameList:
        print(name.get_text())

def get_soup(letter, page):
    url = get_url(letter, page)
    html = urlopen(url)
    return BeautifulSoup(html, "html.parser")

def main():
    for letter in all_letters:
        bsObj = get_soup(letter, 1)

        # The page list is a <select> element; collect every page number
        # except the currently selected one (page 1, already fetched).
        # Note: pages is local to the loop so it resets for each letter.
        pages = []
        sel = bsObj.find('select', {"name": "ctl00$ctl00$InternetApplication_Body$WebApplication_Body$SearchResultPageList1"})
        for opt in sel.findChildren("option", selected=lambda x: x != "selected"):
            pages.append(opt.string)

        list_names(bsObj)

        for page in pages:
            bsObj = get_soup(letter, page)
            list_names(bsObj)

main()

In the main() function, from the first page fetched by get_soup(letter, 1), we find the select element that contains all the page numbers as options, and store its values in a list.

Next, we loop over those page numbers to extract the data from each following page.
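The option filter used in main() can be demonstrated offline. The `<select>` markup below is a simplified stand-in for the real page-list element on the results page (the real one has the long ASP.NET `name` shown in the answer); the logic is identical.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page-list <select> on the results page.
html = """
<select name="pages">
  <option selected="selected" value="1">1</option>
  <option value="2">2</option>
  <option value="3">3</option>
</select>
"""
soup = BeautifulSoup(html, "html.parser")
sel = soup.find("select", {"name": "pages"})

# The lambda rejects the option whose selected attribute is "selected"
# (the page already scraped) and keeps all the others, including options
# with no selected attribute at all (their attribute value is None).
pages = [opt.string
         for opt in sel.findChildren("option",
                                     selected=lambda x: x != "selected")]
```

Here `pages` ends up holding the remaining page numbers as strings, which is exactly what the loop at the end of main() feeds back into get_soup().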