How to scrape more than 100 Google results in one pass

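I am using the requests library in Python to GET data from Google search results. https://www.google.com.pk/#q=pizza&num=10 returns the first 10 Google results, because I specified num=10. Likewise, https://www.google.com.pk/#q=pizza&num=100 returns 100 Google search results.

But if I write any number above 100, say https://www.google.com.pk/#q=pizza&num=200, Google still returns only the first 100 results.

How can I get more than 100 in one pass?

Code: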

import requests
url = 'http://www.google.com/search'
my_headers = {'User-agent': 'Mozilla/11.0'}
# The query must be a string; a bare name `pizza` would raise a NameError.
payload = {'q': 'pizza', 'start': '0', 'num': 200}
r = requests.get(url, params=payload, headers=my_headers)

In "r" I only get the URLs of the first 100 Google results, not 200.

Source

2016-01-05 Muhammad Zeeshan

Google enforces a limit: the maximum number of results per page is 100. – AChampion

Any other way....? –

That is not 100 pages, it is 100 results. –
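A note on the arithmetic behind that limit: if each response holds at most 100 results, asking for 200 means at least two requests, stepped with the start parameter that is already in the question's payload. The following is only a minimal sketch of that split. paged_payloads is a hypothetical helper, and it is an assumption that Google honours start this way for scripted requests (it may also throttle or block them).

```python
def paged_payloads(query, total, per_page=100):
    """Split a request for `total` results into payloads of at most `per_page`.

    Hypothetical helper: it only builds the parameter dicts and does not
    contact Google.
    """
    payloads = []
    for start in range(0, total, per_page):
        payloads.append({
            'q': query,
            'start': start,                        # offset of the first result
            'num': min(per_page, total - start),   # never ask for more than the cap
        })
    return payloads


for payload in paged_payloads('pizza', 200):
    print(payload)
```

Each dict could then be passed as params to requests.get, as in the question's own code.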

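You can use a more programmatic API from Google to get the results, rather than trying to screen-scrape the human search interface. There is no error checking here, and no assertion that this complies with all of Google's T&Cs; I suggest you look into the details of using this URL: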

import requests

def search(query, pages=4, rsz=8):
    url = 'https://ajax.googleapis.com/ajax/services/search/web'
    params = {
        'v': 1.0,      # Version
        'q': query,    # Query string
        'rsz': rsz,    # Result set size - max 8
    }

    # One request per start offset: 0, rsz, 2*rsz, ..., pages*rsz
    for s in range(0, pages*rsz+1, rsz):
        params['start'] = s
        r = requests.get(url, params=params)
        for result in r.json()['responseData']['results']:
            yield result

For example, to get 200 results for 'google':

>>> list(search('google', pages=24, rsz=8)) 
[{'GsearchResultClass': 'GwebSearch', 
    'cacheUrl': 'http://www.google.com/search?q=cache:y14FcUQOGl4J:www.google.com', 
    'content': 'Search the world&#39;s information, including webpages, images, videos and more. \n<b>Google</b> has many special features to help you find exactly what you&#39;re looking\xa0...', 
    'title': '<b>Google</b>', 
    'titleNoFormatting': 'Google', 
    'unescapedUrl': 'https://www.google.com/', 
    'url': 'https://www.google.com/', 
    'visibleUrl': 'www.google.com'}, 
    ... 
]
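It may not be obvious why pages=24 with rsz=8 yields 200 results rather than 192. The range in search() includes the offset pages*rsz itself, so the loop issues pages + 1 requests. A small check of that arithmetic follows; start_offsets is a hypothetical helper that mirrors the range expression and makes no network calls.

```python
def start_offsets(pages, rsz):
    """Return the 'start' values that search() would request."""
    return list(range(0, pages * rsz + 1, rsz))


offsets = start_offsets(24, 8)
print(len(offsets))          # number of requests issued
print(offsets[-1])           # last start offset
print(len(offsets) * 8)      # results if every page is full
```

With pages=24 and rsz=8 this gives 25 requests, a last offset of 192, and 200 results when every page comes back full.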

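To use Google's Custom Search API you need to sign up as a developer. You get 100 free queries per day (I am not sure whether that counts API calls, or whether paginating the same query counts as 1 query):

Sign up @ https://console.developers.google.com
Create a project
Create a key
Enable the Custom Search API
@ https://cse.google.com
- Create a custom search engine, using a dummy site to initialize the CSE
- Edit the CSE to search the entire web
- Delete the dummy site
Get the CSE reference (look at the public URL for cx=<cse reference>)

Then you can use requests to make the query: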

import requests 
url = 'https://www.googleapis.com/customsearch/v1' 
params = { 
    'key': '<key>', 
    'cx': '<cse reference>', 
    'q': '<search>', 
    'num': 10, 
    'start': 1 
} 

resp = requests.get(url, params=params) 
results = resp.json()['items']

With start you can paginate in a similar way to the above.
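A minimal sketch of that pagination, assuming the same endpoint and the same key, cx, q, num and start parameters as the snippet above. cse_search is a hypothetical helper, and the get argument exists only so the HTTP call can be swapped out for testing. The page size of 10 is taken from the num value used above. As far as I know the API also caps how many results a single query can reach in total, so check the documentation before relying on large totals.

```python
CSE_URL = 'https://www.googleapis.com/customsearch/v1'


def cse_search(query, key, cx, total=30, page_size=10, get=None):
    """Yield up to `total` result items, requesting `page_size` per call."""
    if get is None:
        # Imported lazily so the helper can be exercised with a fake `get`.
        import requests
        get = requests.get

    fetched = 0
    while fetched < total:
        params = {
            'key': key,
            'cx': cx,
            'q': query,
            'num': min(page_size, total - fetched),
            'start': fetched + 1,   # 1-based, as in the snippet above
        }
        resp = get(CSE_URL, params=params)
        resp.raise_for_status()
        items = resp.json().get('items', [])
        if not items:
            break                   # no more results for this query
        for item in items:
            yield item
        fetched += len(items)
```

Usage would look like list(cse_search('<search>', '<key>', '<cse reference>', total=30)). Each page is a separate API call, so it may count against the daily quota.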

There are many other parameters available; see the REST documentation for CSE: https://developers.google.com/custom-search/json-api/v1/reference/cse/list#request

Google also has a client API library (pip install google-api-python-client), which you can use as well:

from googleapiclient import discovery 
service = discovery.build('customsearch', 'v1', developerKey='<key>') 
params = { 
    'q': '<query>', 
    'cx': '<cse reference>', 
    'num': 10, 
    'start': 1 
} 
query = service.cse().list(**params) 
results = query.execute()['items']

Source

2016-01-05 17:10:19 AChampion

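Thanks a lot. It gives results on the first attempt, but after that it does not retrieve any results. Maybe Google has started blocking my IP? Any solution? –

The AJAX search URL is officially deprecated. In fact, most forms of this URL were removed years ago, and this particular form was presumably missed. There are API search limits. The official way to access Google programmatically is via the Custom Search API https://developers.google.com/custom-search/, for which you need to register for a key - this has significant limits, and you need to pay if you want full access. – AChampion

NP. You get 100 CSE queries per day before you need to start paying. – AChampion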

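You can use browser automation for this. I used it to scrape image listings. With browser automation you can click the next or previous buttons and scrape the results. I cannot paste the code.

Source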

2016-01-27 12:43:07
