When you reach the last page, the "próximo" (next) button is disabled:
<a data-pagina="2" href="?ss=4da73052cb8296b5&st=G1&q=incerteza+pol%C3%ADtica+economia&cat=a&species=not%C3%ADcias&page=2"
class="proximo fundo-cor-produto"> próximo</a>
^^^^
# ok
<a data-pagina="41" href="?ss=4da73052cb8296b5&st=G1&q=incerteza+pol%C3%ADtica+economia&cat=a&species=not%C3%ADcias&page=41"
class="proximo disabled">próximo</a>
^^^^
# no more next pages
So just keep looping until that happens:
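from itertools import count
import requests
from bs4 import BeautifulSoup
# Assumed here: the search URL from this thread, with the page number as a placeholder.
url = "http://g1.globo.com/busca/?q=incerteza%20pol%C3%ADtica%20economia&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page={}"
page_count = count(1)
# Fetch the first page and check whether the "próximo" button is already disabled.
soup = BeautifulSoup(requests.get(url.format(next(page_count))).content, "html.parser")
disabled = soup.select_one("#paginador ul li a.proximo.disabled")
print([a["href"] for a in soup.select("div.busca-materia-padrao a")])
print(soup.select_one("a.proximo.disabled"))  # None unless this is already the last page
# Keep requesting pages until the disabled button shows up.
while not disabled:
    soup = BeautifulSoup(requests.get(url.format(next(page_count))).content, "html.parser")
    disabled = soup.select_one("#paginador ul li a.proximo.disabled")
    print([a["href"] for a in soup.select("div.busca-materia-padrao a")])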
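If you want to try the disabled-button check without hitting the site, something like the following might help. It is only a sketch: it swaps BeautifulSoup for the standard library's html.parser so it runs with no installs, the names NextButtonParser and is_last_page are made up for illustration, and the two sample anchors are shortened copies of the ones above (query strings trimmed).

```python
from html.parser import HTMLParser


class NextButtonParser(HTMLParser):
    """Records whether an <a> carrying both 'proximo' and 'disabled' classes was seen."""

    def __init__(self):
        super().__init__()
        self.next_disabled = False

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if "proximo" in classes and "disabled" in classes:
            self.next_disabled = True


def is_last_page(html):
    """Return True if the markup contains a disabled 'próximo' link."""
    parser = NextButtonParser()
    parser.feed(html)
    parser.close()
    return parser.next_disabled


# Shortened copies of the two anchors shown earlier (hypothetical hrefs).
MIDDLE_PAGE = '<a data-pagina="2" href="?page=2" class="proximo fundo-cor-produto"> próximo</a>'
LAST_PAGE = '<a data-pagina="41" href="?page=41" class="proximo disabled">próximo</a>'

print(is_last_page(MIDDLE_PAGE))  # False
print(is_last_page(LAST_PAGE))    # True
```

This is the same test as the a.proximo.disabled selector: both classes have to be present on the same link.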
If you are using requests and want to check whether you were redirected, you can inspect the .history attribute:
In [1]: import requests
In [2]: r = requests.get("http://g1.globo.com/busca/?q=incerteza%20pol%C3%ADtica%20economia&cat=a&ss=4da73052cb8296b5&st=G1&species=not%C3%ADcias&page=5000")
In [3]: print(r.history)
[<Response [301]>]
In [4]: r.history[0].status_code == 301
Out[4]: True
Another way with requests is to disable redirects and catch the 301 status code:
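# Reuses url, count, requests and BeautifulSoup from the loop above; the counter is restarted.
page_count = count(1)
soup = BeautifulSoup(requests.get(url.format(next(page_count))).content, "html.parser")
print([a["href"] for a in soup.select("div.busca-materia-padrao a")])
while True:
    r = requests.get(url.format(next(page_count)), allow_redirects=False)
    # A 301 means the requested page is past the last one, so stop.
    if r.status_code == 301:
        break
    soup = BeautifulSoup(r.content, "html.parser")
    print([a["href"] for a in soup.select("div.busca-materia-padrao a")])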
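The stop-on-301 loop can also be wrapped in a generator, which makes it easy to try without the network. This is just a sketch under some assumptions: iter_pages, fetch, max_pages, FakeResponse and fake_fetch are names invented here, and the fake server with three pages only stands in for the real site.

```python
from itertools import count


def iter_pages(fetch, start=1, max_pages=1000):
    """Yield (page_number, response) until fetch() reports a 301 redirect.

    fetch is a hypothetical callable that takes a page number and returns an
    object with a .status_code attribute (for example a requests.Response).
    max_pages is a safety cap chosen here, not something the site defines.
    """
    for page in count(start):
        if page - start >= max_pages:
            break
        response = fetch(page)
        if response.status_code == 301:
            break  # past the last page: the server answers with a redirect
        yield page, response


class FakeResponse:
    """Minimal stand-in for a response object, for offline use only."""

    def __init__(self, status_code, content=b""):
        self.status_code = status_code
        self.content = content


def fake_fetch(page):
    # Pretend pages 1-3 exist and anything beyond them redirects.
    if page <= 3:
        return FakeResponse(200, b"page %d" % page)
    return FakeResponse(301)


pages = [page for page, _ in iter_pages(fake_fetch)]
print(pages)  # [1, 2, 3]
```

Against the real site, fetch could be something like lambda page: requests.get(url.format(page), allow_redirects=False), and each yielded response's .content would go to BeautifulSoup as in the loop above.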
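I think your logic is correct, but the while condition doesn't stop the code when it reaches the last page.
@ThalesMarques, yes, I had a typo in my selector; it works correctly now.
The second snippet still loops past the last page, but the last one works fine. I'll work from that. Thank you very much!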