Scraping href links with Python

My goal is to scrape the href links from the website at base_url.

My code:

from bs4 import BeautifulSoup
from selenium import webdriver
import requests, csv, re

game_links = []
link_pages = []
base_url = "http://www.basket.fi/sarjat/ohjelma_tulokset/?season_id=93783&league_id=4#mbt:2-303$f&stage=177155:$p&0="

# Render the page in a browser so the JavaScript-generated table is present
browser = webdriver.PhantomJS()
browser.get(base_url)
table = BeautifulSoup(browser.page_source, 'lxml')
# Match every <a> tag whose game_id attribute contains digits
for game in table.find_all("a", {'game_id': re.compile(r'\d+')}):
    href = game.get("href")
    print(href)

Result:

http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4 
http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4 
http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4 
http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4 

......

The problem is that I don't understand why every href link appears twice in the result.

Source

2017-08-22 Juho M

Could the links appear twice in the page? You could use `set()` to filter out the duplicates (hmm, not sure it works with Tag objects...) – PRMoureu
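A minimal sketch of the deduplication idea from the comment above, assuming the href strings have already been collected into a list. Working on the strings rather than on the Tag objects sidesteps the doubt about hashing tags. The function name `unique_links` is hypothetical, and the sample data is copied from the result shown in the question. `dict.fromkeys` is used instead of `set()` only so that the original order is kept.

```python
def unique_links(hrefs):
    """Return the hrefs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(hrefs))


# Sample data copied from the output in the question
hrefs = [
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4",
]

for href in unique_links(hrefs):
    print(href)
```

In the scraping loop, this would mean appending each `game.get("href")` to a list first and deduplicating once at the end.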

As you can see in the image, two links share the same game_id.

Modified code, which should help you get only one link:

for game in table.find_all("a", {'game_id': re.compile(r'\d+')}):
    if game.children:
        href = game.get("href")
        print(href)

Source
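One caveat worth checking: as far as I can tell, `game.children` is an iterator object, so the `if` test is true for every tag and may not filter anything on its own. Since the observation is that two links share one game_id, an alternative sketch is to keep only the first link seen for each game_id. The helper name `first_link_per_game` is hypothetical, and the sample list stands in for the hrefs collected by the loop.

```python
from urllib.parse import urlparse, parse_qs


def first_link_per_game(hrefs):
    """Keep the first href seen for each game_id query parameter."""
    seen_ids = set()
    kept = []
    for href in hrefs:
        # parse_qs maps each query parameter to a list of its values
        ids = parse_qs(urlparse(href).query).get("game_id")
        if not ids:
            continue  # skip links that carry no game_id
        game_id = ids[0]
        if game_id in seen_ids:
            continue  # a link for this game was already kept
        seen_ids.add(game_id)
        kept.append(href)
    return kept


# Sample data copied from the output in the question
sample = [
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4",
]

for href in first_link_per_game(sample):
    print(href)
```

Keying on game_id rather than on the full URL also collapses links that point to the same game but differ in other query parameters, which may or may not be what you want.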

2017-08-22 11:38:48
