How do I scrape a website whose table loads dynamically?

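I'm trying to scrape a website with Python 2.7, and the page contains a table that has to load first. When I scrape it, all I get is "Loading" or "Sorry, we don't have any information about it", because the table has not loaded yet. How can I scrape a page like this?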

I have read several articles and code samples, but nothing worked.

My code:

import urllib2, sys
from BeautifulSoup import BeautifulSoup
import json

site = "https://www.flightradar24.com/data/airports/bud/arrivals"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

nev = soup.find('h1', attrs={'class': 'airport-name'})
print nev

table = soup.find('div', {"class": "row cnt-schedule-table"})
print table

import urllib2
from bs4 import BeautifulSoup
import json

# new url
url = 'https://www.flightradar24.com/data/airports/bud/arrivals'

# read all data
page = urllib2.urlopen(url).read()

# convert json text to python dictionary
data = json.loads(page)

print(data['row cnt-schedule-table'])


2017-07-25 tardos93

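This data is usually loaded via AJAX, and sometimes it comes from a JavaScript variable. You need to find the source and fetch the information from there. – VMRuiz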
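To illustrate the JavaScript-variable case, here is a minimal sketch that pulls a JSON value out of an inline script. The variable name `flights` and the sample HTML are hypothetical and were not taken from the site in question. The approach only works when the assigned value is valid JSON; a JavaScript object literal with unquoted keys makes `raw_decode` raise `ValueError`.

```python
import json
import re


def extract_js_variable(html, name):
    """Return the JSON value assigned to `var <name> = ...` in html, or None."""
    pattern = r"\b(?:var|let|const)\s+" + re.escape(name) + r"\s*=\s*"
    match = re.search(pattern, html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (the trailing ';', other script code).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Hypothetical page fragment, for illustration only
sample_html = '<script>var flights = [{"flight": "XX123", "status": "landed"}];</script>'
flights = extract_js_variable(sample_html, "flights")
```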

Use a tool such as Fiddler or Charles Proxy. For this example, this is your AJAX API call: https://api.flightradar24.com/common/v1/airport.json?code=bud&plugin[]=&plugin-setting[schedule][mode]=arrivals&plugin-setting[schedule][timestamp]=1500966512&page=2&limit=50&token= – Aki003
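A rough sketch of calling that endpoint directly from Python might look like the following. The query parameter names are my reading of the URL in the comment above (which arrived garbled), the `token` value is left empty exactly as it appears there, and nothing is assumed about the structure of the JSON that comes back. `fetch_json` performs a real network request and may need different headers or a token in practice.

```python
import json
import time

try:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import Request, urlopen
except ImportError:  # Python 2.7, as used in the question
    from urllib import urlencode
    from urllib2 import Request, urlopen

API_URL = "https://api.flightradar24.com/common/v1/airport.json"


def build_airport_url(code, mode="arrivals", page=1, limit=50, timestamp=None):
    """Build the schedule query URL for one airport code."""
    if timestamp is None:
        timestamp = int(time.time())  # assumption: a Unix timestamp is expected
    params = [
        ("code", code),
        ("plugin[]", ""),
        ("plugin-setting[schedule][mode]", mode),
        ("plugin-setting[schedule][timestamp]", timestamp),
        ("page", page),
        ("limit", limit),
        ("token", ""),  # empty in the comment; kept empty here
    ]
    return API_URL + "?" + urlencode(params)


def fetch_json(url):
    """Download url and decode the JSON body (performs a network request)."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    body = urlopen(req).read()
    return json.loads(body.decode("utf-8"))
```

Calling `fetch_json(build_airport_url("bud"))` and printing the top-level keys of the result is a reasonable first step for locating the schedule data.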

That link is not a good solution for me, because some of the information gets lost that way. – tardos93

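I faced this problem too. You can use the Python selenium package. We need to wait for your table to load, so I used time.sleep(), but that is not the proper way to do it; you can use the wait.until("element") method instead. Please find sample code below: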

from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Ask the site for English content
profile = webdriver.FirefoxProfile()
profile.set_preference("intl.accept_languages", "en-us")
driver = webdriver.Firefox(firefox_profile=profile)
driver.get("https://www.flightradar24.com/data/airports/bud/arrivals")
# Crude fixed delay so the table can load; an explicit wait is more reliable
time.sleep(10)
html_source = driver.page_source
soup = BeautifulSoup(html_source, "html.parser")
print(soup)

Reference link:

Selenium waitForElement


2017-07-25 07:15:36 karnaf

Is there any risk in using time.sleep? Is time.sleep(10) enough, or does it depend on the hardware and the internet connection? – tardos93

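Yes, I know; that is why I mentioned that this is not the proper way. Instead we can use Selenium's wait.until() API, which waits until the table content (the table element) has been populated. – karnaf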
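To show what such a wait does compared with a fixed time.sleep(10), here is a minimal polling sketch. It is not Selenium's own implementation: it only relies on the driver exposing `find_elements(by, value)`, and the CSS selector you pass in is up to you (the page's real table selector is not given in this thread). It returns as soon as the elements exist and gives up after `timeout` seconds, so it adapts to slow or fast connections.

```python
import time


def wait_for_elements(driver, css_selector, timeout=30.0, poll=0.5):
    """Poll until css_selector matches at least one element, then return the matches.

    Raises RuntimeError if nothing matches within `timeout` seconds.
    """
    deadline = time.time() + timeout
    while True:
        # "css selector" is the locator strategy string behind By.CSS_SELECTOR
        elements = driver.find_elements("css selector", css_selector)
        if elements:
            return elements
        if time.time() >= deadline:
            raise RuntimeError(
                "no element matched %r within %s s" % (css_selector, timeout)
            )
        time.sleep(poll)
```

In real code, Selenium's built-in `WebDriverWait(driver, 30).until(...)` combined with a condition from `selenium.webdriver.support.expected_conditions` does the same job and is probably the better choice; once the wait returns, `driver.page_source` can be handed to BeautifulSoup as before.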

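Hmm. I tried inserting this time.sleep under "page = urllib2....", but I got this error message: webdriver.Firefox.implicitly_wait(30) TypeError: unbound method implicitly_wait() must be called with WebDriver instance as first argument (got int instance instead). This is the code: webdriver.Firefox.implicitly_wait(30) – tardos93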
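The error in the comment above comes from calling `implicitly_wait` on the class rather than on a driver object. The sketch below reproduces that with a hypothetical stand-in class (`StandInDriver` is not part of Selenium) so it runs without a browser. With Selenium itself, the equivalent would be to create `driver = webdriver.Firefox()` first and then call `driver.implicitly_wait(30)`; note that this setting affects element lookups made through that driver and has no effect on pages fetched with urllib2.

```python
class StandInDriver(object):
    """Hypothetical stand-in for a Selenium WebDriver; not the real class."""

    def __init__(self):
        self.implicit_wait_seconds = 0

    def implicitly_wait(self, seconds):
        self.implicit_wait_seconds = seconds


def call_on_class():
    # Mirrors webdriver.Firefox.implicitly_wait(30): no instance exists,
    # so 30 is taken as `self` and Python raises TypeError.
    StandInDriver.implicitly_wait(30)


def call_on_instance():
    # Mirrors driver = webdriver.Firefox(); driver.implicitly_wait(30)
    driver = StandInDriver()
    driver.implicitly_wait(30)
    return driver
```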
