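Scraping data/forms from a website: I tried mechanize and selenium, and both failed. Scraping a web page, but JavaScript is required to view the page content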

mechanize

The script looks like this:

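import sys
import mechanize

br = mechanize.Browser()  # the original snippet used br without creating it
url = 'xxx'
response2 = br.open(url)
request = br.request
print(response2.info())  # response headers
print(response2.read())  # response body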

Output:

Cache-Control: no-store, must-revalidate, no-cache, max-age=0 
Content-Type: text/html 
Connection: close 
Vary: Accept-Encoding 
Pragma: no-cache 
Expires: -1 
CacheControl: no-cache 
X-UA-Compatible: IE=edge 
Content-Type: text/html; charset=utf-8 

... more content ... 

<noscript>Please enable JavaScript to view the page content.</noscript> 
</head><body> 
</body></html>
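
The output above has a `<noscript>` notice and an empty `<body>`, which is what a client that does not run JavaScript receives. As a rough way to check for that case programmatically, here is a minimal sketch using only the standard library. The names `BodyTextProbe` and `looks_like_js_placeholder` are hypothetical helpers, not part of mechanize or selenium, and the helper assumes the HTML has already been decoded to a string.

```python
from html.parser import HTMLParser


class BodyTextProbe(HTMLParser):
    """Collect visible text inside <body>, skipping noscript/script/style."""

    SKIPPED = ("noscript", "script", "style")

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.saw_noscript = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1
            if tag == "noscript":
                self.saw_noscript = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that a user would actually see.
        if self.in_body and not self.skip_depth and data.strip():
            self.text.append(data.strip())


def looks_like_js_placeholder(html):
    """True if the page has a <noscript> notice and no visible body text."""
    probe = BodyTextProbe()
    probe.feed(html)
    return bool(probe.saw_noscript and not probe.text)
```

With the mechanize script, this could be called as `looks_like_js_placeholder(response2.read().decode('utf-8', 'replace'))`. Note that a `<noscript>` element stays in the markup even when JavaScript is enabled, so its presence alone does not prove that scripts failed to run; the empty body is the more telling sign.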

selenium

So I thought maybe I could use selenium to run the JS, like this:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
url = 'xxx'
driver.get(url)

print(driver.context)
print(driver.title)

print(driver.page_source)
driver.close()

but I failed again; the result is almost the same:

... 
<noscript>Please enable JavaScript to view the page content.</noscript> 
...

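I just want to get the correct content/form from the website, and then submit or POST the data/form to the server to simulate a browser visit.

I am out of ideas now. I don't know how selenium works, and I am waiting for your help. Thanks in advance.

Source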

2017-06-15 tim

Sorry, I forgot the URL. The URL is 'https://onlineservices.immigration.govt.nz/WHS' – tim

You can try adding this: profile = webdriver.FirefoxProfile() .. profile.set_preference("javascript.enabled", True) .. browser = webdriver.Firefox(profile) –

When I visited that page, they showed me an image code (CAPTCHA) to keep out non-human visitors. Clearly, they don't want you to fetch that data – codeiscool

Try this:
Use the profile below to enable Flash.

from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile

firefoxProfile = FirefoxProfile()

## Enable Flash

firefoxProfile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so',
          'true')

driver = webdriver.Firefox(firefoxProfile)

If it still doesn't work, use chromedriver instead of Firefox; chromedriver seems to work by default.

https://chromedriver.storage.googleapis.com/index.html?path=2.30/
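
Whichever browser is used, `page_source` read right after `driver.get(url)` may still be the pre-JavaScript placeholder if the page fills in its content asynchronously. A minimal sketch of polling until the content appears is shown below. `wait_for_rendered_source` and the `is_ready` callback are hypothetical names introduced here for illustration, and `driver` is only assumed to expose a `page_source` attribute, as a selenium WebDriver does. This will not get past the image code mentioned in the comments.

```python
import time


def wait_for_rendered_source(driver, is_ready, timeout=15.0, poll=0.5):
    """Poll driver.page_source until is_ready(html) is true.

    Returns the HTML once is_ready accepts it, or raises TimeoutError
    if that has not happened by the time `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if is_ready(html):
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError("page did not render within %.1f s" % timeout)
        time.sleep(poll)
```

For example, `html = wait_for_rendered_source(driver, lambda h: "<form" in h)` after `driver.get(url)` would wait for a form tag to show up. selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui`, which serves the same purpose and is probably the better choice in real code.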

Source

2017-06-15 08:26:08 Stack
