Scraping content generated by a JavaScript post-load script with Selenium and Python

I am trying to scrape real estate data from this website: example. As you can see, the relevant content is placed inside article tags.

I am using Selenium with PhantomJS:

driver = webdriver.PhantomJS(executable_path=PJSpath)

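Then I generate the URL in Python. All the search criteria are part of the link, so the program can search for what I am looking for without filling in a form.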
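A minimal sketch of building such a link with the standard library, assuming the query parameters seen in the URL used in the answer further down. The function name `build_search_url` and the meaning given to each parameter (postcode, price range, surface range, property types) are guesses made for illustration, not something the site documents here.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.seloger.com/list.htm"


def build_search_url(postcode, price_min, price_max,
                     surface_min, surface_max, property_types):
    """Build a search-results URL (hypothetical helper).

    The parameter names are copied from the example URL; what each
    one means is an assumption.
    """
    params = {
        "cp": postcode,
        "org": "advanced_search",
        "idtt": 2,
        "pxmin": price_min,
        "pxmax": price_max,
        "surfacemin": surface_min,
        "surfacemax": surface_max,
        # A list value is expanded into one key=value pair per item
        # because doseq=True is passed below.
        "idtypebien": list(property_types),
    }
    return BASE_URL + "?" + urlencode(params, doseq=True)


engine_link = build_search_url(40250, 50000, 200000, 20, 100, [2, 1, 11])
```

Passing `doseq=True` is what lets a repeated key such as `idtypebien` appear several times in the query string.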

Before calling

driver.get(engine_link)

I copy engine_link to the clipboard, and it opens fine in Chrome. Next, I wait for all possible redirects to happen:

def wait_for_redirect(wdriver):
    elem = wdriver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 5:
            print("Waited for redirect for 5 seconds!")
            return
        time.sleep(1)
        try:
            elem = wdriver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return
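A different way to sketch the same wait, offered as an assumption rather than a fix: poll the current URL until it stops changing instead of relying on a stale element reference. The helper name `wait_for_url_to_settle` and its parameters are hypothetical. It takes a callable so it can be tried without a browser.

```python
import time


def wait_for_url_to_settle(get_url, timeout=5.0, poll=0.5,
                           stable_polls=2, sleep=time.sleep):
    """Poll get_url() until it returns the same value stable_polls times in a row.

    Returns the last URL seen, whether or not it settled before the timeout.
    """
    last = get_url()
    unchanged = 0
    waited = 0.0
    while waited < timeout and unchanged < stable_polls:
        sleep(poll)
        waited += poll
        current = get_url()
        if current == last:
            unchanged += 1
        else:
            # The URL changed, so a redirect happened: start counting again.
            last = current
            unchanged = 0
    return last
```

With a live driver this would be called as `wait_for_url_to_settle(lambda: driver.current_url)`.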

Now, finally, I want to iterate over all <article> tags on the current page:

for article in driver.find_elements_by_tag_name("article"):

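But this loop never yields anything. The program does not find any article tags; I also tried XPath and CSS selectors. On top of that, the articles are enclosed in a section tag, which cannot be found either.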

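Is there a problem with this specific type of tag in Selenium? Or am I missing something related to JS here? At the bottom of the page there are some JavaScript templates whose names suggest that they generate the search results.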
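One way to narrow this down is to check whether the tags exist in the rendered markup at all, independent of any Selenium locator. The sketch below counts start tags in an HTML string using only the standard library; `count_tags` is a hypothetical helper, and the text of templates inside script blocks is treated as script content by this parser rather than as tags.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags with a given name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return how many <tag> start tags appear in the html string."""
    parser = TagCounter(tag)
    parser.feed(html)
    parser.close()
    return parser.count
```

Calling `count_tags(driver.page_source, "article")` and getting zero would suggest the results were never rendered for this browser, rather than a locator problem.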

Any help is appreciated!

Source

2016-02-04 Thanados

Pretend not to be PhantomJS and add an explicit wait (worked for me):

from selenium import webdriver 
from selenium.webdriver import DesiredCapabilities 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.wait import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 

# set a custom user-agent 
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36" 
dcap = dict(DesiredCapabilities.PHANTOMJS) 
dcap["phantomjs.page.settings.userAgent"] = user_agent 

driver = webdriver.PhantomJS(desired_capabilities=dcap) 
driver.get("http://www.seloger.com/list.htm?cp=40250&org=advanced_search&idtt=2&pxmin=50000&pxmax=200000&surfacemin=20&surfacemax=100&idtypebien=2&idtypebien=1&idtypebien=11") 

# wait for articles to be present
wait = WebDriverWait(driver, 10) 
wait.until(EC.presence_of_element_located((By.TAG_NAME, "article"))) 

# get articles 
for article in driver.find_elements_by_tag_name("article"): 
    print(article.text)
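On a current Selenium 4 install, where the PhantomJS driver and the find_elements_by_* helpers are no longer available, the same two ideas (a custom user agent plus an explicit wait) might look roughly like the sketch below. It assumes Chrome is installed locally; the names `build_chrome_args` and `fetch_article_texts` are hypothetical, and whether a site still serves its results to a headless Chrome is not something this sketch can promise.

```python
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36"
)


def build_chrome_args(user_agent):
    """Command-line switches for a headless Chrome with a custom user agent."""
    return ["--headless", f"--user-agent={user_agent}"]


def fetch_article_texts(url, user_agent=USER_AGENT, timeout=10):
    """Load url and return the text of every <article> element."""
    # Imported lazily so build_chrome_args works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(user_agent):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one <article> exists, or raise after timeout.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "article"))
        )
        return [a.text for a in driver.find_elements(By.TAG_NAME, "article")]
    finally:
        driver.quit()
```

The `try`/`finally` makes sure the browser process is closed even when the wait times out.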

Source

2016-02-04 21:11:56 alecxe

Yes, the user agent did the trick. Thanks! – Thanados
