
Python Scrapy - Selenium - requesting the next page

I am trying to build a webcrawler that follows a link and waits for the JavaScript content to load. It should then collect all the links to the listed articles and move on to the next page. The problem is that it always scrapes the first URL ("https://techcrunch.com/search/heartbleed") and never follows the ones I hand it. Why doesn't the code below scrape from the new URL I pass via the request? My idea so far...

import scrapy
from scrapy.http.request import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
import time


class TechcrunchSpider(scrapy.Spider):
    name = "techcrunch_spider_performance"
    allowed_domains = ['techcrunch.com']
    start_urls = ['https://techcrunch.com/search/heartbleed']

    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)
        #self.driver = webdriver.Chrome("C:\Users\Daniel\Desktop\Sonstiges\chromedriver.exe")
        self.driver.wait = WebDriverWait(self.driver, 5)  # waits up to 5 seconds

    def parse(self, response):
        start = time.time()  # timing measurement
        self.driver.get(response.url)

        # Wait up to 5 seconds (defined above) for the condition to occur;
        # after that a TimeoutException is raised
        try:
            self.driver.wait.until(EC.presence_of_element_located(
                (By.CLASS_NAME, "block-content")))
            print("Found : block-content")
        except TimeoutException:
            self.driver.close()
            print(" block-content NOT FOUND IN TECHCRUNCH !!!")

        # Crawl the JavaScript-generated content with Selenium
        ahref = self.driver.find_elements(By.XPATH, '//h2[@class="post-title st-result-title"]/a')

        # Collect the links to the individual articles
        hreflist = []
        for elem in ahref:
            hreflist.append(elem.get_attribute("href"))

        for elem in hreflist:
            print(elem)
            yield scrapy.Request(url=elem, callback=self.parse_content)

        # Get the link for the next page
        try:
            next = self.driver.find_element(By.XPATH, "//a[@class='page-link next']")
            nextpage = next.get_attribute("href")
            print("NEXT PAGE IS :")
            print(nextpage)
            #newresponse = response.replace(url=nextpage)
            yield scrapy.Request(url=nextpage, dont_filter=False)
        except TimeoutException:
            self.driver.close()
            print(" NEXT NOT FOUND (OR EOF), I'M CLOSING MYSELF !!!")

        end = time.time()
        print("Time elapsed : ")
        finaltime = end - start
        print(finaltime)

    def parse_content(self, response):
        title = self.driver.find_element(By.XPATH, "//h1")
        titletext = title.get_attribute("innerHTML")
        print(" h1 : ")
        print(title)
        print(titletext)

Answer


A first problem would be:

for elem in hreflist:
    print(elem)
    yield scrapy.Request(url=elem, callback=self.parse_content)

This code yields a Scrapy request for every link it found. But:

def parse_content(self, response):
    title = self.driver.find_element(By.XPATH, "//h1")
    titletext = title.get_attribute("innerHTML")

The parse_content function tries to use the driver to parse the page. You could either parse Scrapy's response element instead, or load the page with the webdriver first (self.driver.get(...)).
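For instance, a minimal sketch of the first option, assuming the article pages render their h1 without JavaScript so that the response Scrapy downloaded already contains it:

def parse_content(self, response):
    # Parse the response Scrapy actually downloaded for this request,
    # instead of asking the shared Selenium driver, which still holds
    # whatever page it loaded last.
    titletext = response.xpath("//h1/text()").extract_first()
    print(" h1 : ")
    print(titletext)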

Also, Scrapy is asynchronous while Selenium is not. Scrapy does not block after a yielded request; it keeps executing code, because it is built on Twisted and can have several requests in flight at once. A single Selenium driver instance cannot follow multiple concurrent requests from Scrapy. (One approach is to replace each yield with Selenium code, even if that costs execution time; a sketch of this follows.)
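A minimal sketch of that synchronous variant, reusing the hreflist already built in the question's parse method:

for elem in hreflist:
    # Visit each article with the one driver instead of yielding a
    # concurrent Scrapy request; slower, but the single browser is
    # never asked to render several pages at once.
    self.driver.get(elem)
    title = self.driver.find_element(By.XPATH, "//h1")
    print(title.get_attribute("innerHTML"))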


I added self.driver.get(...) in parse_content and now I can get the h1 title. Still, moving on to the next page doesn't work. How should I replace the yields with Selenium code? Do you have an example? I don't know much about Scrapy or Selenium. Thanks! – BlackBat


Try replacing the line 'yield scrapy.Request(url=elem, callback=self.parse_content)' with the contents of the *parse_content* function. For the next_page problem, you can wrap all the code in your parse function in a loop (**while** there is a next page, do smthg); see the sketch below. – Pablo
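A minimal sketch of that loop, keeping everything inside the single driver; note that find_element raises NoSuchElementException (not the TimeoutException the question catches) when the next link is missing:

from selenium.common.exceptions import NoSuchElementException

def parse(self, response):
    self.driver.get(response.url)
    while True:
        # ... collect and visit the article links on the current page ...
        try:
            next_link = self.driver.find_element(
                By.XPATH, "//a[@class='page-link next']")
        except NoSuchElementException:
            break   # no next link: this was the last results page
        self.driver.get(next_link.get_attribute("href"))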