2017-02-28 123 views
-1

I want to crawl multiple websites (from a CSV file) and extract certain keywords from Chrome's "Inspect element" source (right-click the web page, then choose "Inspect element"). Right now I crawl multiple URLs with Selenium WebDriver.

I can already extract certain keywords from their "View source" code (right-click the page, then choose "View page source" in the browser) with this script:

import urllib2
import csv
import socket
import ssl

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'EXPORT_Results!.csv'
# LIST OF KEYWORDS (MATCHING THE FIELD NAMES ABOVE)
keywords = ['@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']

csv_writerheader(csv_output_file)

with open('top1m-edited.csv', 'r') as f:
    for line in f:
        strdomain = line.strip()
        # INITIALIZE DICT
        data = {'Website': strdomain}

        if '.nl' in strdomain:
            try:
                req = urllib2.Request(strdomain)
                response = urllib2.urlopen(req)
                html_content = response.read()

                # ITERATE THROUGH EACH KEYWORD AND UPDATE DICT
                for searchstring in keywords:
                    if searchstring.lower() in html_content.lower():
                        print(strdomain, searchstring, 'found')
                        data[searchstring] = 'found'
                    else:
                        print(strdomain, searchstring, 'not found')
                        data[searchstring] = 'not found'

                # WRITE THE DICT TO THE OUTPUT FILE
                csv_writer(data, csv_output_file)

            except urllib2.HTTPError:
                print(strdomain, 'HTTP ERROR')

            except urllib2.URLError:
                print(strdomain, 'URL ERROR')

            except socket.error:
                print(strdomain, 'SOCKET ERROR')

            except ssl.CertificateError:
                print(strdomain, 'SSL Certificate ERROR')

Below is the code I wrote to fetch the rendered "Inspect element" source of a website, so that the script above can later extract the keywords from it (for the multiple websites in the CSV file):

from selenium import webdriver 

driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe') 
driver.get('https://www.rocmn.nl/') 

elem = driver.find_element_by_xpath("//*") 
source_code = elem.get_attribute("outerHTML") 

print(source_code) 

I now want to merge the first script with the second, so that it crawls the "Inspect element" source of every website in the CSV and exports the results to a CSV file (as in the first script).

I have no idea where to start to get this working. Please help.
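For context, a minimal sketch of one way such a merge could look: the keyword check from the first script is pulled into a pure function, and `driver.page_source` replaces `urllib2` so the browser-rendered source is scanned instead of the raw response. The driver path and filenames are taken from the question; everything else is an assumed illustration, not a tested solution.

```python
import csv

def keyword_hits(html, keywords):
    # Case-insensitive check of each keyword against the page source.
    lowered = html.lower()
    return {kw: ('found' if kw.lower() in lowered else 'not found') for kw in keywords}

def crawl(domains_csv, output_csv, keywords, driver):
    fieldnames = ['Website'] + keywords
    with open(output_csv, 'w', newline='') as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()
        with open(domains_csv) as f:
            for line in f:
                strdomain = line.strip()
                if '.nl' not in strdomain:
                    continue
                row = {'Website': strdomain}
                try:
                    driver.get(strdomain)
                    # page_source returns the DOM as rendered by the browser,
                    # i.e. what "Inspect element" shows, not the raw HTTP response
                    row.update(keyword_hits(driver.page_source, keywords))
                except Exception as e:
                    print(strdomain, 'ERROR', e)
                    continue
                writer.writerow(row)

if __name__ == '__main__':
    from selenium import webdriver
    driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')
    keywords = ['@media', 'googleadservices.com/pagead/conversion.js',
                'googleadservices.com/pagead/conversion_async.js']
    crawl('top1m-edited.csv', 'EXPORT_Results.csv', keywords, driver)
    driver.quit()
```

Keeping the keyword check separate from the Selenium plumbing also makes the matching logic easy to test without a browser.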

+0

SO is not a code-writing service. We are here to help with programming problems, but you first need to put in some effort. Try combining the two yourself, read some basic programming tutorials, blogs, and books, and give it a shot. If you can't get it working, come back and edit this question to be more specific about what you're stuck on. – JeffC

+0

I know. I'm just asking someone to point me in the right direction. At this point I really don't know where to start. – jakeT888

Answer

0

Collecting keywords from the full page source is not the right approach. The keywords that matter come from the body section and the meta tags. Whatever count you get back, you just need to decrement it by 1:

private Object getTotalCount(String strKeyword) {
    // Count occurrences of the given keyword in the body of the page.
    // Set up a JavascriptExecutor; make sure the driver
    // (HtmlUnitDriver or any other) has JavaScript enabled.
    JavascriptExecutor jsExecutor = wdHTMLUnitDriver;
    Object objCount = null;
    try {
        // Split the body text on the keyword; the number of fragments
        // is (occurrences + 1), hence the decrement mentioned above.
        objCount = jsExecutor.executeScript(
            "var temp = document.getElementsByTagName('body')[0].innerText;"
                + "var substrings = temp.split(arguments[0]);"
                + "return (substrings.length);",
            strKeyword);
    } catch (Exception e) {
        e.printStackTrace();
    }
    if (objCount == null)
        return null;
    // Return the raw count found by the JavaScript executor.
    return objCount.toString();
}
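The decrement-by-1 rule above follows from how splitting works: splitting a string on a keyword yields one more fragment than there are occurrences. The same count in the question's language, as a quick sketch:

```python
def keyword_count(text, keyword):
    # Splitting on the keyword produces (occurrences + 1) fragments,
    # hence the decrement by 1.
    return len(text.split(keyword)) - 1

print(keyword_count('ad code ad banner ad', 'ad'))  # 3
```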