I want to crawl multiple websites (listed in a CSV file) and extract certain keywords from the source code that Chrome's "Inspect element" shows (right-click the page, then choose Inspect element). At the moment I crawl the URLs with the Selenium webdriver.
I can already extract those keywords from each site's "View page source" code (right-click the page, then choose View page source in the browser) with this script:
import urllib2
import csv

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'EXPORT_Results!.csv'
# LIST OF KEY WORDS (TITLE CASE TO MATCH FIELD NAMES)
keywords = ['@media', 'googleadservices.com/pagead/conversion.js', 'googleadservices.com/pagead/conversion_async.js']

csv_writerheader(csv_output_file)

with open('top1m-edited.csv', 'r') as f:
    for line in f:
        strdomain = line.strip()
        # INITIALIZE DICT
        data = {'Website': strdomain}
        if '.nl' in strdomain:
            try:
                req = urllib2.Request(strdomain.strip())
                response = urllib2.urlopen(req)
                html_content = response.read()
                # ITERATE THROUGH EACH KEY AND UPDATE DICT
                for searchstring in keywords:
                    if searchstring.lower() in str(html_content).lower():
                        print(strdomain, searchstring, 'found')
                        data[searchstring] = 'found'
                    else:
                        print(strdomain, searchstring, 'not found')
                        data[searchstring] = 'not found'
                # CALL METHOD PASSING DICT AND OUTPUT FILE
                csv_writer(data, csv_output_file)
            except urllib2.HTTPError:
                print(strdomain, 'HTTP ERROR')
            except urllib2.URLError:
                print(strdomain, 'URL ERROR')
            except urllib2.socket.error:
                print(strdomain, 'SOCKET ERROR')
            except urllib2.ssl.CertificateError:
                print(strdomain, 'SSL Certificate ERROR')
Below is the code I wrote to fetch a website's "Inspect element" source, so that I can later extract keywords from it with the script above (for the multiple websites in the CSV file):
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')
driver.get('https://www.rocmn.nl/')
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
print(source_code)
I would now like to merge the first script with the second, so that it crawls only the "Inspect element" source of every website in the CSV and exports the results to a CSV file (as in the first script).
I have absolutely no idea where to start to get this working. Please help.
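(Editor's note: not an answer from the thread, but a minimal sketch of how the two scripts could be combined, assuming Selenium renders each page and the keyword check runs on `driver.page_source`. The function names `check_keywords`, `fetch_source`, and `crawl` are illustrative, and `chromedriver` is assumed to be on the PATH:)

```python
import csv

FIELDNAMES = ['Website', '@media',
              'googleadservices.com/pagead/conversion.js',
              'googleadservices.com/pagead/conversion_async.js']
KEYWORDS = FIELDNAMES[1:]

def check_keywords(html, keywords=KEYWORDS):
    # Map each keyword to 'found' / 'not found', case-insensitively.
    lowered = html.lower()
    return {kw: ('found' if kw.lower() in lowered else 'not found')
            for kw in keywords}

def fetch_source(url, driver):
    # page_source returns the DOM after JavaScript has run, which is
    # what Chrome's "Inspect element" shows (unlike urllib2's raw HTML).
    driver.get(url)
    return driver.page_source

def crawl(input_csv, output_csv):
    # Imported here so the pure helpers above work without Selenium installed.
    from selenium import webdriver
    driver = webdriver.Chrome()  # assumes chromedriver is on the PATH
    try:
        with open(output_csv, 'w', newline='') as out:
            writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
            writer.writeheader()
            with open(input_csv) as f:
                for line in f:
                    domain = line.strip()
                    if '.nl' not in domain:
                        continue
                    row = {'Website': domain}
                    try:
                        row.update(check_keywords(fetch_source(domain, driver)))
                    except Exception as exc:  # WebDriverException, timeouts, ...
                        print(domain, 'ERROR:', exc)
                        continue
                    writer.writerow(row)
    finally:
        driver.quit()
```

The key difference from the first script is that the page is rendered by a real browser before the keyword check, so markup injected by JavaScript (such as dynamically added conversion tags) is also searched.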
SO is not a code-writing service. We help with programming problems here, but you need to put in some effort first. Try combining the two yourself, read some basic programming tutorials, blogs, and books, and give it a shot. If you can't get it working, come back and edit this question to be more specific about the problem you're hitting. – JeffC
I know. I'm just asking someone to point me in the right direction; at this point I really don't know where to begin. – jakeT888