2017-03-03 121 views

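How can I crawl only certain URLs from a CSV file using Python?

I have a CSV file with many URLs, all with different domain extensions (.com, .eu, .org, and so on). I only want to crawl the domains with the .nl extension, which I tried to do in Python 2.7 with if '.nl' in row: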

from selenium import webdriver 
import csv 

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion'] 

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'output!.csv' 

driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')  

keywords = ['@media', 'googleadservices.com/pagead/conversion'] 

csv_writerheader(csv_output_file) 

with open('top1m-edited.csv') as example_file:
    example_reader = csv.reader(example_file)
    for row in example_reader:

        # INITIALIZE DICT
        data = {'Website': row}

        if '.nl' in row:  # MAKING THE DOMAIN DISTINCTION HERE
            try:
                driver.get(row[0])
                html = driver.page_source

                for searchstring in keywords:
                    if searchstring.lower() in html.lower():
                        print (row, searchstring, 'FOUND!')
                        data[searchstring] = 'FOUND!'
                    else:
                        print (row, searchstring, 'not found')
                        data[searchstring] = 'not found'

                csv_writer(data, csv_output_file)

            except:
                pass

Printed output:

C:\Python27\python.exe "C:/Users/Jacob/PycharmProjects/Testing/fooling around 2.py" 

Process finished with exit code 0 

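So in this state my script basically does nothing, apart from exporting a CSV file with almost no results.

However, when I simply leave out if '.nl' in row:, the script works perfectly.

What should I change so that the script only picks up and searches the URLs with a .nl domain?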

Answer

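In this loop:

for row in example_reader:

row is a list. So '.nl' in row looks for an item in the list that is exactly '.nl'; it does not search inside the URL string. You have a few options here. If the CSV file will only ever contain a single column with the URL, you can change this: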

if '.nl' in row: 

to this:

if '.nl' in row[0]: 
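
To see why the original test never matches, here is a minimal sketch. The row below is made up for illustration and is not taken from the asker's file:

```python
# A csv.reader row is a list of column strings, even with a single column.
row = ['http://example.nl']  # hypothetical row, for illustration only

# List membership compares whole items against '.nl', so this is False.
whole_item_match = '.nl' in row

# String membership is a substring search, so this is True.
substring_match = '.nl' in row[0]

print(whole_item_match, substring_match)
```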

Edit: Also, anywhere you use row will need to change to row[0], for example data = {'Website': row[0]}
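
One caveat: a plain substring test also matches URLs where '.nl' appears somewhere other than the end of the domain, such as in a path. If that matters for your file, a stricter variant could look at the host name only. The following is just a sketch under assumptions: the helper name is_nl_url and the sample rows are hypothetical, and the sample stands in for a one-column CSV file.

```python
import csv

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def is_nl_url(url):
    """Return True if the host name of url ends with '.nl' (hypothetical helper)."""
    # urlparse only fills in the host when a scheme or '//' is present,
    # so bare domains such as 'example.nl' get a '//' prefix first.
    if '//' not in url:
        url = '//' + url
    host = urlparse(url).hostname or ''
    return host.lower().endswith('.nl')


# Made-up rows standing in for the one-column CSV file.
sample = "http://example.nl\nhttp://example.com/page.nl\nexample.org\nshop.example.NL/cart\n"

# csv.reader accepts any iterable of lines, so a list of strings works here.
nl_urls = [row[0] for row in csv.reader(sample.splitlines())
           if row and is_nl_url(row[0])]
print(nl_urls)
```

In the asker's loop, the condition if is_nl_url(row[0]): would take the place of if '.nl' in row[0]:. The if row guard skips empty lines, which csv.reader returns as empty lists.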


Thank you very much, it works now! – jakeT888
