2017-08-07 134 views

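I'm currently running this code to scrape article URL links into a CSV file, and then to visit those URLs (read back from the CSV file) and scrape the corresponding information into a text file. I can't resolve this error: ValueError: unknown url type: Link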

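I can scrape the links into the CSV file, but I can't read the CSV file back to scrape the further information (the text file isn't created either). Instead I get a ValueError.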

import csv 
from lxml import html 
from time import sleep 
import requests 
from bs4 import BeautifulSoup 
import urllib 
import urllib2 
from random import randint 

outputFile = open("All_links.csv", r'wb') 
fileWriter = csv.writer(outputFile) 

fileWriter.writerow(["Link"]) 
#fileWriter.writerow(["Sl. No.", "Page Number", "Link"]) 

url1 = 'https://www.marketingweek.com/page/' 
url2 = '/?s=big+data' 

sl_no = 1 

#iterating from page 1 through page 360 (xrange excludes the stop value 361)
for i in xrange(1, 361): 

    #generating final url to be scraped using page number 
    url = url1 + str(i) + url2 

    #Fetching page 
    response = requests.get(url) 
    sleep(randint(10, 20)) 
    #using html parser 
    htmlContent = html.fromstring(response.content) 

    #Capturing all 'a' tags under h2 tag with class 'hentry-title entry-title' 
    page_links = htmlContent.xpath('//div[@class = "archive-constraint"]//h2[@class = "hentry-title entry-title"]/a/@href') 
    for page_link in page_links:
        print page_link
        fileWriter.writerow([page_link])
        sl_no += 1

with open('All_links.csv', 'rb') as f1: 
    f1.seek(0) 
    reader = csv.reader(f1) 

    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))

        with open('LinksOutput.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')

This is the error I'm getting:

File "c:\users\rrj17\documents\visual studio 2015\Projects\webscrape\webscrape\webscrape.py", line 47, in <module> 
    soup = BeautifulSoup(urllib2.urlopen(url)) 
    File "C:\Python27\lib\urllib2.py", line 154, in urlopen 
    return opener.open(url, data, timeout) 
    File "C:\Python27\lib\urllib2.py", line 421, in open 
    protocol = req.get_type() 
    File "C:\Python27\lib\urllib2.py", line 283, in get_type 
    raise ValueError, "unknown url type: %s" % self.__original 
ValueError: unknown url type: Link 

Any help would be appreciated.

Answers

Try skipping the first line of your CSV file. You are probably trying to parse the header row ("Link") as a URL without realizing it.

with open('All_links.csv', 'rb') as f1: 
    reader = csv.reader(f1) 
    next(reader) # read the header and send it to oblivion 

    for line in reader: # NOW start reading
        ...
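
To see why the header matters, here is a minimal sketch. It is written for Python 3 (the question uses Python 2) and uses an in-memory CSV in place of All_links.csv; the two example.com URLs are made-up placeholders. It shows that csv.reader returns the header like any other row, and that next(reader) consumes it before the loop starts:

```python
import csv
import io

# Stand-in for All_links.csv; the two URLs are made-up placeholders.
csv_text = "Link\nhttps://example.com/a\nhttps://example.com/b\n"

# Without skipping: the first "url" is the literal header text "Link",
# which is what urlopen() rejects as an unknown url type.
rows = [line[0] for line in csv.reader(io.StringIO(csv_text))]
print(rows[0])

# With next(reader): the header row is consumed before the loop starts,
# so only the real links remain.
reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
urls = [line[0] for line in reader]
print(urls)
```

In the first case the string "Link" would be handed to urlopen, which matches the "unknown url type: Link" message in the traceback.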

You also don't need f1.seek(0), because a file opened in read mode already points to the start of the file.
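
As a quick check of the seek(0) point, the sketch below (Python 3, using a throwaway temporary file with hypothetical contents rather than All_links.csv) opens a file for reading and inspects the position before anything is read:

```python
import os
import tempfile

# Write a small throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("Link\nhttps://example.com/a\n")
    path = tmp.name

# A file freshly opened for reading starts at offset 0,
# so an explicit seek(0) at this point changes nothing.
with open(path, "r") as f1:
    start_offset = f1.tell()
    first_line = f1.readline()

os.remove(path)
print(start_offset, repr(first_line))
```

seek(0) is only useful if the same file object has already been read from and you want to start over.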

Problem solved! Thanks a lot! :) – Rrj17

@Rrj17 You unmarked this... was anything wrong? –

No.. that happened by mistake. Sorry for the confusion. – Rrj17
