2017-10-21 119 views

Unable to read the URLs from a txt file. I want to read and open the URL addresses in the txt file one by one, and extract the title from each URL's page source with a regular expression. Error message:

Traceback (most recent call last):
  File "Mypy.py", line 14, in <module>
    UrlsOpen = urllib2.urlopen(listSplit)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 420, in open
    req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'

Mypy.py

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 
import re 
import requests 
import urllib2 
import threading 

UrlListFile = open("Url.txt","r") 
UrlListRead = UrlListFile.read() 
UrlListFile.close() 
listSplit = UrlListRead.split('\r\n') 


UrlsOpen = urllib2.urlopen(listSplit) 
ReadSource = UrlsOpen.read().decode('utf-8') 
regex = '<title.*?>(.+?)</title>' 
comp = re.compile(regex) 
links = re.findall(comp,ReadSource) 
for i in links: 
    SaveDataFiles = open("SaveDataMyFile.txt","w") 
    SaveDataFiles.write(i) 
SaveDataFiles.close() 

Could you add a sample of your 'Url.txt' contents? – fievel


@fievel My Url.txt: https://i.stack.imgur.com/s81Mt.png –


Could you copy the contents of your Url.txt file and paste it into your question using code formatting? It would make it easier for us to help you debug. – PeterH

Answer


When you call urllib2.urlopen(listSplit), listSplit is a list, but urlopen expects a string or Request object. The simple fix is to iterate over listSplit and pass each URL to urlopen individually instead of passing the whole list.

Also, re.findall() returns a list of matches for each ReadSource searched, so you end up with one list per website. You can handle this a couple of ways:

I chose to handle it by just making a list of lists:

websites = [[link, link], [link], [link, link, link]]

and iterating over both lists. That lets you do something specific with each website's list of URLs (e.g. put them in different files).

You could also flatten the websites list so it contains just the links, rather than inner lists that in turn contain the links:

links = [link, link, link, link]
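A minimal sketch of that flattening, using made-up title strings in place of real scraped results, could look like this:

```python
# Hypothetical nested result, shaped like what appending re.findall()
# output per site produces
websites = [["Title A", "Title B"], ["Title C"], ["Title D"]]

# Flatten with a nested list comprehension so links holds the strings directly
links = [link for website in websites for link in website]
print(links)  # ['Title A', 'Title B', 'Title C', 'Title D']
```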

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 
import re 
import urllib2 
from pprint import pprint 

UrlListFile = open("Url.txt", "r") 
UrlListRead = UrlListFile.read() 
UrlListFile.close() 
listSplit = UrlListRead.splitlines() 
pprint(listSplit) 
regex = '<title.*?>(.+?)</title>' 
comp = re.compile(regex) 
websites = [] 
for url in listSplit: 
    UrlsOpen = urllib2.urlopen(url) 
    ReadSource = UrlsOpen.read().decode('utf-8') 
    websites.append(re.findall(comp, ReadSource)) 

with open("SaveDataMyFile.txt", "w") as SaveDataFiles: 
    for website in websites: 
        for link in website: 
            pprint(link) 
            SaveDataFiles.write(link.encode('utf-8')) 
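The title regex used above can be exercised against a static snippet without any network access (the HTML string here is just a made-up stand-in for a real UrlsOpen.read() result):

```python
import re

regex = '<title.*?>(.+?)</title>'
comp = re.compile(regex)

# Made-up page source standing in for a downloaded page
ReadSource = '<html><head><title>Example Domain</title></head></html>'
print(re.findall(comp, ReadSource))  # ['Example Domain']
```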

Traceback (most recent call last):
  File "Mypy.py", line 14, in <module>
    UrlsOpen = urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 427, in open
    req = meth(req)
  File "/usr/lib/python2.7/urllib2.py", line 1126, in do_request_
    raise URLError('no host given')
urllib2.URLError: <urlopen error no host given>


I updated the code to handle more kinds of line endings with '.splitlines()' and fixed an encoding error with 'link.encode('utf-8')'. Try the new code. – PeterH
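The difference .splitlines() makes is easy to see on a small made-up sample with mixed line endings:

```python
raw = "http://a.example\r\nhttp://b.example\nhttp://c.example"

# split('\r\n') only splits on Windows line endings, so the bare '\n'
# is left embedded inside an entry (which then fails in urlopen)
print(raw.split('\r\n'))  # ['http://a.example', 'http://b.example\nhttp://c.example']

# splitlines() handles \n, \r\n and \r uniformly
print(raw.splitlines())   # ['http://a.example', 'http://b.example', 'http://c.example']
```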