Checking URLs with Python

I want to test a whole list of websites to see whether each URL is valid, and I want to know which ones are not.

import urllib2 

filename=open(argfile,'r') 
f=filename.readlines() 
filename.close() 

def urlcheck() : 
    for line in f: 
     try: 
      urllib2.urlopen() 
      print "SITE IS FUNCTIONAL" 
     except urllib2.HTTPError, e: 
      print(e.code) 
     except urllib2.URLError, e: 
      print(e.args) 
urlcheck()

2017-02-04 cbos93

In what way does your code not work? – usr2564301

I suggest you use the requests library.

import requests 
resp = requests.get('your url') 
if not resp.ok: 
    print resp.status_code

2017-02-04 14:55:32 Kroustou

You need to pass the URL to urlopen:

def urlcheck() : 
    for line in f: 
     try: 
      urllib2.urlopen(line) 
      print line, "SITE IS FUNCTIONAL" 
     except urllib2.HTTPError, e: 
      print line, "SITE IS NOT FUNCTIONAL" 
      print(e.code) 
     except urllib2.URLError, e: 
      print line, "SITE IS NOT FUNCTIONAL" 
      print(e.args) 
     except Exception,e: 
      print line, "Invalid URL"

Some edge cases and things to consider

A little about error codes and HTTPError

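Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).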

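Even when an HTTPError is raised, you can still inspect the error code.

So sometimes, even when the URL is valid and available, the request may raise an HTTPError with a code such as 403 or 401. 5xx codes usually come from server errors, which are often temporary.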
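
To make that distinction concrete, here is a minimal sketch of how the code from an HTTPError (e.code) could be grouped. The helper name classify_status and its labels are my own invention for illustration, not part of urllib2 or of the answer above.

```python
# Hypothetical helper (not from the original answer): group an HTTP status
# code by what it suggests about the URL itself.
def classify_status(code):
    if 200 <= code < 400:
        return "ok"                     # success or redirect
    if code in (401, 403):
        return "exists-but-restricted"  # the URL is there, access is denied
    if code == 404:
        return "not-found"
    if 500 <= code < 600:
        return "server-error"           # often temporary, may work later
    return "other"

for code in (200, 403, 404, 503):
    print("{} {}".format(code, classify_status(code)))
```

With something like this, a 401 or 403 can be reported as "the site exists" rather than lumped together with a 404.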

2017-02-04 14:55:33

You have to pass the URL as an argument to the urlopen function.

Sometimes a valid URL will give ...

import urllib2 

filename=open(argfile,'r') 
f=filename.readlines() 
filename.close() 

def urlcheck() : 
    for line in f: 
     try: 
      urllib2.urlopen(line) # careful here 
      print "SITE IS FUNCTIONAL" 
     except urllib2.HTTPError, e: 
      print(e.code) 
     except urllib2.URLError, e: 
      print(e.args) 
urlcheck()

2017-02-04 14:57:46

import urllib2 

def check(url): 
    request = urllib2.Request(url) 
    request.get_method = lambda : 'HEAD' # fetch only the headers, not the body (faster)
    request.add_header('Accept-Encoding', 'gzip, deflate, br') # request header that advertises accepted compression
    try: 
     response = urllib2.urlopen(request) 
     return response.getcode() < 400 # 4xx and 5xx codes mean failure
    except Exception: 
     return False  

''' 
Contents of "/tmp/urls.txt" 

http://www.google.com 
https://fb.com 
http://not-valid 
http://not-valid.nvd 
not-valid 
''' 
filename = open('/tmp/urls.txt', 'r') 
urls = filename.readlines() 
filename.close() 

for url in urls: 
    url = url.strip() # drop the trailing newline left by readlines()
    print url + ' ' + str(check(url))
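
For anyone on Python 3, where urllib2 was split into urllib.request and urllib.error, the same HEAD-request idea might be sketched as below. The helper name build_head_request and the timeout default are assumptions of mine, not part of the answer above.

```python
import urllib.error
import urllib.request

def build_head_request(url):
    # method="HEAD" asks the server for the headers only, no body.
    # Raises ValueError if the URL has no scheme (e.g. "not-valid").
    return urllib.request.Request(url, method="HEAD")

def check(url, timeout=10):
    try:
        request = build_head_request(url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.getcode() < 400
    except (ValueError, urllib.error.URLError):
        # HTTPError (4xx/5xx) is a subclass of URLError, so it lands here too.
        return False
```

Some servers answer HEAD differently from GET, so a False from a check like this is a hint rather than proof that the page is down.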

2017-02-04 15:14:26 cetver

I would probably write it like this:

import urllib2 

with open('urls.txt') as f: 
    urls = [url.strip() for url in f.readlines()] 

def urlcheck() : 
    for url in urls: 
     try: 
      urllib2.urlopen(url) 
     except (ValueError, urllib2.URLError) as e: 
      print('invalid url: {}'.format(url)) 

urlcheck()

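Some changes from the OP's original implementation:

Use a context manager to open/close the data file
Strip the newline characters from the URLs as they are read from the file
Use better variable names
Switch to the more modern exception handling style
Also catch ValueError for malformed URLs
Display a more useful error message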

Example output:

$ python urlcheck.py 
invalid url: http://www.google.com/wertbh 
invalid url: htp:/google.com 
invalid url: google.com 
invalid url: https://wwwbad-domain-zzzz.com
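
As a rough Python 3 counterpart of this approach (urllib2 does not exist there), something like the following should behave similarly. The function name is_reachable and the timeout value are my own assumptions.

```python
import urllib.error
import urllib.request

def is_reachable(url, timeout=10):
    """Return True if urlopen succeeds for url, False otherwise."""
    try:
        # The with-block closes the response as soon as we are done.
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (ValueError, urllib.error.URLError):
        # ValueError: no scheme at all, e.g. "google.com".
        # URLError: unknown scheme, DNS failure, refused connection;
        # HTTPError (4xx/5xx) is a subclass of URLError.
        return False

def report_invalid(urls):
    # Print the same message as the answer above for each failing URL.
    for url in urls:
        if not is_reachable(url):
            print('invalid url: {}'.format(url))
```

Calling report_invalid(["google.com", "htp:/google.com"]) would report both entries, the first because it has no scheme and the second because no handler exists for the scheme "htp".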

2017-02-04 15:17:12
