How do I respect robots.txt using Python 2.7?

I am trying to crawl an entire website using Python 2.7.

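I use robotparser. I open every 'a' link on the website and add it to the list of pages to retrieve. I also parse the robots.txt file.
The problem is this: I am trying to avoid every path listed in the robots.txt file, but those paths still end up in the list of pages to crawl.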

How do I remove the robots.txt paths from my crawl list?

I couldn't find any help on Stack Overflow.

My code is below:

import robotparser
import urlparse
import urllib
import urllib2
from BeautifulSoup import *

AGENT_NAME = 'PYMOTW'
URL_BASE = 'website'
urls = [URL_BASE]
visited = [URL_BASE] # Create a copy

parser = robotparser.RobotFileParser()
parser.set_url(urlparse.urljoin(URL_BASE, 'robot.txt'))
parser.read()

PATHS = [
    '/..../',
    ]

for path in PATHS:
    print '%6s : %s' % (parser.can_fetch(AGENT_NAME, path), path)
    url = urlparse.urljoin(URL_BASE, path)
    print '%6s : %s' % (parser.can_fetch(AGENT_NAME, url), url)
    robot = [url]

while (len(urls) > 0 and robot != True):
    html = urllib.urlopen(urls[0]).read()
    soup = BeautifulSoup(html) # Parse All HTML using BeautifulSoup
    urls.pop(0)
    # Retrieve all of Tags as a list
    for tags in soup.findAll('a', href = True):
        tags['href'] = urlparse.urljoin(URL_BASE, tags['href'])
        if URL_BASE in tags['href'] and tags['href'] not in visited:
            urls.append(tags['href'])
            visited.append(tags['href'])
        c = len(visited)

print visited
print 'page visited', c

Source: CDS, 2015-11-08

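Welcome to Stack Overflow! I edited your post to remove the code snippet feature, which only applies to HTML/JavaScript that runs in a web browser. Besides removing the Python 3 tag, I also fixed the spelling and added formatting to improve readability. Improving your question like this increases the chance that people will read it and give you a good answer.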

Thanks @AnthonyGeoghegan – CDS

Hi @J.F.Sebastian. The return value is a list of True values. – CDS

You have several errors in your script.

I refactored a lot of it and tried to explain the changes.

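First, you are using a loop where you want recursion (for each page you fetch, you collect its links and repeat the process).
Then, for some reason I don't know, urlparse.urljoin was failing (your URL was truncated), so I concatenate manually.
Beautiful Soup is heavy, so I refactored to parse only the links rather than the whole page. A page can contain both relative and absolute links, so you need to handle both.
robotparser seems pretty dumb: paths must be exact (/test and /test/ are not the same to it). It also does not seem to understand full URLs when they are not specified that way in robots.txt (testing http://example.com/test matches */test but not /test ...).
Edit: I made the script a bit more robust by filtering the matching URLs.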
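The points about robotparser can be checked offline. The following is a minimal sketch, not taken from the answer: it feeds a made-up rule set (the /private/ rule and the example.com URL are assumptions) to the standard-library parser through parse(), so no network access is needed. The import fallback only exists so that the same snippet also runs on Python 3. In the standard-library versions I am aware of, rules are matched as path prefixes and can_fetch reduces a full URL to its path before matching, so the results may differ from what was observed above.

```python
try:                       # Python 3
    from urllib import robotparser
except ImportError:        # Python 2.7, as used in this thread
    import robotparser

# Hypothetical rules, parsed from memory instead of fetched from a site.
RULES = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = robotparser.RobotFileParser()
parser.parse(RULES)

AGENT = 'PYMOTW'
checks = [
    '/private/',                             # exactly the rule
    '/private/page.html',                    # below the rule (prefix match)
    '/private',                              # no trailing slash
    '/public.html',                          # unrelated path
    'http://example.com/private/page.html',  # full URL
]

# Map each candidate to the parser's verdict and print a small table.
results = dict((c, parser.can_fetch(AGENT, c)) for c in checks)
for c in checks:
    print('%-40s %s' % (c, results[c]))
```

With these rules, only '/private' (no trailing slash) and '/public.html' are reported as fetchable, which matches the remark that /test and /test/ are treated differently.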

這給我：

import robotparser
import urlparse
import urllib
from BeautifulSoup import BeautifulSoup, SoupStrainer

AGENT_NAME = 'PYMOTW'
URL_BASE = 'http://www.dcs.bbk.ac.uk/~martin/sewn/ls3'
DOMAIN = urlparse.urlparse(URL_BASE).hostname
visited = ['/'] # URLs that have already been processed

parser = robotparser.RobotFileParser()
parser.set_url(URL_BASE + '/robots.txt')
parser.read()


def process_url(url):
    parsed_path = urlparse.urlparse(url)
    if not parsed_path.hostname:
        # relative path: prepend the base to get an absolute URL
        url = URL_BASE + url
    elif parsed_path.hostname != DOMAIN:
        # absolute URL on another domain: do not follow it
        print 'External domain ignored: %s' % parsed_path.hostname
        return

    # ensure we are allowed to fetch url
    if not parser.can_fetch(AGENT_NAME, parsed_path.path):
        print 'Not allowed to fetch %s' % parsed_path.path
        return

    # ensure we did not already visit it
    if url in visited:
        print 'Ignoring already visited %s' % url
        return

    print 'Visiting: %s' % url
    html = urllib.urlopen(url).read()
    visited.append(url)
    # parse only the <a href=...> tags instead of the whole page
    links = BeautifulSoup(html, parseOnlyThese=SoupStrainer('a', href=True))

    # recurse into every usable link found on the page
    for link in links:
        parsed_link = urlparse.urlparse(link['href'])
        if len(link['href']) == 0:
            print 'Ignoring empty link'
        elif link['href'][0] == '#':
            print 'Ignoring hash link %s' % link['href']
        elif parsed_link.scheme not in ('', 'http', 'https'):
            print 'Ignoring non http(s) links %s' % link['href']
        else:
            process_url(link['href'])

PATHS = [
    '/testpage.html',
    '/files/',
    '/images/',
    '/private/'
]
for path in PATHS:
    process_url(path)
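The question asked how to drop disallowed paths from the crawl list. That filtering step can also be done on its own, before anything is fetched. The sketch below is one possible way to do it, not part of the answer's script: filter_allowed is a hypothetical helper, and the rules and example.com URLs are made up so the snippet runs without network access. In a real crawl the parser would be loaded with set_url(...) and read() as in the script above.

```python
try:                       # Python 3
    from urllib import robotparser
except ImportError:        # Python 2.7, as used in this thread
    import robotparser


def filter_allowed(candidates, parser, agent):
    """Return the URLs the parser permits for agent, in order, without duplicates."""
    allowed = []
    seen = set()
    for url in candidates:
        if url in seen:
            continue            # already considered, skip the duplicate
        seen.add(url)
        if parser.can_fetch(agent, url):
            allowed.append(url)
    return allowed


# Hypothetical rules standing in for a site's robots.txt.
parser = robotparser.RobotFileParser()
parser.parse([
    'User-agent: *',
    'Disallow: /private/',
    'Disallow: /files/',
])

candidates = [
    'http://example.com/testpage.html',
    'http://example.com/private/secret.html',
    'http://example.com/files/',
    'http://example.com/images/logo.png',
    'http://example.com/testpage.html',   # duplicate on purpose
]

crawl_list = filter_allowed(candidates, parser, 'PYMOTW')
print(crawl_list)
```

Here the list that is left contains only the testpage.html and images/logo.png URLs. Applying the check when links are collected, rather than after the crawl, keeps disallowed pages from ever being requested.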

Source: Cyrbil, 2015-11-09 10:04:25

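I had not defined any functions. I am a beginner, so I tried to build a crawler with my basic knowledge. Your corrections are clear to me. I understand the recursion well, even inside a function (a first for me), but I have some questions: - Why are all the domains in the stack visited? Can I work only within /ls3/...? - In def process_url(url): one thing is unclear to me: in url = URL_BASE + url, is url the parameter of the function being called, process_url? – CDS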

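You can add a filter so that nothing outside the domain is crawled. Put a new if condition before 'print 'Visiting: %s' % url'. Also note that this "bot" does not keep track of which domain it is on, and always prepends the 'URL_BASE' variable. Yes, 'url' is the parameter of the function 'process_url', but since it can be an absolute or a relative path, I redefine it so that it always ends up as an absolute URL. – Cyrbil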
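The two ideas in this comment, turning every link into an absolute URL and skipping anything outside the part of the site being crawled, can be sketched with urljoin instead of string concatenation. This is an illustration under assumptions rather than the answer's code: normalize and in_scope are hypothetical helpers, and URL_BASE is the answer's value with a trailing slash added. The trailing slash matters to urljoin, which otherwise replaces the last path segment. That is a likely explanation for the truncated URLs mentioned in the answer, although the thread does not confirm it.

```python
try:                       # Python 3
    from urllib.parse import urljoin, urlparse
except ImportError:        # Python 2.7, as used in this thread
    from urlparse import urljoin, urlparse

# Assumption: same base as in the answer, but with a trailing slash.
URL_BASE = 'http://www.dcs.bbk.ac.uk/~martin/sewn/ls3/'


def normalize(page_url, href):
    """Resolve href against the page it was found on and drop any #fragment."""
    absolute = urljoin(page_url, href)
    return absolute.split('#', 1)[0]


def in_scope(url, base=URL_BASE):
    """True when url is an http(s) URL located under base."""
    return urlparse(url).scheme in ('http', 'https') and url.startswith(base)


# Without the trailing slash, urljoin replaces the last segment ("ls3").
truncated = urljoin('http://www.dcs.bbk.ac.uk/~martin/sewn/ls3', 'robots.txt')
print(truncated)

# A few links as they might appear on a page below URL_BASE.
page = URL_BASE + 'files/index.html'
for href in ['../images/', 'notes.html#top', '/robots.txt', 'mailto:someone@example.com']:
    url = normalize(page, href)
    print('%-30s %-65s %s' % (href, url, in_scope(url)))
```

Calling in_scope just before the 'Visiting' print would keep the crawl inside /ls3/, which is the restriction asked about above.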

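@Cyrbil If I insert a new condition before print 'Visiting: %s' % url, the program keeps reading the robots paths rather than the tags I actually need to crawl. Do I have to create a new function to read my links? That is not clear to me. How do I crawl the pages if I am crawling the opposite part? And what exactly does "bot" mean? – CDS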
