Finding specific URLs in a list of URLs using Python

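I want to crawl a list of URLs and check whether specific links appear on those pages. I wrote the program below and it works perfectly. However, I'm stuck in two places.

How can I read the links from a text file instead of using an array?
The crawler takes 4 minutes to crawl 100 pages.

Is there a way to make it faster?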

from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import re
import threading
import time

start = time.time()
# Links I want to find
url = ["example.com/one", "example.com/two", "example.com/three"]

# Links I want to find the above links in...
url_list = ["example.com/1000", "example.com/1001", "example.com/1002",
            "example.com/1003", "example.com/1004"]

print_lock = threading.Lock()
# with open("links.txt") as f:
#     url_list1 = [url.strip() for url in f.readlines()]

def fetch_url(url):
    # NOTE: the parameter `url` shadows the global list of target links, and
    # every thread walks the whole url_list instead of only its own page.
    for line1 in url_list:
        print "Crawled" " " + line1
        try:
            html_page = urllib2.urlopen(line1)
            soup = BeautifulSoup(html_page)
            link = soup.findAll(href=True)
        except urllib2.HTTPError:
            continue  # skip pages that fail to load
        for link1 in link:
            url1 = link1.get("href")
            for url_input in url:
                if url_input in url1:
                    with print_lock:
                        print 'Found' " " + url_input + " " 'in' + " " + line1

# One thread per page URL
threads = [threading.Thread(target=fetch_url, args=(url,)) for url in url_list]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print 'Entire job took:', time.time() - start

Source

2015-09-06 import.zee

Thanks to Cory's example. I always struggle with this. –

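I've edited the code based on a few examples. The program is much faster now, but the output prints the same result several times and is sometimes wrong. I tried using a lock to prevent that, but it doesn't work. I haven't figured out multithreading yet. Any help here is much appreciated. Thanks in advance. –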

If you want to read from a text file, use the code you commented out.
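A minimal sketch of that, written for Python 3 and assuming one URL per line. The helper name read_urls is illustrative and not from the question; links.txt is the file name used in the commented-out lines.

```python
def read_urls(path):
    """Return the non-empty lines of a text file, one URL per line."""
    with open(path) as f:
        # Iterating over the file yields one line at a time; strip() drops
        # the trailing newline and any surrounding whitespace.
        return [line.strip() for line in f if line.strip()]
```

It could then replace the hard-coded list, for example url_list = read_urls("links.txt").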

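As for the "performance" problem: your code blocks at the read operation urlopen until the site's content has been returned. Ideally you want to run those requests in parallel. You need a parallel solution, for example one that uses threads.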
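One way to run the requests in parallel with the standard library is a thread pool. The sketch below is only an illustration under a few assumptions: it targets Python 3 (concurrent.futures, urllib.request) rather than the Python 2 of the question, it swaps BeautifulSoup for the standard html.parser so that it is self-contained, and the names HrefCollector, fetch, find_targets and crawl are made up for this example. Each worker handles exactly one page, so no page is fetched twice.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.request import urlopen


class HrefCollector(HTMLParser):
    """Collect every href attribute value found in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def fetch(page_url):
    """Download a page and return its body as text (illustrative helper)."""
    with urlopen(page_url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def find_targets(page_url, targets, fetch=fetch):
    """Return (page_url, targets that occur in some href on that page)."""
    try:
        html = fetch(page_url)
    except (OSError, ValueError):
        # Unreachable pages (URLError is an OSError) and malformed URLs
        # are treated as pages with no matches.
        return page_url, []
    collector = HrefCollector()
    collector.feed(html)
    found = [t for t in targets if any(t in h for h in collector.hrefs)]
    return page_url, found


def crawl(pages, targets, fetch=fetch, workers=20):
    """Check each page once, in parallel; return {page: [targets found]}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: find_targets(p, targets, fetch), pages)
        return dict(results)
```

Because the workers return their results instead of printing them, no lock is needed; the caller can print the dictionary once all pages are done. The fetch parameter is there so a stub can be passed in when trying the sketch without network access.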

Here's a different approach that uses gevent (not part of the standard library).

Source

2015-09-06 16:11:26 Felk

What do you mean, Python doesn't natively support [multithreading](https://docs.python.org/2/library/multiprocessing.html)? – MattDMo

Multithreading != multiprocessing. You don't want to spawn 100 processes to make 100 requests. – Felk

Can you explain yourself? Python also supports [threading](https://docs.python.org/2/library/threading.html), which, if you want to run multiple I/O-bound tasks simultaneously, "*is still an appropriate model.*" – MattDMo
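To illustrate the point about I/O-bound tasks, here is a small Python 3 sketch in which time.sleep stands in for a blocking network read. That substitution is an assumption made so the example runs offline, and fake_request, timed, sequential and threaded are illustrative names. While one thread waits, the others can proceed, so ten 0.2-second waits should overlap instead of adding up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(i):
    """Stand-in for a blocking network read; the thread just waits."""
    time.sleep(0.2)
    return i


def timed(run):
    """Return (result, seconds elapsed) for a zero-argument callable."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


def sequential():
    # One request after another: the waits add up.
    return [fake_request(i) for i in range(10)]


def threaded():
    # Ten worker threads: the waits overlap.
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(fake_request, range(10)))


if __name__ == "__main__":
    _, t_seq = timed(sequential)
    _, t_thr = timed(threaded)
    print("sequential: %.2fs, threaded: %.2fs" % (t_seq, t_thr))
```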
