2017-09-26 141 views

Where is the problem in my code that parses pages taken from a Python Queue object?

from queue import Queue
from threading import Thread
from html.parser import HTMLParser
import urllib.request

hosts = ["http://yahoo.com", "http://google.com", "http://ibm.com"]

queue = Queue()

class ThreadUrl(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            host = self.queue.get()
            url = urllib.request.urlopen(host)
            url.read(4096)
            self.queue.task_done()


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("  attr:", attr)


def consumer():
    for i in range(3):
        t = ThreadUrl(queue)
        t.setDaemon(True)
        t.start()

    for host in hosts:
        parser = MyHTMLParser()
        parser.feed(host)
        queue.put(host)
    queue.join()

consumer()

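My goal is to fetch the content of each URL, read the hosts from the queue, and finally parse the result. When I run the code, it prints nothing. Where should I put the parser?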


parser.feed(host) makes no sense here: host is just the URL string. You need to call the feed method with the HTML returned by url.read(4096). – lcastillov
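
To illustrate the comment above, here is a minimal offline sketch. `TagCollector` and `collect_tags` are hypothetical helpers (not part of the question's code) that record start tags in a list instead of printing them, and the `raw` bytes stand in for what `url.read(4096)` might return. UTF-8 is an assumption; a real page may use another encoding.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: stores start tags in a list instead of printing."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags(text):
    """Feed a str to a fresh parser and return the start tags it saw."""
    parser = TagCollector()
    parser.feed(text)
    parser.close()
    return parser.tags


# A URL is plain text with no markup, so no start tag is ever reported.
url_tags = collect_tags("http://yahoo.com")

# Stand-in for the bytes returned by url.read(4096); feed() expects str,
# so the bytes are decoded first (UTF-8 is assumed here).
raw = b'<html><body><a href="http://yahoo.com">Yahoo</a></body></html>'
html_tags = collect_tags(raw.decode("utf-8", errors="replace"))
```

Feeding the URL yields an empty list, which matches the "prints nothing" symptom, while feeding the decoded HTML yields the tags in document order.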


@lcastillov I understand now, but should I write a new class, or something else? – MishaVacic


Use the parser inside the run method, and only put the URLs into the queue. Create a MyHTMLParser instance inside ThreadUrl.run and process each incoming host there. – lcastillov

Answer


Here is an example:

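from queue import Queue
from threading import Thread
from html.parser import HTMLParser
import urllib.request


NUMBER_OF_THREADS = 3


HOSTS = ["http://yahoo.com", "http://google.com", "http://ibm.com"]


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        for attr in attrs:
            print("\tattr:", attr)


class ThreadUrl(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            host = self.queue.get()
            try:
                with urllib.request.urlopen(host) as response:
                    # read() returns bytes; decode before feeding the parser.
                    # str(bytes) would give the "b'...'" repr instead of the text.
                    # UTF-8 is an assumption; pages may use another encoding.
                    content = response.read(4096).decode("utf-8", errors="replace")
                parser = MyHTMLParser()
                parser.feed(content)
            except Exception as error:
                # Report the failure and keep the worker alive for the next host.
                print("Failed to process", host, error)
            finally:
                # Always mark the item done so queue.join() cannot hang.
                self.queue.task_done()


def consumer():
    queue = Queue()
    for i in range(NUMBER_OF_THREADS):
        thread = ThreadUrl(queue)
        # Daemon threads do not block interpreter exit; setDaemon() is
        # deprecated since Python 3.10 in favor of the daemon attribute.
        thread.daemon = True
        thread.start()
    for host in HOSTS:
        queue.put(host)
    queue.join()


if __name__ == '__main__':
    consumer()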
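

The queue handling in the example can also be tried without network access. The sketch below is only an illustration under stated assumptions: `FAKE_PAGES`, `fake_fetch`, `worker`, and `run` are hypothetical names, and `fake_fetch` stands in for `urllib.request.urlopen(host).read(4096)`. It shows why `task_done()` is best called in a `finally` block: `queue.join()` returns only after every item that was put has been marked done, even when a fetch fails.

```python
from queue import Queue
from threading import Lock, Thread

# Hypothetical stand-in for the network: maps a host to canned HTML bytes.
FAKE_PAGES = {
    "http://a.example": b"<html><body></body></html>",
    "http://b.example": b"<p>hello</p>",
}

results = {}
results_lock = Lock()


def fake_fetch(host):
    """Assumed replacement for urllib.request.urlopen(host).read(4096)."""
    return FAKE_PAGES[host]  # raises KeyError for an unknown host


def worker(queue):
    while True:
        host = queue.get()
        try:
            body = fake_fetch(host).decode("utf-8", errors="replace")
            with results_lock:
                results[host] = body
        except KeyError:
            # Record the failure instead of letting the thread die.
            with results_lock:
                results[host] = None
        finally:
            # Runs on success and on failure, so join() always returns.
            queue.task_done()


def run(hosts, number_of_threads=2):
    queue = Queue()
    for _ in range(number_of_threads):
        Thread(target=worker, args=(queue,), daemon=True).start()
    for host in hosts:
        queue.put(host)
    queue.join()  # blocks until task_done() was called once per host
    return results


run(["http://a.example", "http://b.example", "http://missing.example"])
```

After `run` returns, `results` holds the decoded text for the two known hosts and `None` for the missing one. In real code the parser would be fed inside the `try` block, where `body` is available.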