Trouble running my class-based crawler

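I'm a complete newbie to Python when it comes to using a class to scrape web data, so apologies in advance for any serious mistakes. I've written a script to parse the text of the a tags on the Wikipedia website. I tried to write the code as accurately as I could at my level, but for some reason it throws an error when I run it. My code and the error are given below for your consideration.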

The script:

import requests
from lxml.html import fromstring

class TextParser(object):

    def __init__(self):
        self.link = 'https://en.wikipedia.org/wiki/Main_Page'
        self.storage = None

    def fetch_url(self):
        self.storage = requests.get(self.link).text

    def get_text(self):
        root = fromstring(self.storage)
        for post in root.cssselect('a'):
            print(post.text)

item = TextParser()
item.get_text()

The error:

Traceback (most recent call last): 
    File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\testmatch.py", line 38, in <module> 
    item.get_text() 
    File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\testmatch.py", line 33, in get_text 
    root = fromstring(self.storage) 
    File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\lib\site-packages\lxml\html\__init__.py", line 875, in fromstring 
    is_full_html = _looks_like_full_html_unicode(html) 
TypeError: expected string or bytes-like object


2017-10-18 shayan

You are running the following two lines:

item = TextParser() 
item.get_text()

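When TextParser is initialized, self.storage is set to None. When you then call get_text(), nothing has changed it, so it is still None. fromstring() expects a string or bytes, which is why you get this error.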
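To see the failure in isolation, here is a minimal sketch that passes None straight to lxml's fromstring, which is effectively what get_text() does when fetch_url() was never called. The helper name parse_or_report is hypothetical and exists only for this illustration.

```python
from lxml.html import fromstring


def parse_or_report(html):
    """Parse html with lxml; return the TypeError instead of raising it."""
    try:
        return fromstring(html)
    except TypeError as exc:
        # fromstring() runs a regex check on its input, which rejects None
        return exc


# self.storage is None right after __init__, so this mirrors the failing call
result = parse_or_report(None)
print(type(result).__name__)
```

Passing a real HTML string to the same helper returns a parsed element instead of an exception.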

However, if you change it to the following, self.storage will be populated with a string instead of None.

item = TextParser() 
item.fetch_url() 
item.get_text()

If you want to call get_text without calling fetch_url first, you can have get_text call fetch_url itself.
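One way this could look is sketched below. It is not code from the original answer, only an illustration of the idea: get_text() checks whether self.storage is still None and fetches the page on demand. The sketch uses xpath('//a') rather than cssselect('a') purely so that it does not need the separate cssselect package; both select the same a elements.

```python
import requests
from lxml.html import fromstring


class TextParser(object):

    def __init__(self):
        self.link = 'https://en.wikipedia.org/wiki/Main_Page'
        self.storage = None  # stays None until fetch_url() runs

    def fetch_url(self):
        # Download the page and keep its HTML as a string
        self.storage = requests.get(self.link).text

    def get_text(self):
        # Fetch on demand so callers do not have to call fetch_url() first
        if self.storage is None:
            self.fetch_url()
        root = fromstring(self.storage)
        # Assumption: xpath('//a') is used here in place of cssselect('a')
        for post in root.xpath('//a'):
            print(post.text)
```

With this change, the original two lines (item = TextParser() followed by item.get_text()) should work on their own, and an explicit fetch_url() call beforehand does no harm because the page is only fetched when storage is still None.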


2017-10-18 20:56:36 Jonathan

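Thank you, Jonathan, it works now. I'll accept it as the answer shortly. Please don't forget to suggest how I can run the scraper without calling 'fetch_url()'. That is what I tried to do in the first place. Thanks a lot. – shayan

Well, you can call fetch_url inside the get_text function. – Jonathan

Thanks a lot. That did it. – shayan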
