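Parse an HTML page and extract its URLs without using BeautifulSoup or urllib

I am new to Python, so I apologize if my question is very basic. In my program I need to parse an HTML web page and extract all the links in it. Suppose my web page content looks like this: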

<html><head><title>Fakebook</title><style TYPE="text/css"><!-- 
#pagelist li { display: inline; padding-right: 10px; } 
--></style></head><body><h1>testwebapp</h1><p><a href="/testwebapp/">Home</a></p><hr/><h1>Welcome to testwebapp</h1><p>Random URLs!</p><ul><li><a href="/testwebapp/847945358/">Rennie Tach</a></li><li><a href="/testwebapp/848854776/">Pid Ko</a></li><li><a href="/testwebapp/850558104/">Ler She</a></li><li><a href="/testwebapp/851635068/">iti Sar</a></li><li><a </ul> 
<p>Page 1 of 2 
<ul id="pagelist"><li> 
1 

</li><li><a href="/testwebapp/570508160/fri/2/">2</a></li><li><a href="/testwebapp/570508160/fri/2/">next</a></li><li><a href="/testwebapp/570508160/fri/2/">last</a></li></ul></p> 
</body></html>

Now I need to parse this page content and extract all the links inside it. In other words, I need the following to be extracted from the page:

/testwebapp/847945358/ 
/testwebapp/848854776/ 
/testwebapp/850558104/ 
/testwebapp/851635068/ 
/testwebapp/570508160/fri/2/ 
/testwebapp/570508160/fri/2/ 
/testwebapp/570508160/fri/2/

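I have searched a lot about parsing web pages with Python, such as this, this, or this, but many of those answers use libraries such as urllib, urllib2, BeautifulSoup, or requests. I cannot use these libraries in my program, because my application will run on a machine where they are not installed, so I need to parse the page content manually. My idea was to save the page content in a string, split that string on spaces into an array of strings, and then check each item: if it contains /testwebapp/ or the keyword fri, save it in another array. But when I use the following command to split the string holding my page content, I get an error: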

arrayofwords_fromwebpage = (webcontent_saved_in_a_string).split(" ")

The error is:

TypeError: a bytes-like object is required, not 'str'

Is there a quick and efficient way to parse an HTML page and extract the links inside it without using any library (such as urllib, urllib2, or BeautifulSoup)?
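On the TypeError itself: Python 3 raises this message when a bytes object is split with a str separator, so the page content was most likely bytes (for example, read from a socket or from a file opened in binary mode) rather than str. A minimal sketch of the two usual ways around it, where `raw_page` is a hypothetical stand-in for the real content and UTF-8 is an assumed encoding:

```python
# raw_page is a hypothetical stand-in for content read from a socket or
# from a file opened in binary mode; it is bytes, not str.
raw_page = b'<li><a href="/testwebapp/847945358/">Rennie Tach</a></li>'

try:
    raw_page.split(" ")          # str separator on a bytes object
except TypeError as exc:
    error_message = str(exc)     # the same TypeError as in the question

# Option 1: decode first (UTF-8 is an assumption), then use str methods.
words = raw_page.decode("utf-8").split(" ")

# Option 2: stay in bytes and pass a bytes separator instead.
byte_words = raw_page.split(b" ")
```

Decoding once, right after reading the page, is usually the simpler choice, because every later string operation (`find`, `split`, slicing) can then use ordinary str arguments.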


2017-09-17 Shahrooz Pooryousef

If all you need is to find every URL using nothing but plain Python, this function will help you:

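def search(html):
    """Return the value of every a href="..." attribute found in html."""
    HREF = 'a href="'
    res = []
    e = 0
    while True:
        s = html.find(HREF, e)
        if s == -1:              # no more links
            break
        s += len(HREF)           # first character of the URL
        e = html.find('"', s)    # closing quote of the href value
        if e == -1:              # unterminated attribute: stop scanning
            break
        res.append(html[s:e])
    return res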
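A short usage sketch for this kind of find-based scan, run on a fragment of the page from the question. The function is repeated so the example runs on its own; the names `sample`, `all_links`, and `wanted` are illustrative, and dropping the bare `/testwebapp/` Home link is an assumption based on the list of links the question asks for:

```python
def search(html):
    """Find-based scan for a href="..." values, repeated so this runs alone."""
    marker = 'a href="'
    found = []
    end = 0
    while True:
        start = html.find(marker, end)
        if start == -1:                  # no more links
            break
        start += len(marker)             # first character of the URL
        end = html.find('"', start)      # closing quote of the href value
        if end == -1:                    # unterminated attribute
            break
        found.append(html[start:end])
    return found


# A fragment of the page shown in the question.
sample = (
    '<p><a href="/testwebapp/">Home</a></p>'
    '<ul><li><a href="/testwebapp/847945358/">Rennie Tach</a></li>'
    '<li><a href="/testwebapp/570508160/fri/2/">next</a></li></ul>'
)

all_links = search(sample)

# The Home link is not in the question's wanted list, so it is filtered
# out here (an assumption about which links count as wanted).
wanted = [link for link in all_links if link != "/testwebapp/"]
print(wanted)
```

Note that a scan like this only matches the exact text `a href="`, so links written with single quotes, different spacing, or another attribute before `href` would be missed.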


2017-09-17 17:00:55 AndMar

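This is perfect, @AndMar. Thanks.

You can use something from the standard library, namely HTMLParser.

I subclassed it for your purpose so that it watches for 'a' tags. When the parser encounters one, it looks for the 'href' attribute and, if it is present, prints its value.

To run it, I instantiate the subclass and then pass its feed method the HTML you presented in your question.

You can see the results at the end of this answer.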

>>> from html.parser import HTMLParser
>>> class SharoozHTMLParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         if tag == 'a':
...             attrs = dict(attrs)
...             if 'href' in attrs:
...                 print(attrs['href'])
...
>>> parser = SharoozHTMLParser()
>>> parser.feed(open('temp.htm').read())
/testwebapp/ 
/testwebapp/847945358/ 
/testwebapp/848854776/ 
/testwebapp/850558104/ 
/testwebapp/851635068/ 
/testwebapp/570508160/fri/2/ 
/testwebapp/570508160/fri/2/ 
/testwebapp/570508160/fri/2/
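If the links are needed in the program rather than printed, the same idea can collect them in a list. This is a minimal sketch, not the answer's exact code: the class name `LinkCollector`, the `links` attribute, and the inline `page` string are illustrative assumptions.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# A fragment of the page shown in the question, held in a str.
page = (
    '<p><a href="/testwebapp/">Home</a></p>'
    '<ul><li><a href="/testwebapp/847945358/">Rennie Tach</a></li>'
    '<li><a href="/testwebapp/570508160/fri/2/">2</a></li></ul>'
)

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```

Because `html.parser` is part of the Python 3 standard library, this needs nothing installed beyond Python itself, and it copes with attribute order and quoting styles that a plain `find` scan would miss.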


2017-09-17 17:20:57

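Thanks @Bill Bell, it will definitely work and I will use it.

You're very welcome. Would you do me a favour and mark it as 'accepted'?

Sorry, I take that back. I didn't notice that you had already accepted another answer.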
