How do I correctly extract URLs from HTML code?

I have saved a website's HTML code in a .txt file on my computer. I would like to extract all the URLs from this text file using the following code:

def get_net_target(page): 
    start_link=page.find("href=") 
    start_quote=page.find('"',start_link) 
    end_quote=page.find('"',start_quote+1) 
    url=page[start_quote+1:end_quote] 
    return url 
my_file = open("test12.txt") 
page = my_file.read() 
print(get_net_target(page))

However, the script prints only the first URL and none of the other links. Why is that?

Source

2017-03-06 jakeT888

You need to implement a loop that iterates over all the URLs.

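print(get_net_target(page)) prints only the first URL found in page, so you need to call the function again and again, each time replacing page with the substring page[end_quote+1:], until no more URLs are found.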

To get you started: next_index stores the position where the last URL ended, and the loop then retrieves the URLs that follow:

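next_index = 0  # position in page from which the next URL search starts

def get_net_target(page):
    """Return the first double-quoted href value in page, or "" if none is left."""
    global next_index

    start_link = page.find("href=")
    if start_link == -1:  # no more URLs
        return ""
    start_quote = page.find('"', start_link)
    if start_quote == -1:  # href= without an opening quote: stop
        return ""
    end_quote = page.find('"', start_quote + 1)
    if end_quote == -1:  # opening quote is never closed: stop
        return ""
    next_index = end_quote  # remember where this URL ended
    url = page[start_quote + 1:end_quote]
    return url


with open("test12.txt") as my_file:
    page = my_file.read()

while True:
    url = get_net_target(page)
    if url == "":  # no more URLs (an empty href="" also ends the loop)
        break
    print(url)
    page = page[next_index:]  # continue with the rest of the page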
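
The same loop can also be written without the global variable, by passing a start offset to str.find and collecting the results in a list. This is only a sketch: the function name extract_hrefs and the sample string are hypothetical, and like the code above it assumes every href value is wrapped in double quotes.

```python
def extract_hrefs(page):
    """Collect every double-quoted href value in page, in document order."""
    urls = []
    pos = 0  # offset from which the next search starts
    while True:
        start_link = page.find("href=", pos)
        if start_link == -1:  # no more href attributes
            break
        start_quote = page.find('"', start_link)
        if start_quote == -1:  # no opening quote after href=
            break
        end_quote = page.find('"', start_quote + 1)
        if end_quote == -1:  # opening quote is never closed
            break
        urls.append(page[start_quote + 1:end_quote])
        pos = end_quote + 1  # resume just past the closing quote
    return urls


# Hypothetical sample input, used only for illustration.
sample = '<a href="https://example.com/a">A</a> <a href="/b">B</a>'
print(extract_hrefs(sample))
```

Because the offset is a local variable, the function can be called on several pages without one call affecting the next, and the original string never has to be sliced.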

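Also be careful: you only retrieve links that are enclosed in double quotes ("), but they can also be enclosed in single quotes (') or not be quoted at all...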
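
One way to cover all three quoting styles is to let an HTML parser read the attributes instead of searching for quote characters by hand. The sketch below uses HTMLParser from Python's standard library; the class name HrefCollector and the sample markup are hypothetical, and this is a swapped-in technique rather than the approach from the answer's own code.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every start tag that has one."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed.
        for name, value in attrs:
            if name == "href" and value:
                self.urls.append(value)


# Hypothetical sample covering double-quoted, single-quoted and unquoted values.
sample = """<a href="https://example.com/a">A</a>
<a href='/b'>B</a>
<a href=/c>C</a>"""

collector = HrefCollector()
collector.feed(sample)
print(collector.urls)
```

To run it on the saved file, read test12.txt into a string as before and pass that string to collector.feed in place of sample.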

Source

2017-03-06 23:43:48 SegFault

Thanks for your reply! I'm new to Python. Could you give an example of how to implement this? That would be very helpful. – jakeT888

I've updated the answer with example code, built on your own starting code, to help you understand the algorithm. – SegFault
