How do I find and extract links from a web page?

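I have a website, e.g. http://site.com. How can I find and extract links from one of its pages?

I want to fetch the home page and extract only the links that match a regular expression, e.g. .*somepage.*

The links in the HTML code can have any of these formats: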

<a href="http://site.com/my-somepage">url</a> 
<a href="/my-somepage.html">url</a> 
<a href="my-somepage.htm">url</a>

I need the output in this format:

http://site.com/my-somepage 
http://site.com/my-somepage.html 
http://site.com/my-somepage.htm

The output URLs must always include the domain name.

What is a quick Python solution?

2013-03-19 Alex

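What have you tried that didn't work? StackOverflow is not a code-writing service, but if you attempt the problem first, we will help you. – 2013-03-19 04:15:54

Take a look at an HTML parsing module such as BeautifulSoup. (I would post a link, but I'm on my phone, sorry.) – TerryA 2013-03-19 04:24:20

You can use lxml.html: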

from lxml import html

url = "http://site.com"
doc = html.parse(url).getroot()  # download & parse webpage
doc.make_links_absolute(url)
for element, attribute, link, _ in doc.iterlinks():
    if (attribute == 'href' and element.tag == 'a' and
            'somepage' in link):  # or e.g., re.search('somepage', link)
        print(link)

Or use beautifulsoup4:

import re
try:  # Python 2
    from urllib2 import urlopen
    from urlparse import urljoin
except ImportError:  # Python 3
    from urllib.parse import urljoin
    from urllib.request import urlopen

from bs4 import BeautifulSoup, SoupStrainer  # pip install beautifulsoup4

url = "http://site.com"
# parse only the <a> tags whose href matches the pattern
only_links = SoupStrainer('a', href=re.compile('somepage'))
# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(urlopen(url), "html.parser", parse_only=only_links)
# urljoin resolves relative hrefs against the page URL
urls = [urljoin(url, a['href']) for a in soup(only_links)]
print("\n".join(urls))
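
If neither lxml nor beautifulsoup4 can be installed, the same idea can be sketched with only the Python 3 standard library. This is a minimal illustration, not part of either library: the names LinkCollector and extract_links are made up for this example, and the sample HTML is taken from the question (plus one non-matching link) instead of being downloaded.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect absolute URLs of <a href> tags matching a pattern."""

    def __init__(self, base_url, pattern):
        super().__init__()
        self.base_url = base_url
        self.pattern = re.compile(pattern)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.pattern.search(href):
            # urljoin resolves relative hrefs against the page URL
            self.links.append(urljoin(self.base_url, href))


def extract_links(page_html, base_url, pattern):
    """Return matching links from page_html as absolute URLs, in document order."""
    parser = LinkCollector(base_url, pattern)
    parser.feed(page_html)
    return parser.links


sample = '''<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>
<a href="/other.html">url</a>'''

print("\n".join(extract_links(sample, "http://site.com", "somepage")))
```

To run it against a live page, the HTML string could be read with urllib.request.urlopen(url).read().decode(...) and passed in place of sample; the right encoding depends on the site.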

2013-03-19 07:26:48 jfs

Use an HTML parsing module such as BeautifulSoup.
Some code (only a partial solution):

from bs4 import BeautifulSoup
import re

html = '''<a href="http://site.com/my-somepage">url</a>
<a href="/my-somepage.html">url</a>
<a href="my-somepage.htm">url</a>'''
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
# find every <a> tag whose href matches the regular expression
links = soup.find_all('a', {'href': re.compile('.*somepage.*')})
for link in links:
    print(link['href'])

Output:

http://site.com/my-somepage 
/my-somepage.html 
my-somepage.htm

You should be able to get the format you need from this data...
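
One possible way to take that last step is urljoin from the standard library, which resolves each href against the address of the page it came from. A minimal sketch, assuming the page is http://site.com as in the question (base_url, hrefs and absolute are names chosen for this example):

```python
# Python 3; on Python 2 the same function lives in the urlparse module
from urllib.parse import urljoin

base_url = "http://site.com"  # assumed address of the page the links came from
hrefs = ["http://site.com/my-somepage", "/my-somepage.html", "my-somepage.htm"]

# absolute hrefs pass through unchanged, relative ones get the scheme and domain
absolute = [urljoin(base_url, href) for href in hrefs]
print("\n".join(absolute))
```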

2013-03-19 07:06:18 pradyunsg

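Scrapy is the easiest way to do what you want. It actually has a link-extraction mechanism built in.

Let me know if you need help writing a spider to crawl the links.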

See also:

How do I use the Python Scrapy module to list all the URLs from my website?

2013-03-19 07:30:12 alecxe
