2017-04-12 72 views

How do I extract the HTML href attribute?

I am scraping this page: https://en.wikipedia.org/wiki/Lev_Pavlovich_Rapoport

This is how far I have got:

>>> s2.title.string 
u'Lev Pavlovich Rapoport - Wikipedia' 
>>> s2.a 
<a id="top"></a> 
>>> a2=s2.find_all("a") 

Here are just the first few lines of the output:

[<a id="top"></a>, <a href="#mw-head">navigation</a>, <a href="#p-search">search</a>, <a class="image" href="/wiki/File:LPRapoport1.jpg"><img alt="" class="thumbimage" data-file-height="374" data-file-width="295" height="279" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/bb/LPRapoport1.jpg/220px-LPRapoport1.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/b/bb/LPRapoport1.jpg 1.5x" width="220"/></a>, <a class="internal" href="/wiki/File:LPRapoport1.jpg" title="Enlarge"></a>, <a href="/wiki/Russian_language" title="Russian language">Russian language</a>, <a href="#Early_work"><span class="tocnumber">1</span> <span class="toctext">Early work</span></a>, <a href="#Further_work"><span class="tocnumber">2</span> <span class="toctext">Further work</span></a>, <a href="#Co-workers"><span class="tocnumber">3</span> <span class="toctext">Co-workers</span></a>, <a href="#Recognition"><span class="tocnumber">4</span> <span class="toctext">Recognition</span></a>, <a href="#External_links"><span class="tocnumber">5</span> <span class="toctext">External links</span></a>, <a href= 

Now the next step would be to pull out the href attributes, but how?

Linux? Are you parsing HTML with bash commands? –

Is this a Python shell? – Quentin

@ÁlvaroGonzález Yes, and no, I'm in a Python shell. –

Answers


If you are using BeautifulSoup, as I think you are, you can do this:

# href=True keeps only the <a> tags that actually carry an href attribute
for a in s2.find_all('a', href=True):
    print("Found the URL: %s" % a['href'])
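
Most of the href values in the question's output are relative (`/wiki/Russian_language`) or in-page anchors (`#mw-head`), so a likely follow-up is turning them into full URLs. The sketch below is one way to do that with the standard library's `urljoin`. It assumes Python 3 (in Python 2 the import would be `from urlparse import urljoin`), and the names `BASE_URL`, `absolutize` and `sample_hrefs` are made up for illustration. With BeautifulSoup, the input list could be built as `[a['href'] for a in s2.find_all('a', href=True)]`.

```python
from urllib.parse import urljoin

# The page being scraped in the question; relative links are resolved against it.
BASE_URL = "https://en.wikipedia.org/wiki/Lev_Pavlovich_Rapoport"


def absolutize(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url, skipping in-page '#...' anchors."""
    return [urljoin(base_url, h) for h in hrefs if not h.startswith("#")]


# A few href values copied from the output shown in the question.
sample_hrefs = ["#mw-head", "/wiki/File:LPRapoport1.jpg", "/wiki/Russian_language"]

links = absolutize(sample_hrefs)
for link in links:
    print(link)
```

Dropping the `#` anchors is a choice made here for the example; keep them if links within the page matter to you.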