2013-04-04 71 views
1

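Scraping a wiki page with Beautiful Soup

I'm trying to scrape some text from a wiki page, specifically this one. I'm using BeautifulSoup, or at least trying to... I have no real experience with web scraping. Here is my code so far: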

import urllib 
import urllib.request 
from bs4 import BeautifulSoup 

soup =BeautifulSoup(urllib.request.urlopen('http://yugioh.wikia.com/wiki/Card_Tips:Blue-Eyes_White_Dragon').read()) 

for row in soup('span', {'class' : 'mw-headline'})[0].tbody('tr'): 
     tds = row('td') 
     print(tds[0].string, tds[1].string, tds[2].string) 

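I'm just trying to get each header (Searchable by, Special Summoned from the hand by, etc.) and then every card listed under each category. Can anyone give me some advice?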

Answers

2

If you inspect the HTML code, you will find:

<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"> 
... 
<h3> 
    <span class="mw-headline" id="Searchable_by"> 
    Searchable by 
    </span> 
... 
</h3> 
<ul> 
    <li> 
    " 
    <a href="/wiki/Summoner%27s_Art" title="Summoner's Art"> 
    Summoner's Art 
    </a> 
    " 
    </li> 
    <li> 
    " 
    <a href="/wiki/The_White_Stone_of_Legend" title="The White Stone of Legend"> 
    The White Stone of Legend 
    </a> 
    " 
    ... 
    </li> 
</ul> 
... 
</div>

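The snippet above shows that:

  • A div with id="mw-content-text" contains the wiki content.
  • Each title is in the first (and only) span of an h3 tag.
  • A ul tag contains the bulleted list.

So, in Python code: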

from bs4 import BeautifulSoup

# The web page was saved locally as stack.htm
with open('stack.htm', 'rb') as f:
    soup = BeautifulSoup(f, 'html.parser')  # name the parser explicitly
main_tag = soup.find('div', {'id': 'mw-content-text'})

headers = main_tag.find_all('h3')
ui_list = main_tag.find_all('ul')
# Pair the nth h3 with the nth ul; zip stops at the shorter list
for header, ul in zip(headers, ui_list):
    print(header.span.get_text())
    print('\n -'.join(ul.get_text().split('\n')))
# The same pairing as an iterator of (title, items) tuples
sections = zip((x.span.get_text() for x in headers), ('\n -'.join(x.get_text().split('\n')) for x in ui_list))
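
Pairing the nth h3 with the nth ul by position only works if the page has exactly one ul per h3, in the same order. Below is a minimal sketch of a variant that looks up each heading's own list with find_next_sibling instead. The inline sample markup and the helper name pair_sections are assumptions made for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: an extra <ul> (e.g. a table of contents) precedes the first <h3>
SAMPLE_HTML = """
<div id="mw-content-text">
  <ul><li>Contents</li></ul>
  <h3><span class="mw-headline" id="Searchable_by">Searchable by</span></h3>
  <ul>
    <li><a href="/wiki/Summoner%27s_Art">Summoner's Art</a></li>
    <li><a href="/wiki/Deep_Diver">Deep Diver</a></li>
  </ul>
  <h3><span class="mw-headline">Special Summoned from the hand by</span></h3>
  <ul>
    <li><a href="/wiki/Kaibaman">Kaibaman</a></li>
  </ul>
</div>
"""


def pair_sections(html):
    """Return (title, [item, ...]) pairs, one per <h3> that has a later <ul> sibling."""
    soup = BeautifulSoup(html, 'html.parser')
    main_tag = soup.find('div', {'id': 'mw-content-text'})
    pairs = []
    for h3 in main_tag.find_all('h3'):
        ul = h3.find_next_sibling('ul')  # first <ul> sibling after this heading
        if ul is None:
            continue  # no list follows this heading
        items = [li.get_text(strip=True) for li in ul.find_all('li')]
        pairs.append((h3.get_text(strip=True), items))
    return pairs


for title, items in pair_sections(SAMPLE_HTML):
    print(title)
    for item in items:
        print(' -', item)
```

With this sample, the leading "Contents" list is skipped because it does not come after any h3. Note that a heading with no list of its own would still pick up the next heading's list, so the real page should be checked before relying on this.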

The OP is using Python 3... – 2013-04-04 15:26:56

1

You want to find the <ul> element that follows each headline, then list the links under it to get the cards:

for headline in soup('span', {'class' : 'mw-headline'}): 
    print(headline.text) 
    links = headline.find_next('ul').find_all('a') 
    for link in links: 
        print('*', link.text)

It prints:

Searchable by 
* Summoner's Art 
* The White Stone of Legend 
* Deep Diver 
Special Summoned from the hand by 
* Ancient Rules 
* Red-Eyes Darkness Metal Dragon 
* King Dragun 
* Kaibaman 
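
If the goal is to keep the results rather than print them, the same find_next lookup can fill a dictionary. This is a sketch under assumptions: the function name cards_by_section and the inline sample markup are made up for illustration, and on the real page some headlines may be followed by a ul that is unrelated to them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mirroring the h3 > span.mw-headline + ul structure
SAMPLE = """
<h3><span class="mw-headline">Searchable by</span></h3>
<ul>
  <li><a href="/wiki/Summoner%27s_Art">Summoner's Art</a></li>
  <li><a href="/wiki/Deep_Diver">Deep Diver</a></li>
</ul>
<h3><span class="mw-headline">Special Summoned from the hand by</span></h3>
<ul>
  <li><a href="/wiki/Ancient_Rules">Ancient Rules</a></li>
</ul>
"""


def cards_by_section(soup):
    """Map each headline's text to the link texts in the next <ul> after it."""
    sections = {}
    for headline in soup.find_all('span', class_='mw-headline'):
        ul = headline.find_next('ul')  # next <ul> in document order
        if ul is None:
            continue  # headline with no list after it
        sections[headline.get_text(strip=True)] = [
            a.get_text(strip=True) for a in ul.find_all('a')
        ]
    return sections


sections = cards_by_section(BeautifulSoup(SAMPLE, 'html.parser'))
for title, cards in sections.items():
    print(title, cards)
```

The dictionary keeps headlines in page order, so looping over it reproduces the printed listing while leaving the data available for further processing.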


This is perfect! Thank you. – user1985351 2013-04-04 20:13:07