2009-06-20 75 views

Answers

4

If you mean "I just want to get the wikitext", then look at the wikipedia.Page class and its get method.

import wikipedia 

site = wikipedia.getSite('en', 'wikipedia') 
page = wikipedia.Page(site, 'Test') 

print page.get() # '''Test''', '''TEST''' or '''Tester''' may refer to: 
#==Science and technology== 
#* [[Concept inventory]] - an assessment to reveal student thinking on a topic. 
# ... 

That gets you the complete, raw wikitext of the article.

If you want to strip out the wiki syntax, for example turn [[Concept inventory]] into Concept inventory, it is going to be a bit more painful.

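The main reason for the trouble is that the MediaWiki wiki syntax has no defined grammar, which makes it really hard to parse and to strip. I currently know of no software that lets you do this accurately. There is the MediaWiki Parser class, of course, but it is PHP, a bit hard to grasp, and built for a very different purpose.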

But if you only want to strip out links, or other very simple wiki constructs, use regular expressions:

import re

# Plain links: [[target]] -> target (raw strings keep the backslashes literal)
text = re.sub(r'\[\[([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum dolor sit amet, consectetur adipiscing elit.

and then for piped links:

# Piped links: [[target|label]] -> label
text = re.sub(r'\[\[(?:[^\]\|]*)\|([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.

and so on.
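The two substitutions above can also be folded into one pass. This is only a sketch: the name strip_links is made up for illustration, and it handles nothing beyond simple [[target]] and [[target|label]] links.

```python
import re

# Group 1 is the link target, group 2 the optional label after the pipe.
# Brackets are excluded from both groups, so nested links are left alone.
WIKILINK = re.compile(r'\[\[([^\[\]|]*)(?:\|([^\[\]|]*))?\]\]')


def strip_links(text):
    """Replace each simple wikilink with its visible text (hypothetical helper)."""
    def visible(match):
        label = match.group(2)
        return label if label is not None else match.group(1)
    return WIKILINK.sub(visible, text)


print(strip_links('Lorem [[ipsum]] dolor [[sit|SIT]] amet.'))
# Lorem ipsum dolor SIT amet.
```

Anything with more than one pipe, such as an image link with thumb|right|caption, does not match and is left untouched, which is probably what you want for a first pass.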

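But there is, for example, no reliable and easy way to strip nested templates from a page. The same goes for images whose captions contain links. It is quite hard: you have to recursively remove the innermost link, replace it with a marker, and start over. Have a look at the templateWithParams function in wikipedia.py if you want, but it is not pretty.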
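To give an idea of the "innermost first, then start over" approach, here is a rough sketch for nested templates. It is not what templateWithParams does; the name strip_templates is made up, and it deletes each innermost template outright instead of replacing it with a marker. It also assumes that braces only ever appear as template delimiters, which real pages (parameters like {{{1}}}, nowiki sections, math) will violate.

```python
import re

# An innermost template: {{ ... }} with no braces of any kind inside it.
INNERMOST_TEMPLATE = re.compile(r'\{\{[^{}]*\}\}')


def strip_templates(text):
    """Delete innermost templates repeatedly until none are left (hypothetical helper)."""
    while True:
        text, count = INNERMOST_TEMPLATE.subn('', text)
        if count == 0:
            return text


print(strip_templates('a {{outer|{{inner|x}}|y}} b'))
# a  b
```

Each pass peels off one nesting level, so the loop runs once per level plus a final pass that finds nothing.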

+0

Apparently I misunderstood the scope of the question. Given that there were no other answers, I gave it my best shot. :-) – cdleary 2009-06-21 20:10:42

0

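There is a module called mwparserfromhell that can get you very close to what you want, depending on what you need. It has a method called strip_code() that strips a lot of the markup.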

import pywikibot 
import mwparserfromhell 

test_wikipedia = pywikibot.Site('en', 'test') 
text = pywikibot.Page(test_wikipedia, 'Lestat_de_Lioncourt').get() 

full = mwparserfromhell.parse(text) 
stripped = full.strip_code() 

print(full)
print('*******************')
print(stripped)

Compare the two snippets:

{{db-foreign}} 
<!-- Commented out because image was deleted: [[Image:lestat_tom_cruise.jpg|thumb|right|[[Tom Cruise]] as Lestat in the film ''[[Interview With The Vampire: The Vampire Chronicles]]''|{{deletable image-caption|1=Friday, 11 April 2008}}]] --> 

[[Image:lestat.jpg|thumb|right|[[Stuart Townsend]] as Lestat in the film ''[[Queen of the Damned (film)|Queen of the Damned]]'']] 

[[Image:Lestat IWTV.jpg|thumb|right|[[Tom Cruise]] as Lestat in the 1994 film ''[[Interview with the Vampire (film)|Interview with the Vampire]]'']] 

'''Lestat de Lioncourt''' is a [[fictional character]] appearing in several [[novel]]s by [[Anne Rice]], including ''[[The Vampire Lestat]]''. He is a [[vampire]] and the main character in the majority of ''[[The Vampire Chronicles]]'', narrated in first person. 

==Publication history== 
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''[[The Vampire Lestat]]'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 


******************* 

thumb|right|Stuart Townsend as Lestat in the film ''Queen of the Damned'' 

'''Lestat de Lioncourt''' is a fictional character appearing in several novels by Anne Rice, including ''The Vampire Lestat''. He is a vampire and the main character in the majority of ''The Vampire Chronicles'', narrated in first person. 

Publication history 
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''The Vampire Lestat'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 