2010-08-26 65 views
2

Wikimedia page to text in Python

I want to convert Wikipedia content that I fetch through the API into plain text.

Any tips?

+0

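I wrote about this problem on my blog once: [link to "The abomination of MediaWiki templates"](http://hewgill.com/journal/條目/ 343最憎惡-的-的mediawiki-模板). Summary: I found no grammar or code that parses MediaWiki template syntax *except* a full installation of MediaWiki itself. – 2010-08-26 20:32:03

Answers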

1

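There are some Python MediaWiki markup parsers/renderers around, and from the HTML they produce you can probably convert to plain text in whatever style you need. I'm not sure how well that would work in practice, though.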
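As a rough illustration of the HTML-to-plain-text step mentioned above, here is a minimal sketch that uses only the standard library's `html.parser`. The class name `TextExtractor`, the function `html_to_text`, and the choice of which tags start a new line are my own assumptions, not part of any MediaWiki tool, and the input is assumed to be HTML you already rendered elsewhere.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}
    # Assumption: these tags are treated as line breaks in the output.
    BLOCKS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.BLOCKS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Hello <b>world</b> &amp; all</p>"))
```

This keeps only text nodes, so tables, infoboxes, and navigation boxes come out as loose lines of text; whether that is acceptable depends on the style of plain text you are after.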

0

I wrote this a few days ago to clone a MediaWiki site:

import os
# Import the three modules used below explicitly instead of relying on "import *".
from mediawikitools import api, page, wiki

def list_all_pages(site):
    # Ask the API for the list of all pages on the wiki.
    params = {'action': 'query', 'list': 'allpages', 'aplimit': '500'}
    query_results = api.APIRequest(site, params).query()
    return query_results['query']['allpages']

def clone(site):
    # Use the first 20 characters of the site name as the output directory.
    root = site.siteinfo['sitename'][:20]
    if not os.path.exists(root):
        print('Make Dir ' + root)
        os.makedirs(root)
    index = open(root + '/' + 'Index', 'w')

    for test_page in list_all_pages(site):
        title = test_page['title']
        path = root + '/' + title + '.wiki'
        # Titles containing '/' (subpages) need their parent directories first.
        parent = os.path.dirname(path)
        if not os.path.exists(parent):
            os.makedirs(parent)
        page_file = open(path, 'w')
        try:
            # One saved path per line in the index file.
            index.write(path + '\n')
            wiki_file = page.Page(site, title)
            print(path)
            page_file.write(wiki_file.getWikiText())
        except (KeyError, UnicodeEncodeError) as e:
            print(e)
        finally:
            page_file.close()
    index.close()

if __name__ == '__main__':
    site = wiki.Wiki("http://localhost/wiki/api.php")
    site.setUserAgent('Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1')
    # If you need a username and password to access the wiki, log in first:
    # site.login(username, password, force=True)
    print(site.siteinfo['sitename'])
    clone(site)
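
The script above saves raw wikitext, not plain text. As a very rough sketch of the remaining step, a few regular expressions can strip the most common markup. This is my own approximation, not something from the answer's library: the function name `strip_wikitext` is hypothetical, templates are simply dropped rather than expanded (which, as the comment under the question points out, really needs MediaWiki itself), and tables and parser functions are not handled at all.

```python
import re


def strip_wikitext(text):
    """Very rough wikitext-to-text pass; templates are dropped, not expanded."""
    # Remove HTML comments and <ref> footnotes.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
    text = re.sub(r"<ref\b[^>/]*/>", "", text)
    text = re.sub(r"<ref\b[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Drop innermost {{...}} templates repeatedly so nested ones disappear too.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # [http://example.org label] -> label.
    text = re.sub(r"\[https?://[^\s\]]+\s+([^\]]+)\]", r"\1", text)
    # Strip the quote runs used for '''bold''' and ''italic''.
    text = re.sub(r"'{2,}", "", text)
    # == Heading == -> Heading.
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    return text.strip()


if __name__ == "__main__":
    sample = "== History ==\n'''Python''' is a [[programming language|language]]."
    print(strip_wikitext(sample))
```

Expect this to be lossy on real articles: anything generated by a template (infoboxes, citations, unit conversions) vanishes, so treat it as a starting point only.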