How to index web pages with Xapian

I am using Ubuntu 12.04 and Python 2.7. Given a URL, how do I index the web page it returns using Xapian?

This is my code for fetching the content of a given URL:

import urllib

def get_page(url):
    '''Gets the contents of a page from a given URL'''
    try:
        f = urllib.urlopen(url)  # Python 2.7 API
        page = f.read()
        f.close()
        return page
    except IOError:
        # Network or HTTP failure: treat the page as empty
        return ""

To filter the content of the page returned by get_page(url):

import re

def filterContents(content):
    '''Filters the content from a page'''
    filteredContent = ''
    # Capture text from a '>' (not one that closes a script tag) up to the next '<'
    regex = re.compile(r'(?<!script)[>](?![\s\#\'-<]).+?[<]')
    for words in regex.findall(content):
        # split_string returns a string with each delimiter replaced by a space
        filteredContent += split_string(words, """ ,"!-.()<>[]{};:?!-=/_`&""")
    return filteredContent

def split_string(source, splitlist):
    '''Replaces every character of source found in splitlist with a space'''
    return ''.join([w if w not in splitlist else ' ' for w in source])

How do I index filteredContent with Xapian so that, when I run a query, I get back the URLs of the pages in which the query terms appear?

2013-04-20 VeilEclipse

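It's not entirely clear to me what your filterContents() and split_string() are actually doing (throwing away some HTML tag content and then splitting words), so let me talk about a similar problem that doesn't have that complexity folded into it.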

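Let's assume we have a function strip_tags() that returns just the text content of an HTML document, along with your get_page() function. We want to build a Xapian database where:

each document represents the content retrieved from one particular URL
the "words" in that content (after it has passed through strip_tags()) become the searchable terms that index the document
each document stores the URL it was retrieved from as its document data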
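The answer assumes strip_tags() exists but does not define it. As a rough sketch only, one way to write it with the standard library's HTMLParser is shown below. The class name _TextExtractor and the decision to skip script and style bodies are my own assumptions, not part of the original answer, and a real crawler would probably want a more forgiving HTML library.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <script> and <style>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes and collapse runs of whitespace
        return " ".join(" ".join(self._chunks).split())


def strip_tags(html):
    """Hypothetical helper: return only the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because text nodes are joined with a space, a word that is split by inline markup will be indexed as two words; that seemed an acceptable trade-off for a sketch.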

So you can index as follows:

import xapian 
def index_url(database, url): 
    text = strip_tags(get_page(url)) 
    doc = xapian.Document() 

    # TermGenerator will split text into words 
    # and then (because we set a stemmer) stem them 
    # into terms and add them to the document 
    termgenerator = xapian.TermGenerator() 
    termgenerator.set_stemmer(xapian.Stem("en")) 
    termgenerator.set_document(doc) 
    termgenerator.index_text(text) 

    # We want to be able to get at the URL easily 
    doc.set_data(url) 
    # And we want to ensure each URL only ends up in 
    # the database once. Note that if your URLs are long 
    # then this won't work; consult the FAQ on unique IDs 
    # for more: http://trac.xapian.org/wiki/FAQ/UniqueIds 
    idterm = 'Q' + url 
    doc.add_boolean_term(idterm) 
    database.replace_document(idterm, doc)

# then index an example URL 
db = xapian.WritableDatabase("exampledb", xapian.DB_CREATE_OR_OPEN) 

index_url(db, "https://stackoverflow.com/")
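The comment in the indexing code warns that long URLs will not work as ID terms, because Xapian's disk backends cap term length. As a hedged sketch of one workaround, the helper below falls back to a hash when the term would be too long. The name make_id_term and the 240-byte threshold are my assumptions (the exact limit depends on the backend), and this is not necessarily the scheme the FAQ recommends.

```python
import hashlib

# Assumed conservative limit; check the Xapian docs for your backend.
MAX_TERM_BYTES = 240


def make_id_term(url, prefix="Q", max_bytes=MAX_TERM_BYTES):
    """Hypothetical helper: build a unique-ID term that stays under the limit."""
    raw = url if isinstance(url, bytes) else url.encode("utf-8")
    if len(prefix) + len(raw) <= max_bytes:
        # Short enough: use the URL itself, as in the code above
        return prefix + url
    # Too long: use a fixed-length SHA-1 hex digest (40 characters) instead
    return prefix + hashlib.sha1(raw).hexdigest()
```

In index_url() you would then write idterm = make_id_term(url) in place of 'Q' + url. The full URL is still stored in the document data, so nothing is lost for display.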

Searching is then straightforward, although it can obviously be made more sophisticated if needed:

qp = xapian.QueryParser() 
qp.set_stemmer(xapian.Stem("en")) 
qp.set_stemming_strategy(qp.STEM_SOME) 
# Lowercase "and" is treated as an ordinary word, not as a boolean operator
query = qp.parse_query('question and answer') 
enquire = xapian.Enquire(db) 
enquire.set_query(query) 
for match in enquire.get_mset(0, 10): 
    print(match.document.get_data())

This will display 'https://stackoverflow.com/', because "question and answer" appears on the homepage when you aren't logged in.

I recommend looking at the Xapian getting started guide for both the concepts and the code.

2013-04-22 13:28:21

Thank you for your time and help. How can I display the page content as well as the URL? – VeilEclipse 2013-04-24 09:32:58

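Get to grips with Xapian's concepts. For example, you can put anything you want in the document data; the right approach depends on your situation and what you're doing, so I can't give specific advice. – 2013-04-25 14:35:44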
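To make the "anything you want in the document data" point concrete, here is one possible sketch, assuming you want to show a short text sample next to each URL. It stores a small JSON object instead of the bare URL. The helper names pack_data and unpack_data and the 500-character cut-off are purely illustrative, and other layouts may suit your case better.

```python
import json


def pack_data(url, text, max_chars=500):
    """Hypothetical helper: serialise the URL plus a text sample."""
    return json.dumps({"url": url, "sample": text[:max_chars]})


def unpack_data(data):
    """Hypothetical helper: decode what pack_data() produced."""
    if isinstance(data, bytes):
        # get_data() may hand back bytes, depending on the bindings
        data = data.decode("utf-8")
    return json.loads(data)


# When indexing, instead of doc.set_data(url):
#     doc.set_data(pack_data(url, text))
# When searching:
#     fields = unpack_data(match.document.get_data())
#     then display fields["url"] and fields["sample"]
```

Storing only a sample keeps the database small; if you need the whole page, you could store all of the text or fetch the URL again at display time.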
