Grouping related search keywords

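I have a log file containing the search queries entered into my website's search engine. I'd like to "group" related search queries together for a report. I use Python for most of my web apps, so the solution can be Python-based, or I can load the strings into Postgres if this is easier to do with SQL.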

Example data:

dog food 
good dog trainer 
cat food 
veterinarian

The groups should include:

Cat:

cat food

Dog:

dog food
good dog trainer

Food:

dog food
cat food

And so on.

Any ideas? Some sort of "indexing algorithm", perhaps?

Source

2010-02-16 erikcw

I'm not sure I understand. Can you explain how you plan to decide which words are related? Or is that the question? – 2010-02-16 20:07:35

# Read every query line from the log file.
with open('data.txt') as f:
    raw = f.readlines()

# Generate the set of all possible groupings: every distinct word.
groups = set()
for line in raw:
    groups.update(line.split())

# Parse the input into groups.
for group in groups:
    print("Group '%s':" % group)
    for line in raw:
        # Substring test, so 'dog' would also match a line containing 'dogs'.
        if group in line:
            print(line.strip())
    print()

# Consider storing into a dictionary instead of just printing.

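This could be optimized a great deal, but it prints the following result, assuming you put the raw data in an external text file (the order of the groups may vary, since a set is unordered):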

Group 'trainer': 
good dog trainer 

Group 'good': 
good dog trainer 

Group 'food': 
dog food 
cat food 

Group 'dog': 
dog food 
good dog trainer 

Group 'cat': 
cat food 

Group 'veterinarian': 
veterinarian

Source

2010-02-16 20:29:18 swanson

Here is a modified version of your answer: http://stackoverflow.com/questions/2275901/grouping-related-search-keywords/2277710#2277710 – jfs 2010-02-17 01:24:14

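Not a specific algorithm, but what you're looking for is basically an index, built from the words found in your lines of text.

So you need some kind of parser to recognize the words, then put them into an index structure and link each index entry to the line(s) where the word was found. Then, by retrieving an index entry, you have your "group".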
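As a rough illustration of that idea, here is a minimal sketch in Python. The function names `tokenize` and `build_index` are made up for the example, and the choice to lowercase each line and treat any run of word characters as a word is an assumption, not something the answer specifies.

```python
import re
from collections import defaultdict


def tokenize(line):
    """Hypothetical parser: lowercase the line and pull out runs of word characters."""
    return re.findall(r"\w+", line.lower())


def build_index(lines):
    """Map each word to the list of lines it was found in."""
    index = defaultdict(list)
    for line in lines:
        # set() so a word repeated within one line links to that line only once
        for word in set(tokenize(line)):
            index[word].append(line)
    return index


queries = ["dog food", "good dog trainer", "cat food", "veterinarian"]
index = build_index(queries)

# Retrieving an index entry gives the "group" for that word.
print(index["dog"])   # ['dog food', 'good dog trainer']
print(index["food"])  # ['dog food', 'cat food']
```

Because the parser splits on word boundaries, a query such as "dogbah" would end up under its own entry rather than under "dog".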

Source

2010-02-16 20:09:43 Lucero

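Well, it seems you just want to report every query that contains a given word. You can do this easily in plain SQL using the wildcard matching feature, i.e.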

SELECT * FROM queries WHERE querystring LIKE '%dog%';

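The only problem with the query above is that it also finds queries whose query string contains something like "dogbah". To avoid that, you need to write a couple of alternatives combined with OR to cater for the different cases, assuming your words are separated by whitespace.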
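To make the OR-combined alternatives concrete, here is a small sketch that runs them against an in-memory SQLite database from Python. SQLite stands in for Postgres only so that the example is self-contained. The table name `queries`, the column `querystring`, and the helper `queries_with_word` are assumptions for illustration, and the patterns assume words are separated by single spaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (querystring TEXT)")
conn.executemany(
    "INSERT INTO queries (querystring) VALUES (?)",
    [("dog food",), ("good dog trainer",), ("cat food",), ("dogbah",), ("dog",)],
)


def queries_with_word(conn, word):
    """Return query strings containing `word` as a whole space-separated word."""
    sql = """
        SELECT querystring FROM queries
        WHERE querystring = ?          -- the word is the whole query
           OR querystring LIKE ?       -- word at the start
           OR querystring LIKE ?       -- word at the end
           OR querystring LIKE ?       -- word in the middle
        ORDER BY querystring
    """
    params = (word, word + " %", "% " + word, "% " + word + " %")
    return [row[0] for row in conn.execute(sql, params)]


print(queries_with_word(conn, "dog"))  # ['dog', 'dog food', 'good dog trainer']
```

Note that "dogbah" is no longer matched. A search word containing `%` or `_` would still be interpreted as a wildcard by LIKE, so real input would need escaping.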

Source

2010-02-16 20:13:18

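Your algorithm needs the following parts (if you do it yourself):

A parser for the data that breaks it into lines and breaks the lines into words.
A data structure that holds key-value pairs (such as a hash table). The key is a word and the value is a dynamic array of lines (if you keep the parsed lines in memory, pointers or line numbers are enough).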

In pseudocode (generation):

create empty set S of name-value pairs
for each line L parsed
    for each word W in line L
        seek W in set S -> Item
        if not found -> add word W -> (empty array) to set S as Item
        add line L reference to array in Item
    endfor
endfor

(lookup(word: W)):

seek W in set S into Item 
if found return array from Item 
else return empty array.
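
The pseudocode above translates fairly directly into Python, where a `dict` plays the role of the hash table. This is only a sketch: the names `build_line_index` and `lookup` are made up for the example, and it stores line numbers rather than the lines themselves, which is one of the two options mentioned above.

```python
def build_line_index(lines):
    """Generation step: map each word to the numbers of the lines containing it."""
    index = {}
    for line_number, line in enumerate(lines):
        for word in line.split():
            # Seek the word; if it is not found, add it with an empty array.
            entry = index.setdefault(word, [])
            # Add the line reference, skipping repeats of a word within one line.
            if not entry or entry[-1] != line_number:
                entry.append(line_number)
    return index


def lookup(index, word):
    """Lookup step: return the array for the word, or an empty array if absent."""
    return index.get(word, [])


lines = ["dog food", "good dog trainer", "cat food", "veterinarian"]
index = build_line_index(lines)

for number in lookup(index, "food"):
    print(number, lines[number])
print(lookup(index, "bird"))
```

Using `index.get(word, [])` for the lookup avoids creating an entry for a word that was never indexed.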

Source

2010-02-16 20:30:25

A modified version of @swanson's answer (untested):

from collections import defaultdict
from itertools import chain

# Generate the set of all possible words.
with open('data.txt') as f:
    lines = f.read().splitlines()  # drops the trailing newlines
words = set(chain.from_iterable(line.split() for line in lines))

# Parse the input into groups: word -> lines containing it.
groups = defaultdict(list)
for line in lines:
    for word in words:
        if word in line:  # substring test, as in @swanson's answer
            groups[word].append(line)

Source

2010-02-17 01:23:35 jfs
