Finding the most frequently used sets of distinct terms

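Imagine a graph database made up of URLs and the tags used to describe them. From this we want to find out which sets of tags are used together most often, and determine which URLs belong to each identified set.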

I tried to reduce the problem to a simplified Cypher dataset like this:

CREATE
  (tech:Tag { name: "tech" }),
  (comp:Tag { name: "computers" }),
  (programming:Tag { name: "programming" }),
  (cat:Tag { name: "cats" }),
  (mice:Tag { name: "mice" }),
  (u1:Url { name: "http://u1.com" })-[:IS_ABOUT]->(tech),
  (u1)-[:IS_ABOUT]->(comp),
  (u1)-[:IS_ABOUT]->(mice),
  (u2:Url { name: "http://u2.com" })-[:IS_ABOUT]->(mice),
  (u2)-[:IS_ABOUT]->(cat),
  (u3:Url { name: "http://u3.com" })-[:IS_ABOUT]->(tech),
  (u3)-[:IS_ABOUT]->(programming),
  (u4:Url { name: "http://u4.com" })-[:IS_ABOUT]->(tech),
  (u4)-[:IS_ABOUT]->(mice),
  (u4)-[:IS_ABOUT]->(acc:Tag { name: "accessories" })

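Using this as a reference (neo4j console example here), we can look at it and see by eye that the most commonly used tags are tech and mice (the query for this is trivial), each referenced by 3 URLs. The most commonly used tag pair is [tech, mice], because in this example it is the only pair shared by 2 URLs (u4 and u1). Note that this pair is a subset of the tags on each matching URL, not their entire tag set. No combination of 3 tags is shared by more than one URL.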
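The counts described above can be cross-checked outside the database. The following is only a minimal Python sketch, not a Cypher solution: the `URL_TAGS` dictionary is a hand-copied mirror of the CREATE statement, and `combination_counts` is a hypothetical helper name introduced for illustration.

```python
from collections import Counter
from itertools import combinations

# Tag sets copied by hand from the CREATE statement in the question.
URL_TAGS = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}


def combination_counts(url_tags, size):
    """Count how many URLs carry each tag combination of the given size."""
    counts = Counter()
    for tags in url_tags.values():
        # Sorting gives every combination one canonical, order-independent form.
        counts.update(combinations(sorted(tags), size))
    return counts


singles = combination_counts(URL_TAGS, 1)
pairs = combination_counts(URL_TAGS, 2)
triples = combination_counts(URL_TAGS, 3)

print(singles[("tech",)], singles[("mice",)])  # URLs per single tag
print(pairs.most_common(1))                    # most shared pair
print(max(triples.values()))                   # best count for any triple
```

Because each URL contributes every subset of its tags of the requested size, the pair is found even though no URL is tagged with exactly those two tags.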

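How can I write a Cypher query that determines which combinations of tags are most frequently used together (as pairs or as groups of size N)? Is there perhaps a better way to structure this data that would make the analysis easier? Or is this problem simply not a good fit for a graph DB? I have been struggling to figure this out, so any help or ideas would be greatly appreciated!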

Source

2016-09-15 Chris Shorrock

This looks like a combinatorics problem.

// The tags for each URL, sorted by ID 
MATCH (U:Url)-[:IS_ABOUT]->(T:Tag) 
WITH U, T ORDER BY id(T) 
WITH U, 
    collect(distinct T) as TAGS 

// Calculate the number of tag subsets for a node (2^n), 
// independent of the order of the tags. 
// The power is computed from the exponent and logarithm; round() guards 
// against floating-point error (floor could turn 7.999... into 7). 
// 
WITH U, TAGS, 
    toInt(round(exp(log(2) * size(TAGS)))) as numberOfCombinations 

// Iterate through all non-empty combinations (bitmasks 1 .. 2^n - 1) 
UNWIND RANGE(1, numberOfCombinations - 1) as combinationIndex 
WITH U, TAGS, combinationIndex 

// And check for each tag whether it is present in the combination. 
// Bitwise operations are missing in Cypher, 
// therefore we use APOC 
// https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_bitwise_operations 
// 
UNWIND RANGE(0, size(TAGS)-1) as tagIndex 
WITH U, TAGS, combinationIndex, tagIndex, 
    toInt(round(exp(log(2) * tagIndex))) as pw2 
    call apoc.bitwise.op(combinationIndex, "&", pw2) YIELD value 
WITH U, TAGS, combinationIndex, tagIndex, 
    value WHERE value > 0 

// Get all combinations of tags for the URL 
WITH U, TAGS, combinationIndex, 
    collect(TAGS[tagIndex]) as combination 

// Return all the possible combinations of tags, sorted by frequency of use 
RETURN combination, count(combination) as freq, collect(U) as urls 
     ORDER BY freq DESC
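The bitmask idea in the query above may be easier to follow outside Cypher. Here is a rough Python sketch of the same enumeration, under the assumption that the data is the sample dataset from the question; `tag_combinations` and `urls_by_combination` are hypothetical names, and sorting by tag name stands in for ORDER BY id(T).

```python
from collections import defaultdict

# Sample data copied by hand from the question's CREATE statement.
URL_TAGS = {
    "http://u1.com": {"tech", "computers", "mice"},
    "http://u2.com": {"mice", "cats"},
    "http://u3.com": {"tech", "programming"},
    "http://u4.com": {"tech", "mice", "accessories"},
}


def tag_combinations(tags):
    """Yield every non-empty subset of tags, one bitmask per subset."""
    ordered = sorted(tags)  # fixed order so equal subsets compare equal
    for mask in range(1, 1 << len(ordered)):
        # Bit i of the mask decides whether ordered[i] is in the subset.
        yield tuple(tag for i, tag in enumerate(ordered) if mask & (1 << i))


def urls_by_combination(url_tags):
    """Map each tag combination to the URLs whose tags contain it."""
    index = defaultdict(list)
    for url, tags in sorted(url_tags.items()):
        for combo in tag_combinations(tags):
            index[combo].append(url)
    return index


index = urls_by_combination(URL_TAGS)
ranked = sorted(index.items(), key=lambda item: len(item[1]), reverse=True)
for combo, urls in ranked[:3]:
    print(combo, len(urls), urls)
```

A URL with n tags produces 2^n - 1 combinations, so this approach is only practical when URLs carry a small number of tags.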

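I think it is better to use this algorithm to compute and store the tag combinations at tagging time. The query would then look like this: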

MATCH (Comb:TagsCombination)<-[:IS_ABOUT]-(U:Url) 
WITH Comb, collect(U) as urls, count(U) as freq 
MATCH (Comb)-[:CONTAIN]->(T:Tag) 
RETURN Comb, collect(T) as Tags, urls, freq ORDER BY freq DESC
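What "compute and store at tagging time" could mean is sketched below in Python. This is an assumption-laden illustration rather than the answerer's implementation: `TagCombinationIndex` is a hypothetical in-memory stand-in for stored TagsCombination nodes, and it relies on the observation that adding a tag to a URL only creates combinations that contain the new tag.

```python
from collections import defaultdict
from itertools import combinations


class TagCombinationIndex:
    """Hypothetical in-memory stand-in for stored TagsCombination nodes."""

    def __init__(self):
        self.tags_by_url = defaultdict(set)
        self.urls_by_combination = defaultdict(set)

    def add_tag(self, url, tag):
        existing = self.tags_by_url[url]
        if tag in existing:
            return
        # Only combinations that include the new tag are new for this URL.
        others = sorted(existing)
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                combo = frozenset(subset + (tag,))
                self.urls_by_combination[combo].add(url)
        existing.add(tag)

    def most_frequent(self, size):
        """Return (combination, urls) pairs of the given size, most shared first."""
        ranked = [
            (combo, urls)
            for combo, urls in self.urls_by_combination.items()
            if len(combo) == size
        ]
        return sorted(ranked, key=lambda item: len(item[1]), reverse=True)


index = TagCombinationIndex()
for url, tag in [
    ("http://u1.com", "tech"), ("http://u1.com", "computers"),
    ("http://u1.com", "mice"), ("http://u2.com", "mice"),
    ("http://u2.com", "cats"), ("http://u3.com", "tech"),
    ("http://u3.com", "programming"), ("http://u4.com", "tech"),
    ("http://u4.com", "mice"), ("http://u4.com", "accessories"),
]:
    index.add_tag(url, tag)

top_combo, top_urls = index.most_frequent(2)[0]
print(sorted(top_combo), sorted(top_urls))
```

The trade-off is the usual one for precomputation: reads become a simple lookup, while each new tag on a URL that already has n tags adds 2^n stored combinations.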

Source

2016-09-16 10:24:19

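Great stuff. A really interesting approach, and as a graph newbie it would have taken me a long time to arrive at this solution without help. Much appreciated!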

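Starting from the URL nodes, build a tuple of tag.name values for each one (sort the tags first so that identical sets group together). This gives you every tag combination that actually occurs on a URL. Then use a filter to find how many URLs match each of those tag sets.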

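// Collect each URL's tag names, sorted so identical sets compare equal 
// (labels are case-sensitive, so use Url and Tag as in the dataset) 
MATCH (u:Url) - [:IS_ABOUT] -> (t:Tag) 
WITH u, t 
ORDER BY t.name 
WITH u, collect(t.name) AS tags 
WITH DISTINCT tags 
// Count the URLs that carry every tag in the set; 
// tags holds names, so match each Tag node by its name 
MATCH (u:Url) 
WHERE ALL(tagName IN tags WHERE (u) - [:IS_ABOUT] -> (:Tag { name: tagName })) 
RETURN tags, count(u)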

Source

2016-09-15 19:40:13

If you are only interested in term sets of a particular size, you can also filter 'tags' by size. For example, add 'WHERE size(tags) = 2' after the 'WITH DISTINCT tags' line.

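Thanks @tore-eschliman. While this did give me some good insights, I think its core problem is that it does not account for tag subsets. That is, if 'A' is tagged with '1,2,3' and 'B' with '2,3,4', it will not identify '2,3' as the most common pair. Perhaps this can serve as a starting point for thinking the analysis through further; I will play around with it. I have updated the example in the question to illustrate this better.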

Indeed, it only counts the discrete tag sets that already exist, not virtual subsets... but it is an interesting problem.
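The limitation discussed in these comments can be reproduced with the 'A' and 'B' example. The Python sketch below is illustrative only: `TAGGED`, `full_set_counts` and `pair_counts` are hypothetical names, and the first count models the approach that uses only complete existing tag sets as candidates.

```python
from collections import Counter
from itertools import combinations

# The example from the comment: A is tagged 1,2,3 and B is tagged 2,3,4.
TAGGED = {"A": {"1", "2", "3"}, "B": {"2", "3", "4"}}

# Candidates limited to complete tag sets that exist on some item.
candidates = {frozenset(tags) for tags in TAGGED.values()}
full_set_counts = {
    candidate: sum(candidate <= tags for tags in TAGGED.values())
    for candidate in candidates
}

# Candidates extended to every pair contained in an item's tags.
pair_counts = Counter(
    pair
    for tags in TAGGED.values()
    for pair in combinations(sorted(tags), 2)
)

print(full_set_counts)
print(pair_counts.most_common(1))
```

The pair '2,3' never appears among the complete-set candidates, so it can only be found once subsets are enumerated.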
