Aggregate and reduce nested documents and arrays

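EDIT: Our use case: we receive continuous reports about visitors from our servers. We pre-aggregate the data on the servers for a few seconds and then insert these "reports" into MongoDB.

In our dashboard, we want to query the different browsers, operating systems, and geolocations (country etc.) by time range.

So something like: in the last 7 days, 1000 visitors used Chrome, 500 came from Germany, 200 came from the UK, and so on.

I am struggling with the MongoDB query needed for our dashboard.

We have the following report entry: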

{ 
    "_id" : ObjectId("59b9d08e402025326e1a0f30"), 
    "channel_perm_id" : "c361049fb4144b0e81b71c0b6cfdc296", 
    "source_id" : "insomnia", 
    "start_timestamp" : ISODate("2017-09-14T00:42:54.510Z"), 
    "end_timestamp" : ISODate("2017-09-14T00:42:54.510Z"), 
    "timestamp" : ISODate("2017-09-14T00:42:54.510Z"), 
    "resource_uri" : "b755d62a-8c0a-4e8a-945f-41782c13535b", 
    "sources_info" : { 
     "browsers" : [ 
      { 
       "name" : "Chrome", 
       "count" : NumberLong(2) 
      } 
     ], 
     "operating_systems" : [ 
      { 
       "name" : "Mac OS X", 
       "count" : NumberLong(2) 
      } 
     ], 
     "continent_ids" : [ 
      { 
       "name" : "EU", 
       "count" : NumberLong(1) 
      } 
     ], 
     "country_ids" : [ 
      { 
       "name" : "DE", 
       "count" : NumberLong(1) 
      } 
     ], 
     "city_ids" : [ 
      { 
       "name" : "Solingen", 
       "count" : NumberLong(1) 
      } 
     ] 
    }, 
    "unique_sources" : NumberLong(1), 
    "requests" : NumberLong(1), 
    "cache_hits" : NumberLong(0), 
    "cache_misses" : NumberLong(1), 
    "cache_hit_size" : NumberLong(0), 
    "cache_refill_size" : NumberLong("170000000000") 
}

Now we need to aggregate these reports based on the timestamp. So far, so easy:

db.channel_report.aggregate([{
  $group: {
    _id: {
      $dateToString: {
        format: "%Y",
        date: "$timestamp"
      }
    },
    sources_info: {
      $push: "$sources_info"
    }
  }
}]);

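But now it gets difficult for me. As you may have noticed, the sources_info object is the problem.

Instead of just "pushing" all the source info into an array per group, we need to actually accumulate it.

So, if we have something like this: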

{ 
    sources_info: [ 
    { 
     browsers: [ 
     { 
      name: "Chrome",
      count: 1 
     } 
     ] 
    }, 
    { 
     browsers: [ 
     { 
      name: "Chrome",
      count: 1 
     } 
     ] 
    } 
    ] 
}

The array should be reduced to this:

{ 
    sources_info: 
    { 
     browsers: [ 
     { 
      name: "Chrome",
      count: 2 
     } 
     ] 
    } 
}

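We are migrating our analytics from MySQL to MongoDB, but I don't know how to model this behavior in Mongo. Going by the documentation, I almost think it is not possible, at least not with the current data structure.

Is there a good solution for this? Or maybe even a different kind of data structure?

Cheers, Chris from StriveCDN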

Source

2017-09-14 Kr0e

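Have you tried $unwind and then grouping? –

Tried that, but then we are actually talking about 5 different groupings, right? Like browsers, operating_systems, conti... etc. It feels somewhat wrong and slow. If you could sketch out your plan, that would be great! – Kr0e

You need to be more specific. Your example document would in practice have more entries under "browsers", so the $push you are using would simply push every property from every document into the result array. That is also bound to break with any reasonable amount of data. So if what you actually expect is for all the keys to be reduced down to their totals, then you really need to say so in your question. It is also unclear why any of these properties are in arrays in the first place. So it would not hurt to elaborate and fully explain your expected result. –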

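The basic problem you have is that you are using "named keys" where you probably should be using values under a consistent property path. This means that instead of a key like "browsers", each entry should probably simply carry "type": "browser", and so on.

The reasoning for this should become apparent from the general approach to aggregating the data. It also really helps with querying. But the approaches below basically involve coercing the initial data format into this kind of structure so that it can be aggregated in the first place.

With the latest releases (MongoDB 3.4.4 and greater), we can use $objectToArray to work with your named keys, as follows: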
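As a rough illustration of that shape, here is a plain JavaScript sketch (nothing MongoDB-specific; the function name `flattenSourcesInfo` is made up for this example) that folds the named keys into a single array where each entry carries its own "type". It keeps the key names exactly as they appear in the sample document.

```javascript
// Hypothetical helper: turn { browsers: [...], country_ids: [...] }
// into one flat array of { type, name, count } entries.
function flattenSourcesInfo(sourcesInfo) {
  return Object.entries(sourcesInfo).flatMap(([type, entries]) =>
    entries.map(({ name, count }) => ({ type, name, count }))
  );
}

// Cut-down version of the sample document's sources_info
const sample = {
  browsers: [{ name: "Chrome", count: 2 }],
  country_ids: [{ name: "DE", count: 1 }]
};

console.log(flattenSourcesInfo(sample));
```

Storing the data this way would mean a single grouping on "type" and "name" covers every category, instead of one grouping per named key.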

db.channel_report.aggregate([
  { "$project": {
    "timestamp": 1,
    "sources": {
      "$reduce": {
        "input": {
          "$map": {
            "input": { "$objectToArray": "$sources_info" },
            "as": "s",
            "in": {
              "$map": {
                "input": "$$s.v",
                "as": "v",
                "in": {
                  "type": "$$s.k",
                  "name": "$$v.name",
                  "count": "$$v.count"
                }
              }
            }
          }
        },
        "initialValue": [],
        "in": { "$concatArrays": ["$$value", "$$this"] }
      }
    }
  }},
  { "$unwind": "$sources" },
  { "$group": {
    "_id": {
      "year": { "$year": "$timestamp" },
      "type": "$sources.type",
      "name": "$sources.name"
    },
    "count": { "$sum": "$sources.count" }
  }},
  { "$group": {
    "_id": { "year": "$_id.year", "type": "$_id.type" },
    "v": { "$push": { "name": "$_id.name", "count": "$count" } }
  }},
  { "$group": {
    "_id": "$_id.year",
    "sources_info": {
      "$push": { "k": "$_id.type", "v": "$v" }
    }
  }},
  { "$addFields": {
    "sources_info": { "$arrayToObject": "$sources_info" }
  }}
])

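Taking this back a notch to MongoDB 3.4 (which should by now be the default on most hosted services), you can alternatively declare each key name manually: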

db.channel_report.aggregate([
  { "$project": {
    "timestamp": 1,
    "sources": {
      "$concatArrays": [
        { "$map": {
          "input": "$sources_info.browsers",
          "in": {
            "type": "browsers",
            "name": "$$this.name",
            "count": "$$this.count"
          }
        }},
        { "$map": {
          "input": "$sources_info.operating_systems",
          "in": {
            "type": "operating_systems",
            "name": "$$this.name",
            "count": "$$this.count"
          }
        }},
        { "$map": {
          "input": "$sources_info.continent_ids",
          "in": {
            "type": "continent_ids",
            "name": "$$this.name",
            "count": "$$this.count"
          }
        }},
        { "$map": {
          "input": "$sources_info.country_ids",
          "in": {
            "type": "country_ids",
            "name": "$$this.name",
            "count": "$$this.count"
          }
        }},
        { "$map": {
          "input": "$sources_info.city_ids",
          "in": {
            "type": "city_ids",
            "name": "$$this.name",
            "count": "$$this.count"
          }
        }}
      ]
    }
  }},
  { "$unwind": "$sources" },
  { "$group": {
    "_id": {
      "year": { "$year": "$timestamp" },
      "type": "$sources.type",
      "name": "$sources.name"
    },
    "count": { "$sum": "$sources.count" }
  }},
  { "$group": {
    "_id": { "year": "$_id.year", "type": "$_id.type" },
    "v": { "$push": { "name": "$_id.name", "count": "$count" } }
  }},
  { "$group": {
    "_id": "$_id.year",
    "sources": {
      "$push": { "k": "$_id.type", "v": "$v" }
    }
  }},
  { "$project": {
    "sources_info": {
      "browsers": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "browsers" ] }
        ]
      },
      "operating_systems": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "operating_systems" ] }
        ]
      },
      "continent_ids": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "continent_ids" ] }
        ]
      },
      "country_ids": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "country_ids" ] }
        ]
      },
      "city_ids": {
        "$arrayElemAt": [
          "$sources.v",
          { "$indexOfArray": [ "$sources.k", "city_ids" ] }
        ]
      }
    }
  }}
])

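We could even wind that back to MongoDB 3.2 by using $map and $filter instead of $indexOfArray, but the general approach is the main thing to explain.

Concatenate Arrays

The main thing that needs to happen is to take the data from the many different arrays under named keys and make a "single array" with a "type" property representing each key name. This is arguably how the data should have been stored in the first place, and the first aggregation stage of either approach comes out like this: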
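A possible shape for that MongoDB 3.2 variant of the final stage is sketched below. This is an untested assumption rather than something verified against a 3.2 server: the helper name `valuesForKey` and the `finalProject` variable are invented for the example, and only the last "$project" stage of the pipeline above is replaced.

```javascript
// Hypothetical helper: builds an expression that pulls the "v" array
// for one key name out of the [{ k, v }, ...] array held in "$sources".
const valuesForKey = key => ({
  "$arrayElemAt": [
    { "$map": {
      // keep only the entry whose "k" matches the wanted key name
      "input": {
        "$filter": {
          "input": "$sources",
          "as": "s",
          "cond": { "$eq": ["$$s.k", key] }
        }
      },
      "as": "s",
      "in": "$$s.v"
    }},
    0
  ]
});

// Key names taken from the sample document in the question
const keys = ["browsers", "operating_systems", "continent_ids", "country_ids", "city_ids"];

// Candidate replacement for the final "$project" stage
const finalProject = {
  "$project": {
    "sources_info": keys.reduce(
      (acc, key) => Object.assign(acc, { [key]: valuesForKey(key) }), {})
  }
};

console.log(JSON.stringify(finalProject, null, 2));
```

Building the stage from a list of key names keeps the five near-identical expressions in one place, at the cost of being slightly less obvious to read than the spelled-out version.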

/* 1 */ 
{ 
    "_id" : ObjectId("59b9d08e402025326e1a0f30"), 
    "timestamp" : ISODate("2017-09-14T00:42:54.510Z"), 
    "sources" : [ 
     { 
      "type" : "browsers", 
      "name" : "Chrome", 
      "count" : NumberLong(2) 
     }, 
     { 
      "type" : "operating_systems", 
      "name" : "Mac OS X", 
      "count" : NumberLong(2) 
     }, 
     { 
      "type" : "continent_ids", 
      "name" : "EU", 
      "count" : NumberLong(1) 
     }, 
     { 
      "type" : "country_ids", 
      "name" : "DE", 
      "count" : NumberLong(1) 
     }, 
     { 
      "type" : "city_ids", 
      "name" : "Solingen", 
      "count" : NumberLong(1) 
     } 
    ] 
}

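$unwind and $group

The part of the data you actually want to accumulate on includes those "type" and "name" properties from "inside" the array. Whenever you need to accumulate across documents on values from "inside an array", the process you use is $unwind, so that those values can be accessed as part of the grouping key.

This means that after using $unwind on the combined array, you then $group on both of those keys plus the reduced "timestamp" detail in order to $sum the "count" values. Since you then want the "sub-level" detail (i.e. each browser name under browsers), you use additional $group pipeline stages, progressively reducing the granularity of the grouping key and using $push to accumulate the detail into arrays.

In either case, omitting the final stage, the accumulated structure comes out as: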
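To make the mechanics concrete, the first "$unwind" plus "$group" step can be imitated in plain JavaScript. This is only a sketch of the idea, not how the server implements it; the function name `accumulateCounts` and the second sample document are invented for the illustration.

```javascript
// Plain-JavaScript imitation of "$unwind" followed by "$group" with "$sum".
// `docs` is assumed to already hold the combined "sources" array per document.
function accumulateCounts(docs) {
  const totals = new Map();
  for (const doc of docs) {
    const year = doc.timestamp.getUTCFullYear();
    // "$unwind": visit every array entry as if it were its own document
    for (const { type, name, count } of doc.sources) {
      // "$group": the compound key plays the role of _id
      const key = JSON.stringify([year, type, name]);
      totals.set(key, (totals.get(key) || 0) + count);
    }
  }
  return [...totals].map(([key, count]) => {
    const [year, type, name] = JSON.parse(key);
    return { _id: { year, type, name }, count };
  });
}

// Two made-up reports, each counting one Chrome visitor
const docs = [
  { timestamp: new Date("2017-09-14T00:42:54.510Z"),
    sources: [{ type: "browsers", name: "Chrome", count: 1 }] },
  { timestamp: new Date("2017-09-15T10:00:00.000Z"),
    sources: [{ type: "browsers", name: "Chrome", count: 1 }] }
];

console.log(accumulateCounts(docs));
```

The later $group stages then only regroup these rows by coarser keys, which is why they use $push rather than $sum.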

/* 1 */ 
{ 
    "_id" : 2017, 
    "sources_info" : [ 
     { 
      "k" : "continent_ids", 
      "v" : [ 
       { 
        "name" : "EU", 
        "count" : NumberLong(1) 
       } 
      ] 
     }, 
     { 
      "k" : "city_ids", 
      "v" : [ 
       { 
        "name" : "Solingen", 
        "count" : NumberLong(1) 
       } 
      ] 
     }, 
     { 
      "k" : "country_ids", 
      "v" : [ 
       { 
        "name" : "DE", 
        "count" : NumberLong(1) 
       } 
      ] 
     }, 
     { 
      "k" : "browsers", 
      "v" : [ 
       { 
        "name" : "Chrome", 
        "count" : NumberLong(2) 
       } 
      ] 
     }, 
     { 
      "k" : "operating_systems", 
      "v" : [ 
       { 
        "name" : "Mac OS X", 
        "count" : NumberLong(2) 
       } 
      ] 
     } 
    ] 
}

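This really is the final state of the data, though not represented in the same form in which it was originally found. The aggregation is arguably complete at this point, since any further processing is merely cosmetic, in order to output named keys again.

Output to Named Keys

The approaches shown vary between looking up the array entry by its matching key name, or using $arrayToObject to transform the array content back into an object with named keys.

The alternative is to simply do that very last manipulation in code, as shown in this example, which uses .map() to manipulate the cursor results in the shell: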

db.channel_report.aggregate([
  { "$project": {
    "timestamp": 1,
    "sources": {
      "$reduce": {
        "input": {
          "$map": {
            "input": { "$objectToArray": "$sources_info" },
            "as": "s",
            "in": {
              "$map": {
                "input": "$$s.v",
                "as": "v",
                "in": {
                  "type": "$$s.k",
                  "name": "$$v.name",
                  "count": "$$v.count"
                }
              }
            }
          }
        },
        "initialValue": [],
        "in": { "$concatArrays": ["$$value", "$$this"] }
      }
    }
  }},
  { "$unwind": "$sources" },
  { "$group": {
    "_id": {
      "year": { "$year": "$timestamp" },
      "type": "$sources.type",
      "name": "$sources.name"
    },
    "count": { "$sum": "$sources.count" }
  }},
  { "$group": {
    "_id": { "year": "$_id.year", "type": "$_id.type" },
    "v": { "$push": { "name": "$_id.name", "count": "$count" } }
  }},
  { "$group": {
    "_id": "$_id.year",
    "sources_info": {
      "$push": { "k": "$_id.type", "v": "$v" }
    }
  }},
  /*
  { "$addFields": {
    "sources_info": { "$arrayToObject": "$sources_info" }
  }}
  */
]).map(d => Object.assign(d, {
  "sources_info": d.sources_info.reduce((acc, curr) =>
    Object.assign(acc, { [curr.k]: curr.v }), {})
}))

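This of course applies to either aggregation pipeline approach.

And of course even $concatArrays can be replaced with $setUnion, as long as all entries have a unique identifying combination of "name" and "type" (as they appear to). With the further modification of producing the final output by processing the cursor instead, you could apply the technique all the way back to MongoDB 2.6.

Final Output

And the final output (actually aggregated of course, but the question only samples one document), accumulating all the sub-keys and reconstructed from the last sample output shown, is: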
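A sketch of what that $setUnion first stage might look like is given below. Treat it as an untested assumption: the helper name `tagEntries` and the `firstStage` variable are invented here, "as" is spelled out rather than relying on the default "$$this" variable, and since $setUnion treats its inputs as sets, exact duplicate entries collapse and the order of the combined array is not guaranteed.

```javascript
// Key names taken from the sample document in the question
const keys = ["browsers", "operating_systems", "continent_ids", "country_ids", "city_ids"];

// Hypothetical helper: tag every entry of one named array with its key name
const tagEntries = key => ({
  "$map": {
    "input": "$sources_info." + key,
    "as": "el",
    "in": { "type": key, "name": "$$el.name", "count": "$$el.count" }
  }
});

// Candidate replacement for the first "$project" stage,
// using "$setUnion" where the earlier pipelines use "$concatArrays"
const firstStage = {
  "$project": {
    "timestamp": 1,
    "sources": { "$setUnion": keys.map(tagEntries) }
  }
};

console.log(JSON.stringify(firstStage, null, 2));
```

Because every entry carries a distinct "type" and "name" pair within one document, the set semantics should not drop anything that the concatenation would have kept.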

{ 
    "_id" : 2017, 
    "sources_info" : { 
     "continent_ids" : [ 
      { 
       "name" : "EU", 
       "count" : NumberLong(1) 
      } 
     ], 
     "city_ids" : [ 
      { 
       "name" : "Solingen", 
       "count" : NumberLong(1) 
      } 
     ], 
     "country_ids" : [ 
      { 
       "name" : "DE", 
       "count" : NumberLong(1) 
      } 
     ], 
     "browsers" : [ 
      { 
       "name" : "Chrome", 
       "count" : NumberLong(2) 
      } 
     ], 
     "operating_systems" : [ 
      { 
       "name" : "Mac OS X", 
       "count" : NumberLong(2) 
      } 
     ] 
    } 
}

Here each array entry under each key of sources_info is reduced to the accumulated count of all entries sharing the same "name".

Source

2017-09-14 11:31:20
