How to aggregate data in Google AppEngine

I am trying to implement a summary view of a large(ish) dataset using AppEngine.

My model looks like:

from google.appengine.ext import db

class TxRecord(db.Model):
    expense_type = db.StringProperty()
    amount = db.IntegerProperty()

class ExpenseType(db.Model):
    name = db.StringProperty()
    total = db.IntegerProperty()

My datastore contains 100K instances of TxRecord, and I would like to summarise these by expense_type.

In SQL it would be something like:

select expense_type as name, sum(amount) as total 
    from TxRecord 
    group by expense_type
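
For reference, the result that query describes can be sketched in plain Python. This is only an in-memory illustration, not App Engine code; the `summarise` function and the sample records are made up for the example.

```python
from collections import defaultdict

def summarise(records):
    """Sum amounts per expense type, like GROUP BY expense_type."""
    totals = defaultdict(int)
    for expense_type, amount in records:
        totals[expense_type] += amount
    return dict(totals)

# Made-up sample data: (expense_type, amount) pairs.
sample = [("travel", 120), ("food", 30), ("travel", 80), ("food", 15)]
totals = summarise(sample)
```

Each distinct expense_type ends up as one key whose value is the sum of its amounts, which is what the ExpenseType entities are meant to hold.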

What I'm currently doing is using the Python MapReduce framework to iterate over all of the TxRecords with the following mapper:

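from mapreduce import operation as op

def generate_expense_type(rec):
    # Fetch (or create) the summary entity keyed by the expense type name.
    expense_type = ExpenseType.get_or_insert(
        rec.expense_type, name=rec.expense_type, total=0)
    expense_type.total += rec.amount

    yield op.db.Put(expense_type)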

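This seems to work, but I feel I have to run it with a shard_count of 1 to ensure that the total isn't overwritten by concurrent writes.

Is there a strategy I can use to get around this problem on AppEngine, or is that it?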

2011-03-27 Gareth Davis

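Using mapreduce is the right approach. Counters, as David suggests, are one option, but they are not reliable (they use memcache), and they are not designed for keeping a large number of counters in parallel.

Your current mapreduce has a couple of issues. First, get_or_insert executes a datastore transaction every time it is called. Second, you then update the amount outside the transaction and store it a second time asynchronously, which creates exactly the concurrency problem you were worried about.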
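
The second issue is a classic lost update. A minimal sketch of it, using a plain dict as a stand-in for the datastore (the `read` and `write` helpers and the amounts are hypothetical, chosen only to make the interleaving visible):

```python
# In-memory stand-in for the datastore: key name -> total.
store = {"travel": 0}

def read(key):
    return store[key]

def write(key, value):
    store[key] = value

# Two mapper shards handle one record each (amounts 100 and 50).
# Both read the total before either has written its update back.
seen_by_a = read("travel")
seen_by_b = read("travel")
write("travel", seen_by_a + 100)
write("travel", seen_by_b + 50)  # overwrites shard A's update

lost_update_total = store["travel"]  # 50, although 150 was expected
```

With a single shard the reads and writes never interleave this way, which is why shard_count = 1 appears to work.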

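At least until reduce is fully supported, your best option is to do the whole update in the mapper inside a single transaction, like this: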

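def generate_expense_type(rec):
    name = rec.expense_type

    def _tx():
        # Read, update and write inside one transaction.
        expense_type = ExpenseType.get_by_key_name(name)
        if not expense_type:
            expense_type = ExpenseType(key_name=name, name=name, total=0)
        expense_type.total += rec.amount
        expense_type.put()

    db.run_in_transaction(_tx)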
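
Why the transaction helps can be illustrated with a toy model of optimistic concurrency: a write is accepted only if the entity has not changed since it was read, and the whole read-modify-write is retried otherwise. This is a simplified sketch, not the datastore's actual implementation; `VersionedStore`, `Collision` and `add_in_transaction` are hypothetical names.

```python
class Collision(Exception):
    """Raised when an entity changed between read and commit."""

class VersionedStore(object):
    def __init__(self):
        self._data = {}  # key -> (version, total)

    def get(self, key):
        return self._data.get(key, (0, 0))

    def commit(self, key, read_version, total):
        current_version, _ = self.get(key)
        if current_version != read_version:
            raise Collision(key)
        self._data[key] = (current_version + 1, total)

def add_in_transaction(store, key, amount, retries=3):
    """Read, add and commit as one unit, retrying on collision."""
    for _ in range(retries + 1):
        version, total = store.get(key)
        try:
            store.commit(key, version, total + amount)
            return
        except Collision:
            continue
    raise Collision(key)

store = VersionedStore()
for amount in (100, 50, 25):
    add_in_transaction(store, "travel", amount)
final_version, final_total = store.get("travel")

# A writer holding a stale version is rejected instead of overwriting.
try:
    store.commit("travel", 0, 10)
    stale_write_accepted = True
except Collision:
    stale_write_accepted = False
```

The cost is contention: every record of the same expense type competes for the same entity, so heavily used types will see retries when many shards run at once.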

2011-03-28 01:57:31

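Using the MapReduce framework is a good idea. You can use more than one shard if you take advantage of the counters the MapReduce framework provides. So, instead of modifying the datastore on every record, you could do something like this: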

yield op.counters.Increment("total_<expense_type_name>", rec.amount)

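Once the MapReduce has finished (hopefully much sooner than when you were using just one shard), you can copy the finalized counters into your datastore entities.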
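
The shape of that approach can be sketched in plain Python: each shard accumulates its own counters, and the counters are merged and copied out only once at the end. This does not use the MapReduce library itself; `map_shard`, the sample records and the `expense_totals` dict (standing in for the ExpenseType entities) are assumptions for illustration.

```python
from collections import Counter

def map_shard(records):
    """Accumulate per-type totals for one shard, touching no shared state."""
    counters = Counter()
    for expense_type, amount in records:
        counters["total_" + expense_type] += amount
    return counters

# Made-up records split across two shards.
shard_inputs = [
    [("travel", 120), ("food", 30)],
    [("travel", 80), ("food", 15), ("hotel", 200)],
]

# Merge the shard-local counters once the job has finished.
final_counters = Counter()
for shard in shard_inputs:
    final_counters.update(map_shard(shard))

# Copy the merged counters into per-type totals (the "entities").
PREFIX = "total_"
expense_totals = {
    name[len(PREFIX):]: total for name, total in final_counters.items()
}
```

Because no shard writes to a shared total while mapping, the number of shards no longer affects correctness in this model.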

2011-03-27 09:50:18

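I'm currently trying something similar using memcache entries. What I can't work out with op.counters is how to get at the counters in the callback handler... perhaps time for another question? – 2011-03-27 11:29:22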

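MapReduce is great for processing data offline, and I like David's solution for handling the counters (+1 upvote).

I just wanted to mention another option: process the data as it comes in. Check out Brett Slatkin's talk from IO 2010, High Throughput Data Pipelines on App Engine.

I have implemented that technique in a simple framework (slagg); you might find my example of grouping with date rollup useful.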
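
The general idea of aggregating on arrival, with a date rollup, can be sketched as follows. This is not slagg's API or the pipeline from the talk, only a plain-Python illustration; it also assumes each record carries a date, which the TxRecord model in the question does not have.

```python
from collections import defaultdict
from datetime import date

# Running totals keyed by (expense_type, period); hypothetical layout.
totals = defaultdict(int)

def record_expense(expense_type, amount, day):
    """Update the day, month and all-time totals as a record arrives."""
    for period in (day.isoformat(), day.strftime("%Y-%m"), "all"):
        totals[(expense_type, period)] += amount

record_expense("travel", 120, date(2011, 3, 27))
record_expense("travel", 80, date(2011, 3, 28))
record_expense("food", 30, date(2011, 3, 28))
```

The summary is then always up to date and no batch job over all 100K records is needed, at the price of extra writes for every incoming record.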

2011-03-27 17:16:46
