2010-12-08 40 views
1

Algorithm to implement something like SQL's GROUP BY, with aggregate functions, over a collection?

Say you have an array like this:

[ 
    {'id' : 1, 'closed' : 1 }, 
    {'id' : 2, 'closed' : 1 }, 
    {'id' : 5, 'closed' : 1 }, 
    {'id' : 7, 'closed' : 0 }, 
    {'id' : 8, 'closed' : 0 }, 
    {'id' : 9, 'closed' : 1 } 
] 

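I want to summarize this data set (without using SQL!) and grab the min and max id for each group, where a group is defined by a change in the row's 'closed' value. The output should look like this: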

[ 
    {'id__min' : 1, 'id__max' : 5, 'closed' : 1}, 
    {'id__min' : 7, 'id__max' : 8, 'closed' : 0}, 
    {'id__min' : 9, 'id__max' : 9, 'closed' : 1} 
] 

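This is just an example of what I want to do. I want to implement something similar to what Python's itertools.groupby provides, but more general (I want to define my own aggregate functions).

I'm looking for pointers, pseudocode, or even any PHP, Python, or JavaScript code if possible.

Thanks!

Answers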

2

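The key argument to itertools.groupby() lets you pass your own function to decide how consecutive items are grouped; you can then apply whatever aggregation you like to each group.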
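As a minimal sketch of that idea, applied to the data from the question: `groupby` splits the rows into runs of consecutive equal keys, and a separate function aggregates each run. The helper names `summarise` and `min_max_id` are made up for this illustration and are not part of itertools.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {'id': 1, 'closed': 1},
    {'id': 2, 'closed': 1},
    {'id': 5, 'closed': 1},
    {'id': 7, 'closed': 0},
    {'id': 8, 'closed': 0},
    {'id': 9, 'closed': 1},
]

def summarise(rows, key, aggregate):
    """Group consecutive rows that share key(row), then aggregate each run."""
    # groupby only merges *adjacent* items with equal keys, which is exactly
    # the "new group whenever 'closed' changes" behaviour asked for.
    return [aggregate(k, list(run)) for k, run in groupby(rows, key=key)]

def min_max_id(closed, run):
    """Hypothetical aggregate: smallest and largest id within one run."""
    ids = [row['id'] for row in run]
    return {'id__min': min(ids), 'id__max': max(ids), 'closed': closed}

result = summarise(rows, itemgetter('closed'), min_max_id)
print(result)
```

Swapping in a different `aggregate` (a count, a sum, a list of ids) changes the summary without touching the grouping step.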

+0

I know; I'm looking for a generic way to implement this in another language (PHP for now). – 2010-12-08 21:10:36

+0

The documentation gives the equivalent functionality as low-level code. Feel free to convert it. – 2010-12-08 21:13:52

1

Ruby code:

def summarise(array_of_hashes)
  # nothing to summarise for an empty input
  return [] if array_of_hashes.empty?
  # first sort the list by id
  arr = array_of_hashes.sort { |a, b| a['id'] <=> b['id'] }
  # create a hash with id_min and id_max set to the id of the first
  # array element and closed set to the closed of the first array element
  hash = {}
  hash['id_min'] = hash['id_max'] = arr[0]['id']
  hash['closed'] = arr[0]['closed']
  # prepare an output array
  output = []
  # iterate over the array elements
  arr.each do |el|
    if el['closed'] == hash['closed']
      # update id_max while the closed value is the same
      hash['id_max'] = el['id']
    else # once it is different
      output.push hash # add the hash to the output array
      hash = {} # create a new hash in place of the old one
      # and initialise its keys to the appropriate values
      hash['id_min'] = hash['id_max'] = el['id']
      hash['closed'] = el['closed']
    end
  end
  output.push hash # make sure the final hash is added to the output array
  # return the output array
  output
end

Generalized version:

def summarise(data, condition, group_func)
  # nothing to group for an empty input
  return [] if data.empty?
  # store the first hash in a variable to compare against
  pivot = data[0]
  to_group = []
  output = []
  # iterate through the array
  data.each do |datum|
    # if the comparison of this datum to the pivot datum fits the condition
    if condition.call(pivot, datum)
      # add this datum to the to_group list
      to_group.push datum
    else # once the condition no longer matches
      # apply the aggregating function to the list to group and add it to the output array
      output.push group_func.call(to_group)
      # reset the to_group list and add this element to it
      to_group = [datum]
      # set the pivot to this element
      pivot = datum
    end
  end
  # make sure the final list to group is aggregated and added to the output list
  output.push group_func.call(to_group)
  # return the output list
  output
end

The following code will then work for your example:

my_condition = lambda do |a, b|
  b['closed'] == a['closed']
end

my_group_func = lambda do |to_group|
  {
    'id_min' => to_group.first['id'],
    'id_max' => to_group.last['id'],
    'closed' => to_group.first['closed']
  }
end

summarise(my_array.sort { |a, b| a['id'] <=> b['id'] }, my_condition, my_group_func)

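The generalized algorithm will work in any language that allows functions to be passed as arguments to other functions. With the right condition and aggregating functions, it can also handle arrays of any data type.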
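To illustrate the portability claim, here is a rough Python port of the generalized algorithm. It is a sketch rather than a line-by-line translation: the names `summarise_runs`, `same_group`, and `aggregate` are chosen for this example, and an explicit guard for empty input is added. The sample data is a plain list of numbers, grouped while each value stays within 10 of the run's first element, a condition that cannot be expressed as simple key equality.

```python
def summarise_runs(data, same_group, aggregate):
    """Split data into consecutive runs and aggregate each run.

    same_group(pivot, item) decides whether item still belongs to the run
    that started at pivot; aggregate(run) turns a list of items into a summary.
    """
    output = []
    to_group = []
    pivot = None
    for item in data:
        # close the current run once the condition stops matching
        if to_group and not same_group(pivot, item):
            output.append(aggregate(to_group))
            to_group = []
        # the first item of every run becomes the new pivot
        if not to_group:
            pivot = item
        to_group.append(item)
    # flush the final run, if there is one
    if to_group:
        output.append(aggregate(to_group))
    return output

values = [1, 3, 4, 20, 22, 50]
totals = summarise_runs(values, lambda pivot, x: x - pivot < 10, sum)
print(totals)
```

Because both the condition and the aggregate are ordinary function arguments, the same loop handles dictionaries, numbers, or any other element type.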

+0

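This works, but what I need is to generalize the method. That means being able to pass in my own aggregate functions, or to define in a standard way the condition that starts a new group. – 2010-12-08 21:46:04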

+0

I've edited my answer to give a generalized version. I hope the code plus comments are enough to allow porting to other languages. – david4dev 2010-12-08 22:28:12

0

A PHP version of the Ruby code, with slightly more generic naming and handling of ids that arrive out of order:

$input = array(
    array('id' => 3, 'closed' => 1), 
    array('id' => 2, 'closed' => 1), 
    array('id' => 5, 'closed' => 1), 
    array('id' => 7, 'closed' => 0), 
    array('id' => 8, 'closed' => 0), 
    array('id' => 9, 'closed' => 1) 
); 

$output = min_max_group($input, 'id', 'closed'); 
echo '<pre>'; print_r($output); echo '</pre>'; 

function min_max_group($array, $name, $group_by)
{
    $output = array();

    // nothing to group for an empty input
    if (empty($array)) { return $output; }

    // seed the first group from the first row
    $tmp = array();
    $tmp[$name.'__max'] = $tmp[$name.'__min'] = $array[0][$name];
    $tmp[$group_by] = $array[0][$group_by];

    foreach ($array as $value)
    {
        if ($value[$group_by] == $tmp[$group_by])
        {
            // same group: widen the min/max range if needed
            if ($value[$name] < $tmp[$name.'__min']) { $tmp[$name.'__min'] = $value[$name]; }
            if ($value[$name] > $tmp[$name.'__max']) { $tmp[$name.'__max'] = $value[$name]; }
        }
        else
        {
            // group changed: store the finished group and start a new one
            $output[] = $tmp;

            $tmp[$name.'__max'] = $tmp[$name.'__min'] = $value[$name];
            $tmp[$group_by] = $value[$group_by];
        }
    }

    // store the final group
    $output[] = $tmp;

    return $output;
}
0

Maybe I've misunderstood the question, but isn't this just a standard map/reduce problem?
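One way to read that suggestion is as a fold over the rows. The sketch below uses Python's `functools.reduce` with a hypothetical reducer named `fold_row`; it assumes the rows are already in the desired order. Since each group depends on the row before it, this is a sequential left fold rather than a map/reduce that could be split across workers.

```python
from functools import reduce

rows = [
    {'id': 1, 'closed': 1},
    {'id': 2, 'closed': 1},
    {'id': 5, 'closed': 1},
    {'id': 7, 'closed': 0},
    {'id': 8, 'closed': 0},
    {'id': 9, 'closed': 1},
]

def fold_row(groups, row):
    """Reducer: extend the last group if 'closed' matches, else start a new one."""
    if groups and groups[-1]['closed'] == row['closed']:
        last = groups[-1]
        merged = {
            'id__min': min(last['id__min'], row['id']),
            'id__max': max(last['id__max'], row['id']),
            'closed': last['closed'],
        }
        # build a new list instead of mutating the accumulator in place
        return groups[:-1] + [merged]
    new_group = {'id__min': row['id'], 'id__max': row['id'], 'closed': row['closed']}
    return groups + [new_group]

result = reduce(fold_row, rows, [])
print(result)
```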