2017-10-10 155 views
-1

Python: collecting data from a dataset

I am very new to Python and I am trying to analyze the data in a dataset.

Say I have a dataset of food-tasting reviews. For example:

{'review/appearance': 2.5, 'food/style': 'Cook', 'review/taste': 1.5, 'food/type': 'Vegetable' .... } 
{'review/appearance': 5.0, 'food/style': 'Instant', 'review/taste': 4.5, 'food/type': 'Noodle' ....} 

I have 50,000 of these entries, and I am trying to find out how many different food types there are with the code below:

from collections import Counter

data = list(parseData("/Path/to/my/dataset/file"))

def feature(datum): 
    feat = [datum['food/type']] 
    return feat 

# make a separate list holding each entry's food type
foodStyle = [feature(d) for d in data] 

newFoodStyle = list() 

# flatten the foodStyle list of lists into a single list
for sublist in foodStyle:
    for item in sublist:
        newFoodStyle.append(item)

uniqueFood = Counter(newFoodStyle)  # count how many entries there are for each food type

a = "There are %s types of food" % (len(uniqueFood))
print(a)

# print(uniqueFood) gives something like Counter({'Noodle': 4352, 'Vegetable': 3412, ...})

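Now that I know how many different food types there are, I need a lot of help computing the average 'review/taste' for each unique food type in the dataset.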

I know there are 50K entries, so I would like to analyze only the 10 most-reviewed food types.

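Do I need to loop over every entry, look up each key in uniqueFood, and build a separate list for each food type (for example Noodle = list()), appending the matching 'review/taste' values to it?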

Any tips or ideas on how to approach this would be much appreciated.

+0

Try using a set and taking its length: https://docs.python.org/2/library/sets.html – SatanDmytro
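A minimal sketch of the set-based suggestion in the comment above, assuming every entry is a dict with a 'food/type' key. The `entries` list is hypothetical sample data standing in for the parsed dataset, and the sketch uses the built-in set type rather than the `sets` module from the link.

```python
# Hypothetical sample entries standing in for the parsed dataset.
entries = [
    {'review/taste': 1.5, 'food/type': 'Vegetable'},
    {'review/taste': 4.5, 'food/type': 'Noodle'},
    {'review/taste': 3.5, 'food/type': 'Noodle'},
]

# A set keeps a single copy of each value, so its length is the
# number of distinct food types.
unique_types = {entry['food/type'] for entry in entries}

print("There are %s types of food" % len(unique_types))
```

This only answers "how many types"; it does not keep the per-type counts that Counter provides.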

Answers

0

You can also use a dict:

data = list(parseData("/Path/to/my/dataset/file")) 

food_items = dict() 
for datum in data:
    food_style = datum['food/type']
    if food_style in food_items:
        food_items[food_style].append(datum)
    else:
        food_items[food_style] = [datum]

# unique food list 
unique_food = food_items.keys() 


a = "There are %s types of food" % (len(unique_food))
print(a)

# average 'review/taste' per food type (an entry without that key counts as 0)
avg = { 
    key: sum(map(lambda i: i.get('review/taste', 0), values))/float(len(values)) 
    for key, values in food_items.items() 
    if values 
} 
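The question also asks for only the 10 most-reviewed food types, which the code above does not cover. One possible sketch follows, assuming every entry has a 'food/type' key. The `entries` sample data and the `top_taste_averages` helper are hypothetical names, and unlike the dict comprehension above this version skips entries that have no 'review/taste' rather than counting them as 0.

```python
from collections import defaultdict

# Hypothetical sample entries standing in for list(parseData(...)).
entries = [
    {'review/taste': 1.5, 'food/type': 'Vegetable'},
    {'review/taste': 4.5, 'food/type': 'Noodle'},
    {'review/taste': 3.5, 'food/type': 'Noodle'},
]

def top_taste_averages(entries, n=10):
    """Return (food type, review count, mean taste) for the n most-reviewed types."""
    tastes = defaultdict(list)
    for entry in entries:
        # Skip entries without a taste score instead of counting them as 0.
        if 'review/taste' in entry:
            tastes[entry['food/type']].append(entry['review/taste'])
    # Sort by number of reviews, largest first, and keep the first n.
    ranked = sorted(tastes.items(), key=lambda kv: len(kv[1]), reverse=True)[:n]
    return [(food, len(scores), sum(scores) / float(len(scores)))
            for food, scores in ranked]

for food, count, mean in top_taste_averages(entries):
    print("%s: %d reviews, mean taste %.2f" % (food, count, mean))
```

Grouping only the scores, not the whole entries, keeps memory use small when the dataset has 50K rows.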
0

I would suggest loading the data into a pandas DataFrame; sorting and averaging then become easy. For example:

import pandas as pd 

datalist = [] 

dict1 = {'review/appearance': 2.5, 'food/style': 'Cook', 'review/taste': 1.5, 'food/type': 'Vegetable'} 
dict2 = {'review/appearance': 5.0, 'food/style': 'Instant', 'review/taste': 4.5, 'food/type': 'Noodle'} 
dict3 = {'review/appearance': 3.0, 'food/style': 'Instant', 'review/taste': 3.5, 'food/type': 'Noodle'}

datalist.append(dict1)
datalist.append(dict2)
datalist.append(dict3)

resultsDF = pd.DataFrame(datalist)

print(resultsDF.head())

# mean 'review/taste' for each (food/style, food/type) pair
AverageResults = resultsDF.groupby(["food/style","food/type"])["review/taste"].mean().reset_index()
print(AverageResults)

Result:

  food/style  food/type  review/taste
0       Cook  Vegetable           1.5
1    Instant     Noodle           4.0
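To limit the analysis to the 10 most-reviewed food types, as the question asks, the same groupby idea can be extended. This is only a sketch under assumptions: `entries` is hypothetical sample data, and grouping is done on 'food/type' alone rather than on the style and type pair used above.

```python
import pandas as pd

# Hypothetical sample entries standing in for the parsed dataset.
entries = [
    {'review/taste': 1.5, 'food/type': 'Vegetable'},
    {'review/taste': 4.5, 'food/type': 'Noodle'},
    {'review/taste': 3.5, 'food/type': 'Noodle'},
]

df = pd.DataFrame(entries)

# For each food type, compute the number of taste scores and their mean,
# then keep the 10 types with the most scores.
summary = (
    df.groupby("food/type")["review/taste"]
      .agg(["count", "mean"])
      .nlargest(10, "count")
)

print(summary)
```

Here `count` only counts non-missing taste scores, so a food type with many entries but few scores ranks by its scores.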