2017-03-05 68 views
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

data = pd.read_csv('list.csv') 
print(data)

How do I compare items in a pandas DataFrame? I am trying to take the data from this table and count the matches:

https://i.stack.imgur.com/PMWay.png

I would also like to optimize the code for large DataFrames, keeping only the tickets that have more than one client:

double_tickets = data.TICKET.value_counts() > 1
notas_slice = double_tickets[double_tickets]
print(notas_slice)

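I have only just started using pandas, and I do not know where to begin with this problem.

Edit:

I want to count the co-occurrences between two clients. As in the image example (https://i.stack.imgur.com/PMWay.png), clients 14613 and 43733 appear together on two tickets, so they co-occur twice.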

Answer


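You can use duplicated with the parameter keep=False to get a mask of all duplicates, that is, every row whose TICKET value occurs 2 or more times. Filter with boolean indexing, then use loc with this mask to select the values of the Client column: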

print (df.TICKET.duplicated(keep=False)) 
0  False 
1  False 
2  True 
3  True 
4  True 
5  False 
6  True 
7  True 
8  False 
9  True 
10  True 
11  True 
12  True 
Name: TICKET, dtype: bool 

print (df.loc[df.TICKET.duplicated(keep=False), 'Client']) 
2  14613 
3  36735 
4  43733 
6  24456 
7  27919 
9  14613 
10 31725 
11 37547 
12 43733 
Name: Client, dtype: int64 
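
The snippets above assume that df already holds the table from the question. To try the idea without list.csv, a minimal sketch with made-up data might look like this (the ticket numbers and client IDs are hypothetical, not the values from the screenshot):

```python
import pandas as pd

# Hypothetical sample data, NOT the values from the question's screenshot.
df = pd.DataFrame({
    'TICKET': [101, 102, 102, 103, 103, 103, 104],
    'Client': [1, 2, 3, 2, 3, 4, 5],
})

# True for every row whose TICKET value occurs more than once.
mask = df.TICKET.duplicated(keep=False)

# Clients on tickets that are shared with at least one other row.
clients = df.loc[mask, 'Client']
print(clients.tolist())
```

Tickets 101 and 104 occur only once here, so their rows are masked out and only the clients on tickets 102 and 103 remain.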

Then apply value_counts and, if needed, filter again with boolean indexing:

s = df.loc[df.TICKET.duplicated(keep=False), 'Client'].value_counts() 
print (s) 
43733 2 
14613 2 
36735 1 
31725 1 
37547 1 
24456 1 
27919 1 
Name: Client, dtype: int64 

print (s[s > 1]) 
43733 2 
14613 2 
Name: Client, dtype: int64 

If necessary, finish with reset_index to convert the Series to a DataFrame:

df1 = s[s > 1].reset_index() 
df1.columns = ['Client','Count'] 
print (df1) 
    Client Count 
0 43733  2 
1 14613  2 

The solution with groupby and filter is slower:

s = df.groupby('TICKET').filter(lambda x: len(x) > 1)['Client'].value_counts() 
print (s) 
43733 2 
14613 2 
36735 1 
31725 1 
37547 1 
24456 1 
27919 1 
Name: Client, dtype: int64

Timings:

In [46]: %timeit (df.loc[df.TICKET.duplicated(keep=False), 'Client'].value_counts())
1000 loops, best of 3: 769 µs per loop 

In [47]: %timeit (df.groupby('TICKET').filter(lambda x: len(x) > 1)['Client'].value_counts()) 
100 loops, best of 3: 2.55 ms per loop 

#[1300000 rows x 2 columns] 
df = pd.concat([df]*100000).reset_index(drop=True) 
#print (df) 

In [53]: %timeit (df.loc[df.TICKET.duplicated(keep=False), 'Client'].value_counts()) 
10 loops, best of 3: 54.8 ms per loop 

In [54]: %timeit (df.groupby('TICKET').filter(lambda x: len(x) > 1)['Client'].value_counts()) 
1 loop, best of 3: 282 ms per loop 
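
%timeit is an IPython magic, so the timings above cannot be run as a plain script. A rough sketch of the same comparison with the standard timeit module could look like this; the data, the helper names with_duplicated and with_filter, and the repeat counts are assumptions for illustration, and the absolute numbers will differ by machine and pandas version:

```python
import timeit

import pandas as pd

# Hypothetical data, repeated to get a larger frame.
small = pd.DataFrame({
    'TICKET': [101, 102, 102, 103, 103, 103, 104],
    'Client': [1, 2, 3, 2, 3, 4, 5],
})
big = pd.concat([small] * 1000).reset_index(drop=True)


def with_duplicated(frame):
    # Boolean mask from duplicated(keep=False), then count clients.
    return frame.loc[frame.TICKET.duplicated(keep=False), 'Client'].value_counts()


def with_filter(frame):
    # Same result via groupby + filter, which calls the lambda per group.
    return frame.groupby('TICKET').filter(lambda x: len(x) > 1)['Client'].value_counts()


# Both approaches should give the same counts per client.
assert with_duplicated(big).sort_index().equals(with_filter(big).sort_index())

t_dup = timeit.timeit(lambda: with_duplicated(big), number=20)
t_filter = timeit.timeit(lambda: with_filter(big), number=20)
print('duplicated: %.4f s, groupby.filter: %.4f s' % (t_dup, t_filter))
```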

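Hmm... actually, I want to count the co-occurrences between two clients. As in the image example (https://i.stack.imgur.com/PMWay.png), clients 14613 and 43733 appear together on two tickets, so they co-occur twice. – EnigmA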
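
For the pair counting asked about in this comment, one possible approach (a sketch, not part of the answer above) is to join the table with itself on TICKET and count each pair of clients. The data below is hypothetical and only mimics the situation described, and the code assumes a client appears at most once per ticket; otherwise calling drop_duplicates on the frame first would probably be needed.

```python
import pandas as pd

# Hypothetical sample data: clients 14613 and 43733 share two tickets.
df = pd.DataFrame({
    'TICKET': [1, 1, 1, 2, 3, 3, 3],
    'Client': [14613, 36735, 43733, 24456, 14613, 31725, 43733],
})

# Self-join on TICKET: every combination of two rows on the same ticket.
pairs = df.merge(df, on='TICKET', suffixes=('_a', '_b'))

# Keep each unordered pair once and drop rows paired with themselves.
pairs = pairs[pairs.Client_a < pairs.Client_b]

# Number of tickets on which each pair of clients appears together.
counts = (pairs.groupby(['Client_a', 'Client_b'])
               .size()
               .reset_index(name='Count'))

# Pairs that occur together on more than one ticket.
result = counts[counts.Count > 1]
print(result)
```

Note that the self-join grows quadratically with the number of clients per ticket, so on very large tickets it may use a lot of memory.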