Adding up a column of data if an element from another column is in a dictionary

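For each IP address stored as a key in the dictionary frequency4, I want to check whether that IP address appears in column[4] of a data line in the text file and, if it does, keep adding up the byte count for that exact IP across the data file.
If the bytes value in column[8] is followed by an "M", meaning million, the M should be converted to "* 1000000", so 3.3 M becomes 3300000 (see the data from the text file below). Keep in mind that this is only a sample of the text file, which contains thousands of lines of data.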

The output I'm looking for is:

Total bytes for ip 172.217.9.133 is 3300000 
Total bytes for ip 205.251.24.253 is 9516 
Total bytes for ip 52.197.234.56 is 14546

CODE

from collections import OrderedDict 
from collections import Counter 

frequency4 = Counter({}) 
ttlbytes = 0 


with open('/Users/rm/Desktop/nettestWsum.txt', 'r') as infile:  
    next(infile) 
    for line in infile:  
     if "Summary:" in line: 
      break 
     try:    
      srcip = line.split()[4].rsplit(':', 1)[0] 
      frequency4[srcip] = frequency4.get(srcip,0) + 1 
      f4 = OrderedDict(frequency4.most_common()) 
      for srcip in f4: 
       ttlbytes += int(line.split()[8]) 
     except(ValueError): 
      pass 
print("\nTotal bytes for ip",srcip, "is:", ttlbytes)  
for srcip, count in f4.items():  
    print("\nIP address from destination:", srcip, "was found:", count, "times.")

DATA file

Date first seen   Duration Proto  Src IP Addr:Port   Dst IP Addr:Port Packets Bytes Flows 
2017-04-11 07:23:17.880 929.748 UDP  172.217.9.133:443 -> 205.166.231.250:41138  3019 3.3 M  1 
2017-04-11 07:38:40.994  6.676 TCP  205.251.24.253:443 -> 205.166.231.250:24723  16  4758  1 
2017-04-11 07:38:40.994  6.676 TCP  205.251.24.253:443 -> 205.166.231.250:24723  16  4758  1 
2017-04-11 07:38:41.258  6.508 TCP  52.197.234.56:443 -> 205.166.231.250:13712  14  7273  1 
2017-04-11 07:38:41.258  6.508 TCP  52.197.234.56:443 -> 205.166.231.250:13712  14  7273  1 
Summary: total flows: 22709, total bytes: 300760728, total packets: 477467, avg bps: 1336661, avg pps: 265, avg bpp: 629 
Time window: 2017-04-11 07:13:47 - 2017-04-11 07:43:47 
Total flows processed: 22709, Blocks skipped: 0, Bytes read: 1544328 
Sys: 0.372s flows/second: 61045.7 Wall: 0.374s flows/second: 60574.9


2017-04-18 k5man001

I'm not sure what you need the frequency for, but given your input, here is how to get the desired output:

from collections import Counter

# Total bytes per source IP
count = Counter()

with open('/Users/rm/Desktop/nettestWsum.txt', 'r') as infile:
    next(infile)  # skip the header line
    for line in infile:
        if "Summary:" in line:
            break  # the footer starts here

        parts = line.split()
        srcip = parts[4].rsplit(':', 1)[0]  # strip the port

        # '3.3 M' splits into '3.3' and 'M', so the suffix lands in parts[9]
        multiplier = 10**6 if parts[9] == 'M' else 1
        nbytes = round(float(parts[8]) * multiplier)
        count[srcip] += nbytes

for srcip, nbytes in count.most_common():
    print('Total bytes for ip', srcip, 'is', nbytes)
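If other unit suffixes can show up, the `M` check above could be factored into a small helper. This is only a sketch: the name `parse_bytes` and the `K` and `G` suffixes are my assumptions (only `M` appears in the sample data), and I am assuming decimal multipliers to match the "* 1000000" rule in the question.

```python
# Hypothetical suffix table; only "M" is confirmed by the sample data.
MULTIPLIERS = {"K": 10**3, "M": 10**6, "G": 10**9}


def parse_bytes(parts):
    """Return the Bytes value of an already split flow row as an integer."""
    value = float(parts[8])
    # When a suffix is present it is its own token right after the number;
    # otherwise parts[9] is the Flows column and falls back to multiplier 1.
    suffix = parts[9] if len(parts) > 9 else ""
    return round(value * MULTIPLIERS.get(suffix, 1))


row_m = "2017-04-11 07:23:17.880 929.748 UDP  172.217.9.133:443 -> 205.166.231.250:41138  3019 3.3 M  1"
row_plain = "2017-04-11 07:38:40.994  6.676 TCP  205.251.24.253:443 -> 205.166.231.250:24723  16  4758  1"
print(parse_bytes(row_m.split()), parse_bytes(row_plain.split()))
```

Inside the loop, `count[srcip] += parse_bytes(parts)` would then replace the two multiplier lines.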


2017-04-18 07:09:14

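This definitely works, but it needs to go hand in hand with the frequency4 dictionary: as part of my project I'm also counting how many times each IP address appears in the data and sorting them from most to least frequent, so I'd like to report not only that count but also the total bytes. – k5man001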
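One way to keep both figures together, as a minimal sketch: use two Counter objects keyed by source IP, one for occurrences and one for bytes, then sort by occurrences. The function name `summarize` and the inline sample list are illustrative assumptions, not part of the original code.

```python
from collections import Counter


def summarize(lines):
    """Return (hits, total_bytes) Counters keyed by source IP."""
    hits = Counter()         # number of rows each source IP appears in
    total_bytes = Counter()  # summed Bytes column per source IP
    for line in lines:
        if "Summary:" in line:
            break            # footer reached, no more flow rows
        parts = line.split()
        if len(parts) < 10:
            continue         # skip blank or malformed rows
        srcip = parts[4].rsplit(":", 1)[0]
        multiplier = 10**6 if parts[9] == "M" else 1
        hits[srcip] += 1
        total_bytes[srcip] += round(float(parts[8]) * multiplier)
    return hits, total_bytes


# Sample rows copied from the data file in the question (header already skipped).
sample = [
    "2017-04-11 07:23:17.880 929.748 UDP  172.217.9.133:443 -> 205.166.231.250:41138  3019 3.3 M  1",
    "2017-04-11 07:38:40.994  6.676 TCP  205.251.24.253:443 -> 205.166.231.250:24723  16  4758  1",
    "2017-04-11 07:38:40.994  6.676 TCP  205.251.24.253:443 -> 205.166.231.250:24723  16  4758  1",
    "2017-04-11 07:38:41.258  6.508 TCP  52.197.234.56:443 -> 205.166.231.250:13712  14  7273  1",
    "2017-04-11 07:38:41.258  6.508 TCP  52.197.234.56:443 -> 205.166.231.250:13712  14  7273  1",
    "Summary: total flows: 22709, total bytes: 300760728",
]

hits, total_bytes = summarize(sample)
for srcip, n in hits.most_common():
    print("IP", srcip, "was found", n, "times, total bytes:", total_bytes[srcip])
```

With a real file you would pass the open file object (after `next(infile)`) instead of the list.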

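Well, I'm not sure whether you need to edit the same file. If you just want to process the data and look at it, you could explore it with pandas, since it has many features that speed up data processing.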

import pandas as pd 
# A regex separator together with skipfooter needs the Python parsing engine 
df = pd.read_csv(filepath_or_buffer = '/Users/rm/Desktop/nettestWsum.txt', index_col = False, header = None, skiprows = 1, sep = r'\s\s+', skipfooter = 4, engine = 'python') 
# Drop the '->' column 
df.drop(labels = 3, axis = 1, inplace = True) 
columnnames = 'Date first seen,Duration Proto,Src IP Addr:Port,Dst IP Addr:Port,Packets,Bytes,Flows' 
columnnames = columnnames.split(',') 
df.columns = columnnames

This loads the data into a nice dataframe (table). I suggest reading the documentation for the pandas.read_csv method. To process the data, you can try the following.

# Convert values such as '3.3 M' to plain numbers (millions of bytes) 
df['Bytes'] = df['Bytes'].apply(lambda x: float(x[:-2])*1000000 if str(x)[-1] == 'M' else x) 
df['Bytes'] = pd.to_numeric(df['Bytes']) 
# numeric_only keeps the text columns out of the sum 
result = df.groupby(by = 'Dst IP Addr:Port').sum(numeric_only = True)

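Your data ends up in a nice dataframe (table) that you can work with. I think it is faster than looping, though you can test that separately. Here is what the data looks like after loading.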

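Below is the groupby result; you can adjust the output. I'm using the Spyder IDE, and the screenshots come from the variable explorer in the IDE. You can inspect the result by printing the dataframe or saving it as another CSV.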


2017-04-18 07:20:30 Impuls3H

I skipped the remaining metadata lines. If you need those, perhaps handle them separately with readlines or StringIO? – Impuls3H
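As a rough sketch of that idea, the file contents could be split at the `Summary:` line, with the header and flow rows wrapped in `io.StringIO` for a parser and the footer read separately. The function names `split_report` and `parse_summary` and the shortened sample text are illustrative assumptions.

```python
import io


def split_report(text):
    """Split report text into (buffer of header + flow rows, footer lines)."""
    lines = text.splitlines()
    # Index of the first footer line; everything before it is header and rows.
    cut = next(
        (i for i, line in enumerate(lines) if line.startswith("Summary:")),
        len(lines),
    )
    rows = io.StringIO("\n".join(lines[:cut]) + "\n")
    return rows, lines[cut:]


def parse_summary(line):
    """Turn 'Summary: total flows: 22709, ...' into a dict of integers."""
    body = line.split(":", 1)[1]
    pairs = (item.rsplit(":", 1) for item in body.split(","))
    return {key.strip(): int(value) for key, value in pairs}


# Shortened copy of the sample file from the question.
sample_text = """Date first seen   Duration Proto  Src IP Addr:Port   Dst IP Addr:Port Packets Bytes Flows
2017-04-11 07:38:40.994  6.676 TCP  205.251.24.253:443 -> 205.166.231.250:24723  16  4758  1
Summary: total flows: 22709, total bytes: 300760728, total packets: 477467, avg bps: 1336661, avg pps: 265, avg bpp: 629
Time window: 2017-04-11 07:13:47 - 2017-04-11 07:43:47
"""

rows, footer = split_report(sample_text)
summary = parse_summary(footer[0])
print(summary["total bytes"])
```

The `rows` buffer can be handed to anything that accepts a file-like object, so the footer no longer has to be counted and skipped by hand.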

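Thanks for your help, I'll look into this more closely. Raw data comes in every 5 minutes, so nothing can be done statically; I may end up incorporating this. – k5man001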

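Is the data always in exactly the same format? If the batches have the same columns, you can append them to the dataframe. The useful thing about pandas is that it has many data processing functions that, thanks to the numpy module, are faster than looping. I think things like count, standard deviation, min and max would be useful for your purposes? – Impuls3H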
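A minimal sketch of both suggestions, assuming each 5-minute batch has already been loaded into a dataframe with the column names used above and a numeric `Bytes` column. The two small batches reuse values from the sample rows, and `Src IP` is a hypothetical helper column holding the address with the port stripped.

```python
import pandas as pd

# Two batches with the same columns, as they might arrive 5 minutes apart.
batch1 = pd.DataFrame({
    "Src IP Addr:Port": ["172.217.9.133:443", "205.251.24.253:443"],
    "Bytes": [3300000, 4758],
})
batch2 = pd.DataFrame({
    "Src IP Addr:Port": ["205.251.24.253:443", "52.197.234.56:443", "52.197.234.56:443"],
    "Bytes": [4758, 7273, 7273],
})

# Append the batches into one dataframe.
df = pd.concat([batch1, batch2], ignore_index=True)

# Strip the port so rows are grouped by address only.
df["Src IP"] = df["Src IP Addr:Port"].str.rsplit(":", n=1).str[0]

# Occurrence count and byte statistics per source IP, most frequent first.
stats = (
    df.groupby("Src IP")["Bytes"]
    .agg(["count", "sum", "min", "max", "std"])
    .sort_values("count", ascending=False)
)
print(stats)
```

The `count` column plays the role of frequency4 and `sum` is the total bytes, so both figures come out of one groupby. `std` is NaN for an IP that appears only once.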
