Creating duplicate rows in a csv file to separate multiple values in a column (Python)

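I want to write some Python code that splits multiple values in a column into separate rows, and fills the Active-Tickets column with a count aggregated over all rows whose timestamp falls on the same day. Is there a built-in library I can use, or do I need to install an external one?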

My sample file (for now, the Active-Tickets column is empty):

Input.csv

Timestamp,CaseID,Active-Tickets 
14FEB2017:10:55:23,K456 G578 T213,   
13FEB2017:10:56:12,F891 A63, 
14FEB2017:11:59:14,T427 T31212 F900000, 
15FEB2017:03:55:23,K456 G578 T213,   
14FEB2017:05:56:12,F891 A63,

What I want to achieve:

Output.csv

Timestamp,CaseID,Active-Tickets 
14FEB2017:10:55:23,K456,8 (because there are 8 cases happened on the same day) 
14FEB2017:10:55:23,G578,8 
14FEB2017:10:55:23,T213,8   
13FEB2017:10:56:12,F891,2 (because there are 2 cases happened on the same day) 
13FEB2017:10:56:12,A63,2 
14FEB2017:11:59:14,T427,8 
14FEB2017:11:59:14,T31212,8 
14FEB2017:11:59:14,F900000,8 
15FEB2017:03:55:23,K456,3 (because there are 3 cases happened on the same day) 
15FEB2017:03:55:23,G578,3 
15FEB2017:03:55:23,T213,3   
14FEB2017:05:56:12,F891,8 
14FEB2017:05:56:12,A63,8

My idea is:

Take the values for the column Timestamp

Check if the date is the same,

Store all of the CaseID separated by space into a list based on the date,

Count the number of elements in the list for each date, then

Return the values for the counted elements into Active-Tickets .

The problem is that the data volume is not small. Assuming at least 50 cases per day, I don't think my approach is feasible.

Source
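The steps listed in the question can be sketched with the standard library alone. This is a minimal in-memory version, assuming the whole file fits in memory; the function name `expand_rows` and the inline sample are illustrative and not part of the original post.

```python
import csv
import io
from collections import defaultdict

def expand_rows(rows):
    """Split the CaseID column on whitespace and attach the per-day case count."""
    rows = list(rows)  # this simple version keeps every row in memory
    cases_by_day = defaultdict(list)
    for timestamp, case_ids, *_ in rows:
        day = timestamp.split(':')[0]  # '14FEB2017:10:55:23' -> '14FEB2017'
        cases_by_day[day].extend(case_ids.split())
    for timestamp, case_ids, *_ in rows:
        day = timestamp.split(':')[0]
        for case_id in case_ids.split():
            yield [timestamp, case_id, len(cases_by_day[day])]

# Illustrative sample (a subset of Input.csv)
sample = """Timestamp,CaseID,Active-Tickets
14FEB2017:10:55:23,K456 G578 T213,
13FEB2017:10:56:12,F891 A63,
"""
reader = csv.reader(io.StringIO(sample))
header = next(reader)
expanded = list(expand_rows(reader))
```

Storing only a counter per day, rather than the full list of CaseIDs, would give the same counts with less memory.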

2017-04-26 yunaranyancat

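I would use a hash (dict) indexed by the date field, whose values are themselves hashes indexed by CaseID with a number as the value. That should be easy to implement with defaultdict, as long as it fits in memory. If your data is really large, you could look at the 'shelve' or 'sqlite3' modules. But the question is currently rather broad... –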
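A minimal sketch of the nested structure described in this comment, assuming the rows have already been read into `(timestamp, case_ids)` pairs. The names `counts` and `cases_per_day` and the sample rows are illustrative.

```python
from collections import defaultdict

# counts[day][case_id] -> number of times that case appears on that day
counts = defaultdict(lambda: defaultdict(int))

# Illustrative rows taken from the sample input
rows = [
    ("14FEB2017:10:55:23", "K456 G578 T213"),
    ("13FEB2017:10:56:12", "F891 A63"),
    ("14FEB2017:05:56:12", "F891 A63"),
]
for timestamp, case_ids in rows:
    day = timestamp.split(":")[0]
    for case_id in case_ids.split():
        counts[day][case_id] += 1

# The total number of cases on a day is the sum of the inner counts
cases_per_day = {day: sum(inner.values()) for day, inner in counts.items()}
```

If only the per-day total is needed, as in the desired Output.csv, the inner dict can be replaced by a single integer per day.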

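@SergeBallesta Thanks. I'll take a look at those modules. The data is about 2 GB per file, so I can only test with samples of a few thousand rows. When you mention the sqlite3 module, does that mean I need to create a database for the csv file? – yunaranyancat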

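It depends on whether the input file is sorted by date/timestamp. If it is, you only need to keep one day in memory, so just forget about shelve or sqlite3. If the order is random, you can either run an external sort program first to fall back to the first case, or store the data in an on-disk database: load the relevant parts of the csv file into an sqlite3 database and use a query with 'GROUP BY dat, caseid'. –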
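A rough sketch of the sqlite3 route for unsorted input. The table name `cases` and its columns are made up for illustration, and the query groups by day only, because the desired output needs one count per day. Adding `caseid` to the `GROUP BY` would give per-case counts instead.

```python
import csv
import io
import sqlite3

# Illustrative sample (a subset of Input.csv)
sample = """Timestamp,CaseID,Active-Tickets
14FEB2017:10:55:23,K456 G578 T213,
13FEB2017:10:56:12,F891 A63,
14FEB2017:05:56:12,F891 A63,
"""

# ':memory:' keeps the demo self-contained; pass a file path for large data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (ts TEXT, day TEXT, caseid TEXT)")

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row
conn.executemany(
    "INSERT INTO cases VALUES (?, ?, ?)",
    ((row[0], row[0].split(":")[0], case_id)
     for row in reader
     for case_id in row[1].split()),
)

# One count per day
per_day = dict(conn.execute("SELECT day, COUNT(*) FROM cases GROUP BY day"))

# Expanded rows with the per-day count attached, in insertion order
output = conn.execute(
    "SELECT c.ts, c.caseid, d.n FROM cases c "
    "JOIN (SELECT day, COUNT(*) AS n FROM cases GROUP BY day) d "
    "ON c.day = d.day ORDER BY c.rowid"
).fetchall()
```

With an on-disk database the rows never need to be in memory at once, at the cost of writing the expanded data to disk first.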

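Here is one way to do it using itertools.chain.from_iterable(). It keeps only the counts in memory, so it may work for your case. It reads the csv file twice: once to get the counts, and once to write the output. Since reading is done only through iterators, memory use should stay bounded by the number of distinct dates.

Code: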

# Python 2 code: note the 'rU' and 'wb' file modes
import csv
import itertools as it
from collections import Counter

# read through file and get counts per date
with open('test.csv', 'rU') as f:
    reader = csv.reader(f)
    header = next(reader)
    # emit the date once per CaseID so Counter tallies cases per day
    dates = it.chain.from_iterable(
        [date for _ in ids.split()]
        for date, ids in ((x[0].split(':')[0], x[1]) for x in reader))
    counts = Counter(dates)

# read through file again, and output as individual records with counts
with open('test.csv', 'rU') as f:
    reader = csv.reader(f)
    header = next(reader)
    # one (timestamp, case_id) tuple per CaseID in each row
    records = it.chain.from_iterable(
        [(l[0], d) for d in l[1].split()] for l in reader)
    # append the per-day count, looked up by the date part of the timestamp
    new_lines = (l + (str(counts[l[0].split(':')[0]]),) for l in records)

    with open('test2.csv', 'wb') as f_out:
        writer = csv.writer(f_out)
        writer.writerow(header)
        writer.writerows(new_lines)
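For Python 3, where the csv module expects text-mode files opened with `newline=''`, the same two-pass idea might look like the sketch below. The function name `expand_csv` and the `io.StringIO` demo are illustrative and not part of the original answer.

```python
import csv
import io
from collections import Counter

def day_of(timestamp):
    """'14FEB2017:10:55:23' -> '14FEB2017'"""
    return timestamp.split(':')[0]

def expand_csv(f_in, f_out):
    """Two passes over a seekable text file; only per-day counts stay in memory."""
    # First pass: count the cases per day
    counts = Counter()
    reader = csv.reader(f_in)
    next(reader)  # skip header
    for row in reader:
        if len(row) >= 2:  # ignore blank or malformed lines
            counts[day_of(row[0])] += len(row[1].split())

    # Second pass: write one output row per CaseID
    f_in.seek(0)
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    writer.writerow(next(reader))  # copy the header
    for row in reader:
        if len(row) >= 2:
            for case_id in row[1].split():
                writer.writerow([row[0], case_id, counts[day_of(row[0])]])

# Demo on an in-memory file (a subset of Input.csv)
source = io.StringIO(
    "Timestamp,CaseID,Active-Tickets\n"
    "14FEB2017:10:55:23,K456 G578 T213,\n"
    "13FEB2017:10:56:12,F891 A63,\n"
    "14FEB2017:05:56:12,F891 A63,\n"
)
target = io.StringIO()
expand_csv(source, target)
lines = target.getvalue().splitlines()
```

With real files, the arguments would be opened as `open('test.csv', newline='')` and `open('test2.csv', 'w', newline='')`.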

Result:

Timestamp,CaseID,Active-Tickets 
14FEB2017:10:55:23,K456,8 
14FEB2017:10:55:23,G578,8 
14FEB2017:10:55:23,T213,8 
13FEB2017:10:56:12,F891,2 
13FEB2017:10:56:12,A63,2 
14FEB2017:11:59:14,T427,8 
14FEB2017:11:59:14,T31212,8 
14FEB2017:11:59:14,F900000,8 
15FEB2017:03:55:23,K456,3 
15FEB2017:03:55:23,G578,3 
15FEB2017:03:55:23,T213,3 
14FEB2017:05:56:12,F891,8 
14FEB2017:05:56:12,A63,8

Counter in 2.6

collections.Counter has been backported to Python 2.5+ (Here)

Source

2017-04-27 00:19:08

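Hi, I get the following error: '_csv.Error: line contains NULL byte'. Does this mean my CaseID column can't contain empty data? Should I first remove the rows whose CaseID is empty and write a new csv file containing only non-NULL CaseID data before running this code? Or is that not what is causing the problem? – yunaranyancat

I've figured it out and solved my problem. Thanks! – yunaranyancat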
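The thread does not say how the error was resolved. One common cause of this message is literal NUL characters in the file rather than empty fields, for example when the file was exported in UTF-16. A hedged sketch of one workaround, assuming the NUL characters are stray and safe to drop (the helper name `without_nul` is made up here):

```python
import csv
import io

def without_nul(lines):
    """Yield each line with NUL characters removed before csv.reader sees it."""
    for line in lines:
        yield line.replace('\0', '')

# Illustrative input containing a stray NUL character inside a CaseID
raw = io.StringIO(
    "Timestamp,CaseID,Active-Tickets\n"
    "14FEB2017:10:55:23,K4\x0056 G578,\n"
)
rows = list(csv.reader(without_nul(raw)))
```

If the file really is UTF-16, opening it with the matching `encoding` argument is likely a better fix than stripping characters.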
