2017-04-10 61 views

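R: iterating through a 50k-row data frame takes very long

I am writing a simple program that should split a .tsv file into multiple .csv files. The problem is that it takes far too long (I think ~9 minutes for 50k rows is terrible performance). Could someone look at my code and tell me what I am doing wrong?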

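I have a table that contains the name of the participant, the name of the media file, a timestamp, and some coordinate data. My data can contain one or more participants, and each participant works with two media files. I want to create one csv file for each media file that a specific participant worked with.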

For example, I have 2 participants, P1 and P2, and each worked with the media files M1 and M2. So I want to create P1_M1.csv, P1_M2.csv, P2_M1.csv, P2_M2.csv.

The data looks like this:

P1 | M1 | data... 
P1 | M1 | data... 
... 
P1 | M2 | data... 
... 
P2 | M1 | data... 
... 
... 

Here is my code:

data = read.table("./data.tsv", header = T, sep = "\t", stringsAsFactors = F) # load data from tsv 

# function for creating csv file 
writeData = function(filename, d){ 
    filename = paste("./", filename, ".csv", sep = "") 
    write.csv(d, file = filename, row.names = F) 
} 

# initialize auxiliary variables 
participantName = "" 
mediaName = "" 
# initialize empty dataframe 
subdata <- data.frame(TimeStamp = numeric(), GazeLeftX = integer(), GazeLeftY = integer(), GazeRightX = integer(), GazeRightY = integer()) 

# for each row in original data... 
for(r in 1:nrow(data)) 
{ 
    # check whether the participant on the current row differs from the last one
    if(participantName != data[r, 'ParticipantName']){
        # check that the last participant is not empty (i.e. a participant was already processed)
        if(participantName != ""){
            # if so, that participant's work on the media file has ended, so write the data to csv
            writeData(filename = paste(participantName,"_",mediaName, sep = ""), d = subdata)
            # empty the auxiliary dataframe and also mediaName
            subdata = subdata[0,]
            mediaName = ""
        }
        # a new participant was detected, so record it as the last participant
        participantName = data[r, 'ParticipantName']
    }
    # do the same checks for the media file, because the media file can change while the participant stays the same
    if(mediaName != data[r, 'MediaName']){
        if(mediaName != ""){
            writeData(filename = paste(participantName,"_",mediaName, sep = ""), d = subdata)
            subdata = subdata[0,]
        }
        mediaName = data[r, 'MediaName']
    }
    # in every iteration append the current row to the auxiliary dataframe
    subdata = rbind(subdata, 
        data.frame(TimeStamp = data[r, 'EyeTrackerTimestamp'], 
        GazeLeftX = data[r, 'GazeLeftX'], 
        GazeLeftY = data[r, 'GazeLeftY'], 
        GazeRightX = data[r, 'GazeRightX'], 
        GazeRightY = data[r, 'GazeRightY'])) 
} 
# if there are any data left in auxiliary dataframe, save it to csv 
if(nrow(subdata) != 0){ 
    writeData(filename = paste(participantName,"_",mediaName, sep = ""), d = subdata) 
} 

See `?split`. Try for instance `split(data, data[, c("ParticipantName", "MediaName")])`. – nicola


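@nicola Thank you very much, that is great. If you want, you can post it as an answer and I will mark it as the solution. Now I have only one problem: my code creates just one csv file, but that is probably some silly mistake in my code :) – Gondil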

Answer


You are looking for ?split. Try for example:

split(data,data[,c("ParticipantName","MediaName")],drop=TRUE) 

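which will create a list containing one data.frame for each ParticipantName - MediaName pair. If you want to write each data frame to a separate file, you could try something like this: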

res<-split(data,data[,c("ParticipantName","MediaName")],drop=TRUE) 
Map(writeData,names(res),res) 

where writeData is the function you defined.
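To make the approach concrete, here is a minimal sketch on made-up toy data. Two things in it are assumptions on my part rather than something the question or answer states. First, when `split` is given several grouping columns it joins the level names with "." by default, so the list names would look like "P1.M1"; the sketch builds the grouping factor with `interaction(..., sep = "_")` to get names in the P1_M1 style the question asks for. Second, it keeps only the five columns the original loop kept. The toy values and the `out_dir` location are hypothetical.

```r
# Toy stand-in for the real data; column names follow the question,
# the values are made up for illustration.
data <- data.frame(
  ParticipantName     = c("P1", "P1", "P1", "P2", "P2", "P2"),
  MediaName           = c("M1", "M1", "M2", "M1", "M2", "M2"),
  EyeTrackerTimestamp = c(10, 20, 30, 40, 50, 60),
  GazeLeftX  = 1:6,
  GazeLeftY  = 7:12,
  GazeRightX = 13:18,
  GazeRightY = 19:24,
  stringsAsFactors = FALSE
)

# Keep only the columns the original loop kept, renaming the timestamp.
keep <- data.frame(
  TimeStamp = data$EyeTrackerTimestamp,
  data[, c("GazeLeftX", "GazeLeftY", "GazeRightX", "GazeRightY")]
)

# Build the grouping factor explicitly so the group names use "_"
# instead of the default "." separator; drop = TRUE skips empty pairs.
groups <- interaction(data$ParticipantName, data$MediaName,
                      sep = "_", drop = TRUE)
res <- split(keep, groups)

# out_dir is a hypothetical output location (a fresh temp directory here).
out_dir <- tempfile("split_csv_")
dir.create(out_dir)

# Write one csv file per participant/media pair.
for (nm in names(res)) {
  write.csv(res[[nm]],
            file = file.path(out_dir, paste0(nm, ".csv")),
            row.names = FALSE)
}
```

Unlike the row-by-row loop, this does not depend on the rows of each participant and media file being contiguous, and it avoids growing a data frame with `rbind` on every iteration, which copies the accumulated rows each time.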