2013-02-12 36 views
2

Concatenating structures whose fields are not the same: I have multiple CSV files that do not all share the same fields.

a.csv

field_a, field_b 
111,  121 
112,  122 

b.csv

field_a, field_c 
211,  231 
212,  232 

c.csv

field_a, field_b, field_c 
311,  321,  331 
312,  322,  332 

I want to concatenate them into:

output.csv

field_a,field_b,field_c 
111, 121, NA 
112, 122, NA 
211, NA,  231 
212, NA,  232 
311, 321, 331 
312, 322, 332 

I want to do this with Octave.

What I have done so far:

a=csv2cell('a.csv')                % file name must be a string
A=cell2struct(a(2:end,:),a(1,:),2) % field names run along the columns (dimension 2)

Now I am looking for something like

merge(A,B,C) or vertcat(A,B,C)

but I cannot get it to work so that all fields end up in the output.

With R I did it like this:

filelist <- list.files()
datas <- list()   # one data frame per file
merged <- NULL    # running result
for (i in seq_along(filelist)) {
    datas[[i]] <- read.csv(filelist[i])
    merged <- if (is.null(merged)) datas[[i]] else merge(merged, datas[[i]], all=TRUE)
}

But the for loop is terribly slow, so I am looking for a way to merge them all at once.

+2

Inefficient R code tends to be slow. This is not really a merge operation; it is a stacking operation. – 2013-02-12 22:20:32

+0

Yes, I did not know a better way. @Arun had a better idea. – telemachos 2013-02-13 20:48:25

Answers

0

How I finally did it:

With Octave (MATLAB):

% FileNames=readdir(pwd);
startdir = pwd;   % remember the starting directory
d=dir(pwd);

isDirIdx = [d.isdir];
names = {d.name};
FileNames = names(~isDirIdx);

fields = {};      % collects the field names of all files
for ii = 1:numel(FileNames)
    % Load csv to cell
    datas{ii}=csv2cell(FileNames{ii});
    % Then convert it to a struct array (the header row gives the field names)
    Datas{ii}=cell2struct(datas{ii}(2:end,:),datas{ii}(1,:),2);
    fields=[fields, fieldnames(Datas{ii})'];
    Datalength(ii)=numel(Datas{ii});   % number of records in this file
end

cd(startdir)

% add every field a file lacks, filled with NaN
for jj=1:numel(Datas)
    missing_fields{jj} = setdiff(fields,fieldnames(Datas{jj}));
    for kk=1:numel(missing_fields{jj})
        [Datas{jj}.(missing_fields{jj}{kk})]=deal(NaN);
    end
end

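The problem is that I did not see a simple way to export a struct to CSV, so I switched back to R. Because I do not have enough memory, I could not load all the files in R and export them as a single CSV. So first I export each NetCDF file to a CSV with exactly the same fields. Then I concatenate them with the Unix/GNU cat command.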

R:

# Converts all NetCDF files (*.nc) in a folder to ASCII (csv)
# when there is more than one, all csv files will have the same fields
# when a field is missing in one NetCDF file, this script adds 'NA' values

# it saves memory, because there is always only one NetCDF-File in the memory. 

# Needs package RNetCDF: 
# http://cran.r-project.org/web/packages/RNetCDF/index.html 
# load package 
library('RNetCDF') 

# get list of all NetCDF files to merge
filelist<-list.files(pattern="\\.nc$")

# initialise variable names
varnames_all<-character(0)
varnames_file<-vector("list", length(filelist))

n_files<-length(filelist) 
n_vars<-rep(NA,n_files) # initialise 

# get variables-names of each NetCDF file 
for (i in 1:n_files) { 
    ncfile<-open.nc(filelist[i]) # open nc file 
    print(paste(filelist[i],"opened!"))

    # get number of variable in the NetCDF 
    n_vars[i]<-file.inq.nc(ncfile)$nvars 
    varnames<-character(n_vars[i]) # initialise and clear

    # read every variable name
    # NetCDF variable ids start at 0, R indices at 1, hence j+1
    for (j in 0:(n_vars[i]-1)) {
        varnames[j+1]<-var.inq.nc(ncfile,j)$name
    }
    close.nc(ncfile) 
    varnames_file[[i]]<-varnames # add to the list of all files 
    varnames_all<-(c(varnames_all,varnames)) # concat to one array 
} 

varnames_all<-unique(varnames_all) # take every varname only once 
print("Existing variable names:") 
print(varnames_all) 

# initialise a data.frame for loading the NetCDF data
datas<-data.frame() 

for (i in 1:length(filelist)) { 
    print(filelist[i]) 
    ncfile<-open.nc(filelist[i]) # open nc file 
    print(paste("reading ", filelist[i], "...")) 
    datas<-as.data.frame(read.nc(ncfile)) #import data from ncfile as data frame 
    close.nc(ncfile) 

    # check which variables are missing
    missing_vars<-setdiff(varnames_all,colnames(datas))

    # add missing variables as columns filled with NA
    datas[missing_vars]<-NA
    print(paste("writing ", filelist[i], " to ", filelist[i],".csv ...", sep=""))

    # reorder the columns in the same way as in the array varnames_all
    datas<-datas[varnames_all] 

    # Write File 
    write.csv(datas,file=paste(filelist[i],".csv", sep="")) 

    # clear Memory 
    rm(datas) 
} 

Then the cat is straightforward:

#!/bin/bash
# Concatenate csv files which have exactly the same fields

# Work in the directory given as first argument, else in the current directory
if [ $# -gt 0 ]; then
    cd "$1" || exit 1
fi

# get a list of all data files (a glob copes with spaces in file names)
datafile_array=(*)
echo "copying files ..."
echo "copying file: ${datafile_array[0]}"

# copy the first file completely, including its header line
cat "./${datafile_array[0]}" > ../outputCat.csv
for ((i=1; i<${#datafile_array[@]}; i++))
do
    echo "copying file ${datafile_array[$i]}"
    # skip the header line of every further file
    tail -n +2 "./${datafile_array[$i]}" >> ../outputCat.csv
done
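Concatenating with cat only gives a valid CSV if every file really has the same header line. The following is a minimal sketch of a guard for that assumption, not part of the original script: cat_same_header is a hypothetical helper name, and it compares header lines as plain text.

```shell
#!/bin/bash
# Sketch: concatenate CSV files only if their header lines are identical.
# cat_same_header is a hypothetical helper name used for illustration.
cat_same_header() {
    local first_header header f
    first_header=$(head -n 1 "$1") || return 1
    # refuse to continue if any file has a different header line
    for f in "$@"; do
        header=$(head -n 1 "$f") || return 1
        if [ "$header" != "$first_header" ]; then
            echo "header mismatch in $f" >&2
            return 1
        fi
    done
    # print the header once, then the data rows of every file
    printf '%s\n' "$first_header"
    for f in "$@"; do
        tail -n +2 "$f"
    done
}
```

It could be called as cat_same_header *.csv > ../outputCat.csv, writing the result outside the directory so that the output file is not picked up by the glob.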
4

rbind.fill from the plyr package should handle exactly this:

require(plyr) 
rbind.fill(a,b,c) 

# field_a field_b field_c 
# 1  111  121  NA 
# 2  112  122  NA 
# 3  211  NA  231 
# 4  212  NA  232 
# 5  311  321  331 
# 6  312  322  332 
+0

I think this works in general, but I have a problem with my memory size and it fails with an error. I will have to look for a better machine... – telemachos 2013-02-13 20:46:35

+1

How large is your data? How much memory do you have? Are you running a 32-bit version of R? (Use 'sessionInfo()' to find out the last part.) – Arun 2013-02-13 20:49:53

+1

If memory is the problem, you should consider using a database and accessing the data through the sqldf package. – 2013-02-13 20:52:13

1

I don't know Octave, but in Matlab I would use fieldnames and the set functions.

The pseudocode goes like this:

all_fields = union of fieldnames(a), fieldnames(b) and fieldnames(c) 
for each variable: 
    missing_fields = setdiff(all_fields,fieldnames) 
    add the missing fields 
then join 
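The same union / setdiff / join idea can also be applied directly to the CSV files without loading them into memory. The following is a rough sketch under stated assumptions, not the answerer's Matlab method: it uses awk in two passes, stack_csv is a hypothetical helper name, the files are assumed to be simple CSVs (comma-separated, no quoted fields containing commas), and file names must not contain '='.

```shell
#!/bin/bash
# Sketch: stack CSV files with different headers, filling gaps with NA.
# stack_csv is a hypothetical helper name used for illustration.
stack_csv() {
    awk -F',' '
        # trim() is a small helper defined for this sketch
        function trim(s) { gsub(/^[ \t\r]+|[ \t\r]+$/, "", s); return s }

        # Pass 1 (header lines only): collect the union of field names,
        # in order of first appearance.
        pass == 1 {
            if (FNR == 1) {
                for (i = 1; i <= NF; i++) {
                    name = trim($i)
                    if (!(name in seen)) { seen[name] = 1; all[++n] = name }
                }
            }
            next
        }

        # Pass 2, header line: print the unified header once and remember
        # which column of the current file holds which field.
        FNR == 1 {
            if (!printed) {
                header = all[1]
                for (k = 2; k <= n; k++) header = header "," all[k]
                print header
                printed = 1
            }
            split("", col)   # reset the column map for this file
            for (i = 1; i <= NF; i++) col[trim($i)] = i
            next
        }

        # skip empty lines
        /^[ \t\r]*$/ { next }

        # Pass 2, data line: emit the values in unified column order,
        # writing NA for fields this file does not have.
        {
            row = ""
            for (k = 1; k <= n; k++) {
                v = (all[k] in col) ? trim($(col[all[k]])) : "NA"
                row = row (k > 1 ? "," : "") v
            }
            print row
        }
    ' pass=1 "$@" pass=2 "$@"
}
```

With the three files from the question this could be called as stack_csv a.csv b.csv c.csv > output.csv. Only one line is held in memory at a time, at the cost of reading every file twice.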
+0

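OK, I think this could also be a good way to give the structures identical fields without concatenating them (because of the memory size). – telemachos 2013-02-14 08:50:37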

+0

But this leads to another problem: [setting multiple fields at once](http://stackoverflow.com/q/14870828/1842684) – telemachos 2013-02-14 08:50:58