2017-10-13

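Specify column types in read_csv by using values in a dataframe

I am trying to read in a directory of multiple csv files, each with roughly 7K+ rows and ~1800 columns. I have a data dictionary that I can read into a data frame; each row of the dictionary identifies a variable (column) name along with its data type.

Looking at ?read_csv in the readr package, it is possible to specify the column types. However, given that I have nearly 1800 columns to specify, I was hoping to use the information in the data dictionary to build the column/type pairs in the format the function requires.

An alternative, less desirable approach would be to read every column in as character and then convert columns by hand as needed.

Any help you can provide on how to specify the column types would be greatly appreciated.

If it helps, here is my code to get the data dictionary and coerce it into the format I am referring to.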

## Load libraries
library(dplyr)
library(readxl)

## Get the data dictionary
URL = "https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx"
download.file(URL, destfile="raw-data/dictionary.xlsx")

## read in the dictionary to get the variables
dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary")
colnames(dict) = tolower(gsub(" ", "_", colnames(dict)))
dict = dict %>% filter(!is.na(variable_name))

## create a data dictionary
## https://stackoverflow.com/questions/46738968/specify-column-types-in-read-csv-by-using-values-in-a-dataframe/46742411#46742411
dict <- dict %>%
  mutate(variable_type = case_when(
    api_data_type == "integer" ~ "i",
    api_data_type == "autocomplete" ~ "c", # assumption that this is a string
    api_data_type == "string" ~ "c",
    api_data_type == "float" ~ "d"
  ))

This returns:

> ## read in the dictionary to get the variables 
> dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary") 
> colnames(dict) = tolower(gsub(" ", "_", colnames(dict))) 
> dict = dict %>% filter(!is.na(variable_name)) 
> dict <- dict %>% mutate(variable_type = case_when(api_data_type == "integer" ~ "i", 
+             api_data_type == "autocomplete" ~ "c", #assumption that this is a string 
+             api_data_type == "string" ~ "c", 
+             api_data_type == "float" ~ "d")) 
Error: object 'api_data_type' not found 

And my sessionInfo:

> sessionInfo() 
R version 3.3.1 (2016-06-21) 
Platform: x86_64-apple-darwin13.4.0 (64-bit) 
Running under: OS X 10.11.6 (El Capitan) 

locale: 
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 

attached base packages: 
[1] stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] stringr_1.2.0 readxl_0.1.1 readr_1.1.0 dplyr_0.5.0 

loaded via a namespace (and not attached): 
[1] rjson_0.2.15 lazyeval_0.2.0 magrittr_1.5 R6_2.2.2  assertthat_0.1 hms_0.2  DBI_0.7  tools_3.3.1 
[9] tibble_1.2  yaml_2.1.14 Rcpp_0.12.11 stringi_1.1.5 jsonlite_1.5 

I will post a "fully" reproducible solution shortly. – Jas

Maybe you have to upgrade your dplyr version. I have v0.7.4. – Jas

Answer

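You can use a combination of mutate and case_when to map the api_data_type column to the compact string representation, in which each column type is represented by a single letter: c = character, i = integer, n = number, d = double, l = logical, etc. The resulting character vector can then be used for the col_types argument of read_csv (collapsed into a single string, since the compact form is one string with one letter per column).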

## Load libraries 
library(dplyr) 
library(readxl) 

## Get the data dictionary 
URL = "https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx" 
download.file(URL, destfile="raw-data/dictionary.xlsx") 

## read in the dictionary to get the variables 
dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary") 
colnames(dict) = tolower(gsub(" ", "_", colnames(dict))) 
dict = dict %>% filter(!is.na(variable_name)) 

unique(dict$api_data_type) 
#> [1] "integer"  "autocomplete" "string"  "float" 

dict <- dict %>%
  mutate(variable_type = case_when(
    api_data_type == "integer" ~ "i",
    api_data_type == "autocomplete" ~ "c", # assumption that this is a string
    api_data_type == "string" ~ "c",
    api_data_type == "float" ~ "d"
  ))
cnames <- dict %>% select(variable_name) %>% pull 
head(cnames) 
#> [1] "UNITID" "OPEID" "OPEID6" "INSTNM" "CITY" "STABBR" 
ctypes <- dict %>% select(variable_type) %>% pull 
head(ctypes) 
#> [1] "i" "i" "i" "c" "c" "c" 
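
The two vectors still need to be handed to read_csv. Below is a minimal sketch of two ways that might be done, assuming the dictionary's variable names match the CSV headers. The toy cnames/ctypes values and the temporary file are made up for illustration so the snippet runs on its own; with the real data, the vectors built above would take their place.

```r
library(readr)

# Toy stand-ins for the cnames / ctypes vectors built above (hypothetical values)
cnames <- c("UNITID", "INSTNM", "ADM_RATE")
ctypes <- c("i", "c", "d")

# Small temporary CSV so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("UNITID,INSTNM,ADM_RATE",
             "1,College A,0.5",
             "2,College B,0.25"),
           csv_path)

# Option 1: one compact string, one letter per column.
# This relies on the dictionary order matching the file's column order.
by_position <- read_csv(csv_path, col_types = paste(ctypes, collapse = ""))

# Option 2: a cols() specification keyed by column name,
# so matching is done by name rather than by position.
spec <- do.call(cols, as.list(setNames(ctypes, cnames)))
by_name <- read_csv(csv_path, col_types = spec)
```

The named specification is probably the safer choice for ~1800 columns, because a single missing or reordered column would shift every type in the compact string. The same spec object can be reused for each file in the directory.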

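See the update above. I tried to extend the code you gave me in your suggestion and got an error, but I was not aware of `case_when`, so +100 for this use case. – Btibert3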

No problem, try again with this fully reproducible example. Remember to restart your session before running the code. – Jas

I ran into issues with columns not matching the data dictionary, but this was very helpful. Much appreciated. – Btibert3
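
For the mismatch mentioned in the last comment, one possible workaround (an assumption, not something taken from the answer) is to read each file's header first, keep only the dictionary entries whose names actually appear there, and let any remaining columns default to character. The vectors and the temporary file below are hypothetical stand-ins.

```r
library(readr)

# Hypothetical dictionary vectors; "NOT_IN_FILE" has no matching column
cnames <- c("UNITID", "INSTNM", "NOT_IN_FILE")
ctypes <- c("i", "c", "d")

# Temporary CSV with a column ("EXTRA_COL") the dictionary does not know about
csv_path <- tempfile(fileext = ".csv")
writeLines(c("UNITID,INSTNM,EXTRA_COL",
             "1,College A,x",
             "2,College B,y"),
           csv_path)

# Read a single row with base R just to learn the file's real column names
file_cols <- names(utils::read.csv(csv_path, nrows = 1, check.names = FALSE))

# Keep only dictionary entries that exist in this file
keep <- cnames %in% file_cols

# Named spec for the known columns; everything else is read as character
spec <- do.call(cols, c(as.list(setNames(ctypes[keep], cnames[keep])),
                        list(.default = col_character())))

dat <- read_csv(csv_path, col_types = spec)
```

Columns that fall through to the character default can then be inspected and converted individually, which keeps the manual work limited to the columns the dictionary does not cover.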
