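Specify column types in read_csv by using values in a dataframe
I am trying to read in a directory of multiple CSV files, each with roughly 7K+ rows and ~1800 columns. I have a data dictionary that I can read into a data frame, where each row of the dictionary identifies a variable (column) name along with its data type.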
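For the directory part of the task, a minimal sketch of reading every CSV with one shared column specification might look like this. The directory, file names, columns, and values are all hypothetical stand-ins written to a temporary folder so the snippet runs on its own; the real `spec` would come from the data dictionary.

```r
library(readr)
library(dplyr)

# Hypothetical directory with two tiny CSV files (made-up values)
csv_dir <- file.path(tempdir(), "csv-demo")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("id,score", "1,0.5", "2,0.7"), file.path(csv_dir, "a.csv"))
writeLines(c("id,score", "3,0.9"), file.path(csv_dir, "b.csv"))

# Collect the full paths of every .csv file in the directory
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# One specification reused for every file
spec <- cols(id = col_integer(), score = col_double())

# Read each file with the same col_types and name the results by file
tables <- lapply(files, read_csv, col_types = spec)
names(tables) <- basename(files)

# Stack the tables, recording the source file in a new column
combined <- bind_rows(tables, .id = "source_file")
```

Reusing a single `spec` keeps the column types identical across files, which is what makes the final `bind_rows()` safe.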
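Looking at ?read_csv in the readr package, column types can be specified. However, given that I have nearly 1800 columns to specify, I was hoping to use the information in the available data dictionary to build the column/type pairs in the format the function requires.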
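One possible shape for this, as a sketch rather than a confirmed solution: turn the dictionary into a named list of readr's one-letter type codes and pass it to `cols()`. The three-row `dict` below is a hypothetical miniature of the real dictionary, the CSV values are made up, and `.default = col_character()` is an assumption about how columns missing from the dictionary should be handled.

```r
library(readr)

# Hypothetical miniature of the data dictionary: one row per column
dict <- data.frame(
  variable_name = c("UNITID", "INSTNM", "ADM_RATE"),
  variable_type = c("i", "c", "d"),
  stringsAsFactors = FALSE
)

# Named list: names are column names, values are one-letter type codes
type_list <- as.list(setNames(dict$variable_type, dict$variable_name))

# cols() accepts the abbreviations; .default (an assumption here) covers
# any column that the dictionary does not mention
spec <- do.call(cols, c(type_list, list(.default = col_character())))

# Tiny made-up CSV written to a temporary file so the sketch is runnable
tmp <- tempfile(fileext = ".csv")
writeLines(c("UNITID,INSTNM,ADM_RATE", "1,Example College,0.5"), tmp)

dat <- read_csv(tmp, col_types = spec)
```

Rows of the dictionary whose `variable_type` is `NA` would presumably need to be dropped or given a type before building `type_list`.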
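A less desirable alternative would be to read every column in as character and then modify the types by hand as needed.
Any help you can provide on how to specify the column types would be greatly appreciated.
If it helps, here is my code to get the data dictionary and coerce it into the format I am referring to.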
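For reference, that fallback could be sketched as follows, assuming readr's `type_convert()` is an acceptable way to do the second step. The column names and values are made up for illustration.

```r
library(readr)

# Tiny made-up CSV written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("UNITID,INSTNM,ADM_RATE",
             "1,Example College,0.5",
             "2,Sample University,0.75"), tmp)

# Step 1: read every column as character
raw <- read_csv(tmp, col_types = cols(.default = col_character()))

# Step 2: re-parse only the columns that should not stay character
converted <- type_convert(
  raw,
  col_types = cols(UNITID = col_integer(), ADM_RATE = col_double())
)
```

With ~1800 columns the second step still needs a full list of types, so this mainly postpones the problem instead of solving it.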
## Packages used below
library(readxl)
library(dplyr)
## Get the data dictionary
URL = "https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx"
download.file(URL, destfile="raw-data/dictionary.xlsx")
## read in the dictionary to get the variables
dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary")
colnames(dict) = tolower(gsub(" ", "_", colnames(dict)))
dict = dict %>% filter(!is.na(variable_name))
## create a data dictionary
## https://stackoverflow.com/questions/46738968/specify-column-types-in-read-csv-by-using-values-in-a-dataframe/46742411#46742411
dict <- dict %>% mutate(variable_type = case_when(api_data_type == "integer" ~ "i",
                                                  api_data_type == "autocomplete" ~ "c", #assumption that this is a string
                                                  api_data_type == "string" ~ "c",
                                                  api_data_type == "float" ~ "d"))
This returns:
> ## read in the dictionary to get the variables
> dict = read_excel("raw-data/dictionary.xlsx", sheet = "data_dictionary")
> colnames(dict) = tolower(gsub(" ", "_", colnames(dict)))
> dict = dict %>% filter(!is.na(variable_name))
> dict <- dict %>% mutate(variable_type = case_when(api_data_type == "integer" ~ "i",
+ api_data_type == "autocomplete" ~ "c", #assumption that this is a string
+ api_data_type == "string" ~ "c",
+ api_data_type == "float" ~ "d"))
Error: object 'api_data_type' not found
And my sessionInfo:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.6 (El Capitan)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.2.0 readxl_0.1.1 readr_1.1.0 dplyr_0.5.0
loaded via a namespace (and not attached):
[1] rjson_0.2.15 lazyeval_0.2.0 magrittr_1.5 R6_2.2.2 assertthat_0.1 hms_0.2 DBI_0.7 tools_3.3.1
[9] tibble_1.2 yaml_2.1.14 Rcpp_0.12.11 stringi_1.1.5 jsonlite_1.5
I will post a "fully" reproducible solution shortly. – Jas
Maybe you have to upgrade your dplyr version. I have v0.7.4. – Jas
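If upgrading is not an option, the same mapping can be sketched without `case_when()` at all, using a named lookup vector in base R. This sidesteps any difference in how `case_when()` resolves column names across dplyr versions. The four-row `dict` is a hypothetical stand-in for the real dictionary, and its type assignments are assumptions for illustration.

```r
# Hypothetical stand-in for the dictionary after the column names are cleaned
dict <- data.frame(
  variable_name = c("UNITID", "INSTNM", "ADM_RATE", "CITY"),
  api_data_type = c("integer", "autocomplete", "float", "string"),
  stringsAsFactors = FALSE
)

# Lookup from API data type to readr's one-letter column type code
type_map <- c(integer = "i", autocomplete = "c", string = "c", float = "d")

# Index the lookup by each row's api_data_type; unmatched or NA types give NA
dict$variable_type <- unname(type_map[dict$api_data_type])
```

The resulting `variable_type` column holds the same codes the `case_when()` call was meant to produce.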