Is there a more efficient way to split a column?

When I run the following read.table call, several values are imported incorrectly:

hs.industry <- read.table("https://download.bls.gov/pub/time.series/hs/hs.industry", header = TRUE, fill = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE)

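Specifically, for a few rows the industry_code and industry_name values end up combined into a single value in the industry_code column (I don't know why). Since every industry_code is 4 digits long, my approach to splitting and correcting them is: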

for (i in 1:nrow(hs.industry)) {
    if (isTRUE(nchar(hs.industry$industry_code[i]) > 4)) {
        hs.industry$industry_name[i] <- gsub("[[:digit:]]", "", hs.industry$industry_code[i])
        hs.industry$industry_code[i] <- gsub("[^0-9]", "", hs.industry$industry_code[i])
    }
}

I feel this is very inefficient, but I don't know what approach would be better.

Thanks!


2017-03-06 Michael

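The problem is that lines 29 and 30 of the file (lines 28 and 29 if we don't count the header) are malformed. They use 4 spaces instead of the proper tab character, so some extra data cleaning is needed.

Use readLines to read in the raw text, correct the formatting errors, and then read in the cleaned table: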

# read in each line of the file as an element of a character vector
hs.industry <- readLines('https://download.bls.gov/pub/time.series/hs/hs.industry')

# replace any run of 4 or more non-word characters (here, the stray spaces) with a tab character
hs.industry <- gsub('\\W{4,}', '\t', hs.industry)

# collapse the lines into a single string, separated by newline characters (\n)
hs.industry <- paste(hs.industry, collapse = '\n')

# read in the cleaned table
hs.industry <- read.table(text = hs.industry, sep = '\t', header = TRUE, quote = '')
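
To try the same clean-up without downloading the file, here is a minimal sketch on a small inline sample. The sample rows, the two-column layout, and the names raw_lines, fixed_lines and sample_table are assumptions made for illustration, not taken from the BLS file. It also swaps in a literal-space pattern (" {4,}") for the '\\W{4,}' used above, on the assumption that only runs of spaces need repairing.

```r
# Hypothetical sample: the last line uses four spaces instead of a tab
raw_lines <- c(
  "industry_code\tindustry_name",
  "1411\tDimension stone",
  "1422    Crushed and broken stone"
)

# Replace each run of four or more spaces with a single tab
fixed_lines <- gsub(" {4,}", "\t", raw_lines)

# Collapse to one newline-separated string and parse it as a table
sample_table <- read.table(
  text = paste(fixed_lines, collapse = "\n"),
  sep = "\t", header = TRUE, quote = "",
  stringsAsFactors = FALSE
)

print(sample_table)
```

Matching only literal spaces is a narrower rule than '\\W{4,}', so it should leave other punctuation inside industry names untouched.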


2017-03-06 18:45:48 jdobres

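Thanks! Could you explain why the collapse is necessary? – Michael

When you use read.table with the "text" argument, the text must be a single string, not a list of strings. So we collapse the list of strings (where each item represents one line of the original text), joining them with newline characters. – jdobres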

You don't need to loop over every row. Instead, identify only the problematic entries and apply gsub to just those:

replace_indx <- which(nchar(hs.industry$industry_code) > 4) 
hs.industry$industry_name[replace_indx] <- gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx]) 
hs.industry$industry_code[replace_indx] <- gsub("\\D+", "", hs.industry$industry_code[replace_indx])

I also used "\\d+\\s+" to improve the string replacement, so that the whitespace after the digits is removed as well:

gsub("[[:digit:]]","",hs.industry$industry_code[replace_indx]) 
# [1] " Dimension stone"   " Crushed and broken stone" 

gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx]) 
# [1] "Dimension stone"   "Crushed and broken stone"
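
The vectorized approach can be checked on a small made-up data frame. This is only a sketch: the data frame demo, its rows, and the index name bad are hypothetical. It also uses anchored sub() patterns instead of the gsub() calls above, an assumption that guards against industry names which themselves contain digits.

```r
# Hypothetical rows: the last two have code and name fused in industry_code
demo <- data.frame(
  industry_code = c("1011", "1411 Dimension stone", "1422 Crushed and broken stone"),
  industry_name = c("Iron ores", NA, NA),
  stringsAsFactors = FALSE
)

# Indices of the malformed rows (codes should be exactly 4 characters)
bad <- which(nchar(demo$industry_code) > 4)

# Fill in the name first, while the fused value is still available
demo$industry_name[bad] <- sub("^\\d+\\s+", "", demo$industry_code[bad])

# Then keep only the leading 4-digit code
demo$industry_code[bad] <- sub("^(\\d{4}).*$", "\\1", demo$industry_code[bad])

print(demo)
```

The order of the two assignments matters: the name has to be extracted before industry_code is overwritten.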


2017-03-06 18:44:10 Djork
