Extract category information based on similar patterns

Suppose I have the following data frame:

table <- data.frame(col1 = c('4.3 automatic version 1', '3.2 manual version 2',
                             '2.3 version 1', '9.0 version 6'),
                    col2 = c('ite auto version 2', 'ite version 3', '2.5 manual version 2',
                             'vserion auto 5'))

                     col1                 col2
1 4.3 automatic version 1   ite auto version 2
2    3.2 manual version 2        ite version 3
3           2.3 version 1 2.5 manual version 2
4           9.0 version 6       vserion auto 5

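I want to add a column whose values are only "automatic" or "manual", based on the contents of col1 and col2. If col1 or col2 contains a word such as "auto" or "automatic", col3 should be "automatic". If col1 or col2 contains something like "manual", col3 should be "manual", like this: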

                     col1                 col2      col3
1 4.3 automatic version 1   ite auto version 2 automatic
2    3.2 manual version 2        ite version 3    manual
3           2.3 version 1 2.5 manual version 2    manual
4           9.0 version 6       vserion auto 5 automatic

Source: lolo, 2017-06-06

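Will the result be binary (only 'auto' or 'manual'), or more open-ended (automatic, manual, neither, both, other col3 values ...)?

I like to keep things flexible. I also like to keep intermediate data structures. So there is almost certainly a shorter, more memory-efficient solution than this one.

Note that I use regular expressions to make the search flexible (based on your use of the words "similar" and "like"). I made a few changes to the input data for demonstration purposes, and I also added some edge cases.

Another approach might use the tm text-mining package. That would give you more flexibility than this solution, at the cost of some additional complexity.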
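As a small illustration of the regular-expression matching described here, the following sketch applies `grepl` with an alternation pattern to a few strings. The `samples` vector and the variable names are made up for this example and are not part of the answer's code.

```r
# Hypothetical sample strings, chosen only to illustrate the matching
samples <- c("4.3 automatic version 1", "ite auto version 2",
             "3.2 manual version 2", "9.0 version 6")

# "|" is alternation: a string matches if any alternative occurs anywhere in it
auto.pattern <- "auto|automated|automatic"

# grepl returns one TRUE/FALSE per input string
is.auto <- grepl(pattern = auto.pattern, x = samples)
print(is.auto)
# [1]  TRUE  TRUE FALSE FALSE
```

Because `grepl` matches substrings, the single alternative `auto` would already match "automated" and "automatic"; spelling out the variants mainly documents the intent.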

my.table <-
    data.frame(
        col1 = c(
            '4.3 automatic version 1',
            '3.2 manual version 2',
            '2.3 version 1',
            '9.0 version 6',
            'maybe standard',
            'or neither'
        ),
        col2 = c(
            'ite automated version 2',
            'ite version 3',
            '2.5 manual version 2',
            'vserion auto 5',
            'maybe automatic',
            'for reals'
        )
    )

search.terms <- c("auto|automated|automatic", "manual|standard") 
names(search.terms) <- c("automatic", "manual") 

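# For one regex, return a logical vector: TRUE for each row in which any cell matches
term.test <- function(term) {
    term.pres <- apply(
        my.table,
        MARGIN = 1,
        FUN = function(one.row) {
            # grepl gives one logical per cell; any() collapses them per row
            any(grepl(pattern = term, x = one.row))
        }
    )
    return(term.pres)
}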

term.presence <- lapply(X = search.terms, term.test) 

term.presence <- do.call(cbind.data.frame, term.presence) 

names(term.presence) <- names(search.terms) 

# Turn each logical column into a column holding the label name or ''
as.labels <- lapply(names(search.terms), function(one.term) {
    tempflag <- term.presence[, one.term]
    tempcol <- rep('', length(tempflag))
    tempcol[tempflag] <- one.term
    return(tempcol)
})

as.labels <- do.call(cbind.data.frame, as.labels)
names(as.labels) <- names(search.terms)

# Join the non-empty labels of each row into one string
labels.concat <-
    apply(
        as.labels,
        MARGIN = 1,
        FUN = function(one.row) {
            temp <- unique(sort(one.row))
            temp <- temp[nchar(temp) > 0]
            temp <- paste(temp, collapse = "; ")
            return(temp)
        }
    )

my.table$col3 <- labels.concat 

print(my.table)

This gives

                     col1                    col2              col3
1 4.3 automatic version 1 ite automated version 2         automatic
2    3.2 manual version 2           ite version 3            manual
3           2.3 version 1    2.5 manual version 2            manual
4           9.0 version 6          vserion auto 5         automatic
5          maybe standard         maybe automatic automatic; manual
6              or neither               for reals
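For comparison, here is a minimal sketch of a more compact variant, assuming the same data and search terms. It is not the code above: it swaps the row-wise `apply` over cells for a vectorized `grepl` on the pasted columns, and the name `row.text` is introduced only for this example.

```r
my.table <- data.frame(
  col1 = c("4.3 automatic version 1", "3.2 manual version 2", "2.3 version 1",
           "9.0 version 6", "maybe standard", "or neither"),
  col2 = c("ite automated version 2", "ite version 3", "2.5 manual version 2",
           "vserion auto 5", "maybe automatic", "for reals"),
  stringsAsFactors = FALSE
)
search.terms <- c(automatic = "auto|automated|automatic",
                  manual = "manual|standard")

# Join both columns into one string per row so each pattern is tested once per row
row.text <- paste(my.table$col1, my.table$col2)

# One logical column per label: TRUE where the pattern occurs in that row
term.presence <- sapply(search.terms, grepl, x = row.text)

# Collapse the names of the TRUE columns into a single label per row
my.table$col3 <- apply(term.presence, 1, function(flags) {
  paste(names(search.terms)[flags], collapse = "; ")
})

print(my.table)
```

The intermediate `term.presence` matrix is kept, so the TRUE/FALSE flags per label remain available alongside the combined `col3`.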

Source: 2017-06-06 23:13:14

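Excellent answer! – lolo

Thanks. Make sure you are looking at the most recent version of the answer... it has evolved a lot over the last hour! I think I am done now.

Thanks again, I only just saw the latest update. It is even better, although the previous TRUE/FALSE version was useful too. – lolo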
