2011-10-12 534 views
2

Split one string into multiple strings on separate rows

I have a data frame containing long strings, each associated with a "Sample":

Sample Data 
    1  000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N 
    2  000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N 

I would like a simple way to break each string into 5 pieces, in the following format:

Sample X 
CCT6 - Characters 1-33 
GAT1 - Characters 34-68 
IMD3 - Characters 69-99 
PDR3 - Characters 100-130 
RIM15 - Characters 131-168 

This should give output that looks like this for each sample:

Sample 1 
CCT6 - 000000000000000000000000000N01000 
GAT1 - 000000000N0N000000000N00N0000NN00N0 
IMD3 - N000000100000N00N0N0000000NNNN0 
PDR3 - 1111111111111111111111111111111 
RIM15 - 0000000000000000000N000000N0000000000N 

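I have been able to use the substr function to break the long string into individual pieces, but I would like to automate this so that I get all 5 pieces in one output. Ideally, that output would also be a data frame.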

Answers

5

This is what ?read.fwf is for.

First, some data that looks like the data in your question:

x <- data.frame(Sample = c(1, 2), Data = c("000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N", 
"000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N"), 
stringsAsFactors = FALSE) 

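Now use read.fwf, specifying the width of each field and the field names, and that all columns should be read as class character. We wrap the text column of the example data in a textConnection so that it can be treated as a connection, which read.* and other functions generally understand.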

(strs <- read.fwf(textConnection(x$Data), widths = c(33, 35, 31, 31, 38), colClasses = "character", col.names = c("CCT6", "GAT1", "IMD3", "PDR3", "RIM15"))) 


           CCT6        GAT1       IMD3       PDR3         RIM15 
1 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N 
2 000000000000000000000000000N01000 000000000N0N000000000N00N0000NN00N0 N000000100000N00N0N0000000NNNN0 1111111111111111111111111111111 0000000000000000000N000000N0000000000N 
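
The widths passed to read.fwf can be derived from the character ranges given in the question rather than counted by hand. A minimal sketch, assuming the ranges are contiguous and cover the whole 168-character string; the vector names starts, ends, and widths are illustrative only:

```r
# Start and end positions copied from the "Sample X" layout in the question
starts <- c(CCT6 = 1,  GAT1 = 34, IMD3 = 69, PDR3 = 100, RIM15 = 131)
ends   <- c(CCT6 = 33, GAT1 = 68, IMD3 = 99, PDR3 = 130, RIM15 = 168)

# A field that runs from `start` to `end` holds end - start + 1 characters
widths <- ends - starts + 1
widths

# Sanity checks: the fields tile the string with no gaps or overlaps
stopifnot(sum(widths) == 168)
stopifnot(all(starts[-1] == ends[-length(ends)] + 1))
```

The resulting values match the widths = c(33, 35, 31, 31, 38) argument used in the read.fwf call.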

Now loop over the rows and print each one in the format from your example:

for (i in 1:nrow(strs)) { 
    writeLines(paste("Sample", i)) 
    writeLines(paste(names(strs), strs[i, ], sep = " - ")) 
} 

Giving, for example:

Sample 2 
CCT6 - 000000000000000000000000000N01000 
GAT1 - 000000000N0N000000000N00N0000NN00N0 
IMD3 - N000000100000N00N0N0000000NNNN0 
PDR3 - 1111111111111111111111111111111 
RIM15 - 0000000000000000000N000000N0000000000N 
+0

This works great! I just don't know how to save the final data so that I can access it again later.

+0

You can open a file connection and use writeLines with the 'con =' argument, or you can use 'save(strs, file = "strpieces.rda")'
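
Both suggestions in this comment can be sketched as follows. This is only an illustration: the small strs data frame is a made-up stand-in for the real one, and the tempfile() paths are assumptions to be replaced with your own file names.

```r
# Hypothetical stand-in for the `strs` data frame built by read.fwf
strs <- data.frame(CCT6 = c("000N", "0100"), GAT1 = c("N0N0", "00NN"),
                   stringsAsFactors = FALSE)

# Option 1: save the R object, then restore it later with load()
rda_file <- tempfile(fileext = ".rda")   # assumed path
save(strs, file = rda_file)
rm(strs)
load(rda_file)                           # recreates an object named `strs`

# Option 2: write the formatted text through a file connection
txt_file <- tempfile(fileext = ".txt")   # assumed path
con <- file(txt_file, open = "w")
for (i in seq_len(nrow(strs))) {
  writeLines(paste("Sample", i), con = con)
  writeLines(paste(names(strs), unlist(strs[i, ]), sep = " - "), con = con)
}
close(con)

readLines(txt_file)                      # inspect what was written
```

save() keeps the data frame itself, which is usually the more convenient choice if the pieces are needed for further work in R; the text file only keeps the printed layout.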

+0

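One problem I have now run into with this code is that it separates the original sample ID numbers from the data in the final structure. In my example the samples are numbered sequentially starting from 1, but that is not the case in my actual data set. How can I keep the link, so that the final output has whatever sample ID is in the original data table attached to the split strings?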
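
One possible way to keep the IDs, sketched under the assumption that read.fwf returns one row per input string in the original order, is to bind the Sample column back onto the result. The sample IDs, strings, widths, and field names below are made up for illustration:

```r
# Hypothetical data with non-sequential sample IDs and short toy strings
x <- data.frame(Sample = c(17, 42),
                Data = c("AAABBCCCC", "DDDEEFFFF"),
                stringsAsFactors = FALSE)

# Made-up widths and field names for this toy example
con <- textConnection(x$Data)
pieces <- read.fwf(con, widths = c(3, 2, 4), colClasses = "character",
                   col.names = c("F1", "F2", "F3"))
close(con)

# Rows come back in input order, so the IDs line up with the pieces
out <- cbind(Sample = x$Sample, pieces)
out

# Print using the real sample ID instead of the row index
for (i in seq_len(nrow(out))) {
  writeLines(paste("Sample", out$Sample[i]))
  writeLines(paste(names(pieces), unlist(pieces[i, ]), sep = " - "))
}
```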

1
SampX <- textConnection("CCT6 - Characters 1-33 
GAT1 - Characters 34-68 
IMD3 - Characters 69-99 
PDR3 - Characters 100-130 
RIM15 - Characters 131-168") 
dfSampX <-read.table(SampX, sep="-") 
dfSampX$V4 <- as.numeric(sub("Characters ", "", dfSampX$V2)) 

sampdat <- read.table(textConnection("Sample Data 
    1  000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N 
    2  000000000000000000000000000N01000000000000N0N000000000N00N0000NN00N0N000000100000N00N0N0000000NNNN011111111111111111111111111111110000000000000000000N000000N0000000000N 
"), header=TRUE,stringsAsFactors=FALSE) 

This code splits the strings into the groups:

apply(dfSampX[,c(3,4)], 1, function(x) substr(sampdat[,2], x["V4"], x["V3"])) 
    [,1]        [,2]         
[1,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0" 
[2,] "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0" 
    [,3]        [,4]        
[1,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111" 
[2,] "N000000100000N00N0N0000000NNNN0" "1111111111111111111111111111111" 
    [,5]          
[1,] "0000000000000000000N000000N0000000000N" 
[2,] "0000000000000000000N000000N0000000000N" 
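
Since the question asks for a data frame, the character matrix returned by apply can be converted and given the field names. A minimal sketch, using a small made-up matrix m in place of the real result (one row per sample, one column per piece):

```r
# Hypothetical stand-in for the matrix returned by apply():
# matrix() fills column by column, so column 1 is "000N", "0100"
m <- matrix(c("000N", "0100", "N0N0", "00NN"), nrow = 2)

# Convert to a data frame, keeping the pieces as character strings
pieces_df <- as.data.frame(m, stringsAsFactors = FALSE)

# Attach field names; these two are taken from the question's layout
names(pieces_df) <- c("CCT6", "GAT1")
pieces_df
```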

This code delivers the pieces in list format:

res <- lapply(sampdat$Data, function(x) 
      apply(dfSampX[,c(3,4)], 1, function(y) substr(x, y["V4"], y["V3"]))) 

res2 <- lapply(res, function(x){ names(x) <- dfSampX$V1 ; return(x)}) 
res2 

[[1]] 
            CCT6          GAT1 
    "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0" 
            IMD3          PDR3 
     "N000000100000N00N0N0000000NNNN0"  "1111111111111111111111111111111" 
            RIM15 
"0000000000000000000N000000N0000000000N" 

[[2]] 
            CCT6          GAT1 
    "000000000000000000000000000N01000" "000000000N0N000000000N00N0000NN00N0" 
            IMD3          PDR3 
     "N000000100000N00N0N0000000NNNN0"  "1111111111111111111111111111111" 
            RIM15 
"0000000000000000000N000000N0000000000N" 

And this produces the specified output format:

for (samp in seq_along(res2)) { cat("Sample ", samp, "\n") 
     invisible(sapply(1:5, function(y) 
      cat(as.character(dfSampX$V1[y]), " - ", res2[[samp]][y], "\n"))) } 
Sample 1 
CCT6 - 000000000000000000000000000N01000 
GAT1 - 000000000N0N000000000N00N0000NN00N0 
IMD3 - N000000100000N00N0N0000000NNNN0 
PDR3 - 1111111111111111111111111111111 
RIM15 - 0000000000000000000N000000N0000000000N 
Sample 2 
CCT6 - 000000000000000000000000000N01000 
GAT1 - 000000000N0N000000000N00N0000NN00N0 
IMD3 - N000000100000N00N0N0000000NNNN0 
PDR3 - 1111111111111111111111111111111 
RIM15 - 0000000000000000000N000000N0000000000N 

The invisible() call is needed to suppress the NULL values that sapply would otherwise return in a list structure.
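
The effect of invisible() can be checked in isolation. A small sketch, with res as an illustrative name: cat() returns NULL, so sapply() over it collects a list of NULLs, and withVisible() reports whether a value would be auto-printed at the top level.

```r
# cat() prints its arguments and returns NULL, so the result is a list of NULLs
res <- sapply(1:2, function(i) cat("line", i, "\n"))
str(res)

# A plain value is visible and would be auto-printed at the prompt
withVisible(res)$visible

# Wrapping it in invisible() marks it as not to be auto-printed
withVisible(invisible(res))$visible
```

Values computed inside a for loop are not auto-printed in any case, so the wrapper matters mainly when the sapply call is run on its own at the prompt.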

+0

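Hmm... I don't believe this is what I am looking for. I'd like to be able to run the script on a data frame with multiple samples. Above, it looks like you have entered the whole string into the code for each sample. I'd also like my output to look like the example I provided above.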

+0

Have you looked at the "sampdat" object with str()? Is it different from your data? If so, please provide the dput() output of your object.

+0

Added a naming step.
