How do I split a string into substrings of a given length?

38

Here is one way:

substring("aabbccccdd", seq(1, 9, 2), seq(2, 10, 2)) 
#[1] "aa" "bb" "cc" "cc" "dd"

or, more generally:

text <- "aabbccccdd" 
substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2)) 
#[1] "aa" "bb" "cc" "cc" "dd"

Edit: this is much, much faster:

sst <- strsplit(text, "")[[1]] 
out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])

It first splits the string into individual characters. Then it pastes the odd-indexed and even-indexed elements together.
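To make the recycling step concrete, the sketch below prints the result built from the intermediate vectors for the example string. The variable names `odd` and `even` are introduced here purely for illustration and are not part of the answer.

```r
text <- "aabbccccdd"
sst <- strsplit(text, "")[[1]]   # one element per character

# A logical index shorter than the vector is recycled along it,
# so c(TRUE, FALSE) keeps positions 1, 3, 5, ... and
# c(FALSE, TRUE) keeps positions 2, 4, 6, ...
odd  <- sst[c(TRUE, FALSE)]
even <- sst[c(FALSE, TRUE)]

# paste0 pairs the i-th odd-position character with the i-th even-position one
out <- paste0(odd, even)
print(out)
#[1] "aa" "bb" "cc" "cc" "dd"
```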

Timings

text <- paste(rep(paste0(letters, letters), 1000), collapse="") 
g1 <- function(text) { 
    substring(text, seq(1, nchar(text)-1, 2), seq(2, nchar(text), 2)) 
} 
g2 <- function(text) { 
    sst <- strsplit(text, "")[[1]] 
    paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)]) 
} 
identical(g1(text), g2(text)) 
#[1] TRUE 
library(rbenchmark) 
benchmark(g1=g1(text), g2=g2(text)) 
#  test replications elapsed relative user.self sys.self user.child sys.child
#1   g1          100  95.451 79.87531    95.438        0          0         0
#2   g2          100   1.195  1.00000     1.196        0          0         0

Source

2012-07-23 20:05:12 GSee

+0

Interesting, I didn't know about 'substring'. It's nicer, since 'substr' doesn't take vector arguments for start/end. – 2012-07-23 20:23:25

+2

Brilliant! The second version is really, really fast! – MadSeb 2012-07-24 01:54:32

+0

I wonder if there is something like this that would split "aabbbcccccdd" into aa bbb ccccc dd. I'm using grepexpr for now. – jackStinger 2013-01-07 12:32:50

8

string <- "aabbccccdd" 
# total length of string 
num.chars <- nchar(string) 

# the indices where each substr will start 
starts <- seq(1,num.chars, by=2) 

# chop it up 
sapply(starts, function(ii) { 
    substr(string, ii, ii+1) 
})

which gives

[1] "aa" "bb" "cc" "cc" "dd"

Source

2012-07-23 20:09:11

1

One can use a matrix to group the characters:

s2 <- function(x) { 
    m <- matrix(strsplit(x, '')[[1]], nrow=2) 
    apply(m, 2, paste, collapse='') 
} 

s2('aabbccddeeff') 
## [1] "aa" "bb" "cc" "dd" "ee" "ff"

Unfortunately, this breaks for an input of odd string length, giving a warning:

s2('abc') 
## [1] "ab" "ca" 
## Warning message: 
## In matrix(strsplit(x, "")[[1]], nrow = 2) : 
## data length [3] is not a sub-multiple or multiple of the number of rows [2]

More unfortunately, g1 and g2 from @GSee silently return incorrect results for an input of odd string length:

g1('abc') 
## [1] "ab" 

g2('abc') 
## [1] "ab" "cb"

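Here is a function in the spirit of s2 that takes a parameter for the number of characters in each group and leaves the last entry short if necessary: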

s <- function(x, n) { 
    sst <- strsplit(x, '')[[1]] 
    m <- matrix('', nrow=n, ncol=(length(sst)+n-1)%/%n) 
    m[seq_along(sst)] <- sst 
    apply(m, 2, paste, collapse='') 
} 

s('hello world', 2) 
## [1] "he" "ll" "o " "wo" "rl" "d" 
s('hello world', 3) 
## [1] "hel" "lo " "wor" "ld"

(It is indeed slower than g2, but faster than g1 by a factor of about 7.)
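For comparison, a vectorised variant taking the same two arguments can be sketched with substring alone, relying on the fact that substring truncates an end position that runs past the end of the string. The name `split_fixed` is hypothetical, and the sketch assumes a single string and a positive n. Since it uses substring in the same way as g1, it may well be similarly slow on long strings; no timing is claimed here.

```r
split_fixed <- function(x, n) {
    len <- nchar(x)
    # seq() below would fail for an empty string, so handle it up front
    if (len == 0L) return(character(0))
    starts <- seq(1L, len, by = n)   # first position of each chunk
    # an end position beyond nchar(x) is truncated, so the last chunk may be shorter
    substring(x, starts, starts + n - 1L)
}

split_fixed("hello world", 2)
## [1] "he" "ll" "o " "wo" "rl" "d"
split_fixed("hello world", 3)
## [1] "hel" "lo " "wor" "ld"
split_fixed("abc", 2)
## [1] "ab" "c"
```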

Source

2013-02-18 17:44:59

+0

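If an odd number of characters is possible, it seems to me that handling that fact would be faster than introducing an 'apply' loop. I bet this is faster: 'out <- g2(x); if (nchar(x) %% 2 == 1L) out[length(out)] <- substring(out[length(out)], 1, 1); out' – GSee 2013-02-18 19:44:19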
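The one-liner in this comment can be written out as a small function to check what it returns for odd-length input. This is only a sketch: g2 is copied from GSee's answer, while the wrapper name `g2_odd` is hypothetical. The speed claim in the comment is not tested here.

```r
# g2 as defined in GSee's answer
g2 <- function(text) {
    sst <- strsplit(text, "")[[1]]
    paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}

# hypothetical wrapper applying the fix suggested in the comment
g2_odd <- function(x) {
    out <- g2(x)
    # for odd lengths the last element carries a recycled second character;
    # keep only its first character
    if (nchar(x) %% 2 == 1L) out[length(out)] <- substring(out[length(out)], 1, 1)
    out
}

g2_odd("abc")
## [1] "ab" "c"
g2_odd("aabbccccdd")
## [1] "aa" "bb" "cc" "cc" "dd"
```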

1

Ugly, but it works:

sequenceString <- "ATGAATAAAG" 

J <- 3  # maximum sequence length in file
sequenceSmallVecStart <- 
    substring(sequenceString, seq(1, nchar(sequenceString)-J+1, J), 
    seq(J,nchar(sequenceString), J)) 
sequenceSmallVecEnd <- 
    substring(sequenceString, max(seq(J, nchar(sequenceString), J))+1) 
sequenceSmallVec <- 
    c(sequenceSmallVecStart,sequenceSmallVecEnd) 
cat(sequenceSmallVec,sep = "\n")

which gives ATG AAT AAA G

Source

2014-04-24 07:28:27 den2042

5

There are two easy possibilities:

s <- "aabbccccdd"

gregexpr and regmatches:

regmatches(s, gregexpr(".{2}", s))[[1]] 
# [1] "aa" "bb" "cc" "cc" "dd"

strsplit:

strsplit(s, "(?<=.{2})", perl = TRUE)[[1]] 
# [1] "aa" "bb" "cc" "cc" "dd"
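The gregexpr pattern ".{2}" only returns complete pairs, so a trailing single character would be dropped. A minimal sketch of a variant that keeps a shorter final piece and takes the chunk length as a parameter, using the quantifier {1,n}; the helper name `split_regex` is hypothetical:

```r
split_regex <- function(x, n) {
    # ".{1,n}" greedily matches n characters where possible,
    # and fewer only for the leftover at the end of the string
    pattern <- sprintf(".{1,%d}", as.integer(n))
    regmatches(x, gregexpr(pattern, x))[[1]]
}

split_regex("aabbccccdd", 2)
# [1] "aa" "bb" "cc" "cc" "dd"
split_regex("hello world", 3)
# [1] "hel" "lo " "wor" "ld"
```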

Source

2014-04-24 07:38:39
