2017-02-18 51 views
1

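Subset a DNAStringSet by a sub-pattern and remove the sub-pattern in R

I want to keep only the sequences that contain a given substring, and then remove that substring from them. I can do the first part, but I don't know how to remove the substring.

Here is an example: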

library(Biostrings) 
myseq <-DNAStringSet(c("CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA", "CCCATGAACATAGATCC", "CCCGTACAGATCACGTG")) 
names(myseq) <- letters[1:3] 
myseq 

A DNAStringSet instance of length 3 
width seq                           names    
[1] 40 CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA                 a 
[2] 17 CCCATGAACATAGATCC                       b 
[3] 17 CCCGTACAGATCACGTG                       c 

The sequence I want to remove is AGATCGGAAGAGCACACGTCTGAA, which occurs in the first sequence.

matchPattern("AGATCGGAAGAGCACACGTCTGAA", myseq[[1]]) 

Views on a 40-letter DNAString subject 
subject: CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA 
views: 
    start end width 
[1]  9 32 24 [AGATCGGAAGAGCACACGTCTGAA] 

To subset, I do the following:

pat <- vmatchPattern("AGATCGGAAGAGCACACGTCTGAA", myseq) 
myseq[ lapply(lapply(pat, isEmpty), function(x) x == FALSE) ] 

A DNAStringSet instance of length 3 
    width seq                           names    
[1] 40 CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA                 a 
[2]  0                            b 
[3]  0                            c 

The output should be:

A DNAStringSet instance of length 3 
    width seq                           names    
[1] 11 CCCCCCATGAA                         a 
[2]  0                            b 
[3]  0                            c 

Answers

1

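You can use vcountPattern to flag the sequences that contain a match, pass that flag to ifelse, and use str_replace to replace the match with an empty string. Sequences without a match are returned as empty strings.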

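library(stringr)

pattern <- 'AGATCGGAAGAGCACACGTCTGAA'
# TRUE for every sequence that contains the pattern at least once
has_match <- vcountPattern(pattern, myseq) > 0
# ifelse is vectorized: matched sequences get the pattern removed,
# all other sequences become empty strings
myseq2 <- DNAStringSet(
    ifelse(has_match,
           str_replace(as.character(myseq), pattern, ''),
           '')
)
names(myseq2) <- names(myseq)
myseq2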

A DNAStringSet instance of length 3 
    width seq              names    
[1] 16 CCCATGAACCCATGAA          a 
[2]  0               b 
[3]  0               c 

Slightly more readable with the pipe operator:

library(magrittr)
ifelse(vcountPattern('AGATCGGAAGAGCACACGTCTGAA', myseq) > 0, str_replace(as.character(myseq), 'AGATCGGAAGAGCACACGTCTGAA', ''), '') %>% 
    DNAStringSet() %>% 
    setNames(names(myseq)) -> myseq2 
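
If the sequences without a match should be dropped rather than kept as empty strings, the two steps (subset, then remove) can be done separately. The following is a minimal sketch, not the answer author's method: it assumes each matching sequence contains the pattern once, and the names `adapter`, `hits`, and `trimmed` are made up for illustration. It uses base R `sub()` with `fixed = TRUE` instead of `str_replace`.

```r
library(Biostrings)

# Same sample data as in the question
myseq <- DNAStringSet(c(
    a = "CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA",
    b = "CCCATGAACATAGATCC",
    c = "CCCGTACAGATCACGTG"
))
adapter <- "AGATCGGAAGAGCACACGTCTGAA"  # illustrative name for the sub-pattern

# Step 1: keep only the sequences that contain the pattern
hits <- myseq[vcountPattern(adapter, myseq) > 0]

# Step 2: remove the first occurrence of the pattern;
# fixed = TRUE treats the pattern as a literal string, not a regex
trimmed <- DNAStringSet(sub(adapter, "", as.character(hits), fixed = TRUE))
names(trimmed) <- names(hits)
trimmed
```

With the sample data, only sequence `a` should remain. To remove every occurrence rather than the first one, `gsub()` could be used in place of `sub()`.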
0

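I'm not familiar with the bioinformatics packages, but if you are fine with converting the data (I believe it should be possible to convert the resulting list back to the format the package uses), you can use the following approach:

1) use the stringr library to remove the desired pattern
2) compute the length of the new sequence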

# load biostrings package 
library(Biostrings) 

# create sample dataset 
myseq <-DNAStringSet(c("CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA", "CCCATGAACATAGATCC", "CCCGTACAGATCACGTG")) 
names(myseq) <- letters[1:3] 

# remove sequences with no match 
data <- myseq[vcountPattern("AGATCGGAAGAGCACACGTCTGAA", myseq) > 0] 

# load stringr library 
library(stringr) 

# replace the matched sequence 
# (convert to character first, since stringr works on character vectors) 
test <- lapply(as.character(data), str_replace, "AGATCGGAAGAGCACACGTCTGAA", "") 
# put together the new sequence and its length 
test <- mapply(c, lapply(test, nchar), test, SIMPLIFY = FALSE)
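
The conversion back to the package's own format mentioned above can be sketched as follows. This is an illustration under assumptions, not part of the original answer: `cleaned` is a hypothetical named list of plain character strings, as a replacement step over the matched sequences might produce. Once the data is a DNAStringSet again, `width()` reports the lengths, so they do not need to be stored next to the sequences.

```r
library(Biostrings)

# Hypothetical result of the replacement step: a named list of strings
cleaned <- list(a = "CCCATGAACCCATGAA")

# unlist() turns the list into a named character vector,
# which DNAStringSet() accepts directly (names are kept)
result <- DNAStringSet(unlist(cleaned))

# Sequence lengths come from the object itself
seq_lengths <- width(result)
result
```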