Are there text-processing functions in R that operate at the word level?

I am trying to find a set of functions in R that operate at the word level, for example a function that returns the position of a word. Given the following sentence and query:

sentence <- "A sample sentence for demo" 
query <- "for"

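the function would return 4, because for is the 4th word.
It would be great to also have a utility function that lets me extend query to the left or to the right. For example, extend(query, 'right') would return for demo, and extend(query, 'left') would return sentence for.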

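I have already gone through functions such as grep, gregexpr, word from the stringr package, and others. All of them seem to operate at the character level.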

Source

2017-04-02 Imran Ali

Check out stringr::word, as in: word(string, start = 1L, end = start, sep = fixed(" ")). You can also use start = -2L, end = -1L to get the last two words. – p0bs

I wrote my own functions. The indexOf method returns the index of word if it is found in sentence, and -1 otherwise, much like Java's indexOf().

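indexOf <- function(sentence, word) {
    # Split the sentence on single spaces into a character vector of words
    listOfWords <- strsplit(sentence, split = " ")
    sentenceAsVector <- unlist(listOfWords)

    # Return -1 when the word is absent, otherwise every matching position
    if (!(word %in% sentenceAsVector)) {
        result <- -1
    } else {
        result <- which(sentenceAsVector == word)
    }
    return(result)
}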

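The extend method works correctly, but it is quite long and does not look like R code. If query is a word on the boundary of the sentence, i.e. the first or the last word, the first two words or the last two words are returned.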

extend <- function(sentence, query, direction) {
    listOfWords <- strsplit(sentence, split = " ")
    sentenceAsVector <- unlist(listOfWords)
    lengthOfSentence <- length(sentenceAsVector)
    # Use the first match only, so the scalar comparisons below are valid
    location <- indexOf(sentence, query)[1]
    # indexOf() returns -1 when the query is absent: nothing to extend
    if (location == -1) {
        return(NA_character_)
    }
    # The query sits on a boundary if it is the first or the last word
    boundary <- location == 1 || location == lengthOfSentence
    if (!boundary) {
        if (direction == "right") {
            # Query followed by the next word
            return(paste(sentenceAsVector[location],
                         sentenceAsVector[location + 1],
                         sep = " "))
        } else if (direction == "left") {
            # Previous word followed by the query
            return(paste(sentenceAsVector[location - 1],
                         sentenceAsVector[location],
                         sep = " "))
        }
    } else {
        if (location == 1) {
            # First word: return the first two words
            return(paste(sentenceAsVector[1], sentenceAsVector[2], sep = " "))
        }
        if (location == lengthOfSentence) {
            # Last word: return the last two words
            return(paste(sentenceAsVector[lengthOfSentence - 1],
                         sentenceAsVector[lengthOfSentence], sep = " "))
        }
    }
}
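
A more compact variant of the same idea can be sketched in base R. The helper name extend_words is hypothetical and not taken from any package. The sketch assumes that words are separated by single spaces, uses only the first match, and clamps the two-word window at the sentence boundaries, so the first or last word still returns two words as described above.

```r
# Hypothetical helper: return the query together with one neighbouring word.
extend_words <- function(sentence, query, direction = c("right", "left")) {
    direction <- match.arg(direction)
    words <- strsplit(sentence, split = " ", fixed = TRUE)[[1]]
    n <- length(words)
    pos <- match(query, words)   # position of the first match, NA if absent
    if (is.na(pos) || n < 2) {
        return(NA_character_)
    }
    # Start of the two-word window, clamped to stay inside the sentence
    start <- if (direction == "right") pos else pos - 1
    start <- min(max(start, 1), n - 1)
    paste(words[start:(start + 1)], collapse = " ")
}

sentence <- "A sample sentence for demo"
extend_words(sentence, "for", "right")   # "for demo"
extend_words(sentence, "for", "left")    # "sentence for"
extend_words(sentence, "A", "left")      # "A sample"
```

Returning NA_character_ for a missing query is a design choice of this sketch; the -1 convention of indexOf would work just as well.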

Source

2017-04-06 19:46:09

As I mentioned in my comment, stringr is useful in these situations.

library(stringr) 

sentence <- "A sample sentence for demo" 
wordNumber <- 4L 

fourthWord <- word(string = sentence,
                   start = wordNumber)

previousWords <- word(string = sentence,
                      start = wordNumber - 1L,
                      end = wordNumber)

laterWords <- word(string = sentence,
                   start = wordNumber,
                   end = wordNumber + 1L)

This yields:

> fourthWord 
[1] "for" 
> previousWords 
[1] "sentence for" 
> laterWords 
[1] "for demo"
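
The example above hard-codes the word number. As a minimal sketch, assuming words are separated by single spaces and that only the first match matters, the position can be looked up from the query first and then passed to word(). The variable names words and lastTwoWords are illustrative.

```r
library(stringr)

sentence <- "A sample sentence for demo"
query <- "for"

# Split into words on single spaces, then find where the query sits
words <- str_split(sentence, fixed(" "))[[1]]
wordNumber <- which(words == query)[1]   # first match; NA if the query is absent
wordNumber                               # 4

# Extend the query one word to the right
laterWords <- word(sentence, start = wordNumber, end = wordNumber + 1L)
laterWords                               # "for demo"

# Negative positions count backwards from the last word
lastTwoWords <- word(sentence, start = -2L, end = -1L)
```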

I hope that helps.

Source

2017-04-02 16:32:25 p0bs

If you use scan, it will split the input on whitespace:

> s.scan <- scan(text=sentence, what="") 
Read 5 items 
> which(s.scan == query) 
[1] 4

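The what="" argument tells scan to expect character rather than numeric input. If your input consists of full English sentences, you may need to strip the punctuation using gsub with pattern = "[[:punct:]]". If you are trying to classify parts of speech or process large documents, you may also want to look at the tm (text mining) package.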
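
The punctuation step can be sketched as follows. The example sentence is made up for illustration, and lower-casing before the comparison is an added assumption so that For and for count as the same word.

```r
sentence <- "For one, for two! For three?"
query <- "for"

# Remove punctuation so that "one," becomes "one" before splitting
clean <- gsub(pattern = "[[:punct:]]", replacement = "", x = sentence)

# quiet = TRUE suppresses the "Read N items" message
s.scan <- scan(text = clean, what = "", quiet = TRUE)

# Case-insensitive match on whole words
positions <- which(tolower(s.scan) == query)
positions   # 1 3 5
```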

Source

2017-04-02 17:52:53

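The answer depends on what you mean by a "word". If you mean whitespace-separated tokens, then @imran-ali's answer works fine. If you mean words as defined by Unicode, with special attention to punctuation, then you need something more sophisticated.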
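
To illustrate the difference, here is a small base R sketch of what whitespace splitting does to a sentence with punctuation. The sentence and the variable name tokens are only for illustration.

```r
sentence2 <- "for one, for two! for three? for four"

# Whitespace-separated tokens keep their trailing punctuation
tokens <- strsplit(sentence2, split = " ", fixed = TRUE)[[1]]
tokens
# "for" "one," "for" "two!" "for" "three?" "for" "four"

which(tokens == "one")   # integer(0): the token is "one,", not "one"
which(tokens == "for")   # 1 3 5 7
```

A tokenizer that treats each punctuation mark as its own token will report different positions for the same words, so the two definitions of "word" are not interchangeable.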

The following handles punctuation correctly:

library(corpus) 
sentence <- "A sample sentence for demo" 
query <- "for" 

# use text_locate to find all instances of the query, with context 
text_locate(sentence, query) 
##   text before             instance after
## 1 1    A sample sentence  for      demo

# find the number of tokens before, then add 1 to get the position 
text_ntoken(text_locate(sentence, query)$before) + 1 
## 4

This also works if there are multiple matches:

sentence2 <- "for one, for two! for three? for four" 
text_ntoken(text_locate(sentence2, query)$before) + 1 
## [1] 1 4 7 10

We can confirm that this is correct:

text_tokens(sentence2)[[1]][c(1, 4, 7, 10)] 
## [1] "for" "for" "for" "for"

Source

2017-10-04 22:04:08
