2017-04-02

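Are there text-processing functions in R that operate at the word level?

I am trying to find a set of functions in R that operate at the word level, for example one that returns the position of a word. Given the following sentence and query: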

sentence <- "A sample sentence for demo" 
query <- "for" 
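  1. The function would return 4, because "for" is the 4th word.

  2. It would be great to have a utility function that lets me extend query to the left or to the right. For example, extend(query, 'right') would return "for demo" and extend(query, 'left') would return "sentence for".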

I have gone through functions such as grep, gregexpr, word from the stringr package, and so on. All of them seem to operate at the character level.


Check out 'stringr::word', as in 'word(string, start = 1L, end = start, sep = fixed(" "))'. Negative indices count from the end, so 'start = -2L, end = -1L' gives you the last two words. – p0bs

Answers

0

I wrote my own functions. The indexOf method returns the index of word if it is found in sentence, and -1 otherwise, much like Java's indexOf().

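indexOf <- function(sentence, word) {
    # Split the sentence on single spaces into a character vector of words
    sentenceAsVector <- unlist(strsplit(sentence, split = " ", fixed = TRUE))

    # Positions of every exact match (more than one index if the word repeats)
    result <- which(sentenceAsVector == word)

    # Mimic Java's indexOf(): return -1 when the word is absent
    if (length(result) == 0) {
        result <- -1
    }
    result
}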

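The extend method works fine, but it is long and does not look like idiomatic R code. If query is a word on the boundary of the sentence, i.e. the first or the last word, the first two words or the last two words are returned.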

extend <- function(sentence, query, direction) {
    sentenceAsVector <- unlist(strsplit(sentence, split = " ", fixed = TRUE))
    lengthOfSentence <- length(sentenceAsVector)
    # Use only the first match so the comparisons below stay scalar
    location <- indexOf(sentence, query)[1]

    # Word not found: there is nothing to extend
    if (location == -1) {
        return(NA_character_)
    }

    if (location == 1) {
        # First word: always return the first two words
        span <- c(1, 2)
    } else if (location == lengthOfSentence) {
        # Last word: always return the last two words
        span <- c(lengthOfSentence - 1, lengthOfSentence)
    } else if (direction == "right") {
        span <- c(location, location + 1)
    } else if (direction == "left") {
        span <- c(location - 1, location)
    } else {
        stop("direction must be 'left' or 'right'")
    }
    paste(sentenceAsVector[span], collapse = " ")
}

0

As I mentioned in my comment, stringr is useful in these situations.

library(stringr) 

sentence <- "A sample sentence for demo" 
wordNumber <- 4L 

fourthWord <- word(string = sentence,
                   start = wordNumber)

previousWords <- word(string = sentence,
                      start = wordNumber - 1L,
                      end = wordNumber)

laterWords <- word(string = sentence,
                   start = wordNumber,
                   end = wordNumber + 1L)

This yields:

> fourthWord 
[1] "for" 
> previousWords 
[1] "sentence for" 
> laterWords 
[1] "for demo" 

I hope that helps.
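
The word number above is hard-coded. As a rough sketch of how it could be derived from the query instead, the following uses base R only. The helper names word_position and extend_query are hypothetical and not part of stringr, and the sketch assumes words are separated by single spaces. Unlike the extend function in the question, it returns just the word itself when there is no neighbour on the requested side.

```r
# Hypothetical helpers (not from any package); base R only.
# Assumes words are separated by single spaces.
word_position <- function(sentence, query) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  match(query, words)  # position of the first match, NA when absent
}

extend_query <- function(sentence, query, direction = c("right", "left")) {
  direction <- match.arg(direction)
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  pos <- match(query, words)
  if (is.na(pos)) return(NA_character_)
  # Pair the query with its neighbour on the requested side
  span <- if (direction == "right") c(pos, pos + 1L) else c(pos - 1L, pos)
  # Drop indices that fall outside the sentence
  span <- span[span >= 1L & span <= length(words)]
  paste(words[span], collapse = " ")
}

sentence <- "A sample sentence for demo"
word_position(sentence, "for")          # 4
extend_query(sentence, "for", "right")  # "for demo"
extend_query(sentence, "for", "left")   # "sentence for"
```

The position returned by word_position could equally be passed as start to stringr's word().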

1

If you use scan, it splits the input on whitespace:

> s.scan <- scan(text=sentence, what="") 
Read 5 items 
> which(s.scan == query) 
[1] 4 

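The what="" argument tells scan to expect character rather than numeric input. If your input consists of complete English sentences, you may need to use gsub with pattern="[[:punct:]]" to strip punctuation first. If you are trying to classify parts of speech or process large documents, you may also want to look at the tm (text mining) package.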
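
A minimal sketch of the punctuation step, assuming it is acceptable to delete punctuation outright before splitting. The example sentence and the variable names cleaned, tokens and positions are only illustrative.

```r
sentence2 <- "for one, for two! for three? for four"
query <- "for"

# Remove punctuation so that "one," and "one" become the same token
cleaned <- gsub(pattern = "[[:punct:]]", replacement = "", x = sentence2)

# quiet = TRUE suppresses the "Read N items" message
tokens <- scan(text = cleaned, what = "", quiet = TRUE)

# Word positions of every match, counting whitespace-separated tokens only
positions <- which(tokens == query)
positions
## [1] 1 3 5 7
```

Because the punctuation is removed rather than kept as separate tokens, these positions count words only.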

0

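The answer depends on what you mean by a "word". If you mean whitespace-separated tokens, then @imran-ali's answer works fine. If you mean words as defined by Unicode, with special attention to punctuation, then you need something more sophisticated.

The following handles punctuation correctly: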

library(corpus) 
sentence <- "A sample sentence for demo" 
query <- "for" 

# use text_locate to find all instances of the query, with context 
text_locate(sentence, query) 
## text    before    instance    after    
## 1 1     A sample sentence for  demo    

# find the number of tokens before, then add 1 to get the position 
text_ntoken(text_locate(sentence, query)$before) + 1 
## 4 

This also works if there are multiple matches:

sentence2 <- "for one, for two! for three? for four" 
text_ntoken(text_locate(sentence2, query)$before) + 1 
## [1] 1 4 7 10 

We can confirm that this is correct:

text_tokens(sentence2)[[1]][c(1, 4, 7, 10)] 
## [1] "for" "for" "for" "for" 