Keeping track of word proximity

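I am working on a small project that involves a dictionary-based text search over a collection of documents. My dictionary contains positive signal words (a.k.a. good words), but finding one of them in a document does not guarantee a positive result, because negative words (for example "not" or "not significant") may occur in the vicinity of those positive words. I want to construct a matrix that contains the document number, the positive word, and its proximity to the negative words.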

Can anyone suggest a way to do this? My project is at a very early stage, so I am giving a basic example of my text.

No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.

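This is my example document, in which candesartan cilexetil, glyburide, nifedipine, digoxin, warfarin, and hydrochlorothiazide are my positive words and "no significant" is my negative word. I want to build a word-based proximity mapping between my positive and negative words.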

Can anyone provide some helpful pointers?


2010-06-21 Shreyas Karnik

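First of all, I would suggest not using R for this task. R is great for many things, but text manipulation is not one of them. Python could be a good alternative.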

That said, if I were to implement this in R, I would probably do something like this (very, very rough):

# You will probably read these from an external file or a database 
goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "blabla", "warfarin", "hydrochlorothiazide") 
badWords <- c("no significant", "other drugs") 

mytext <- "no significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide." 
mytext <- tolower(mytext) # Let's make life a little bit easier... 

# Return the character position of the first match of each word (NA if not found)
findPositions <- function(words, text)
    {
    positions <- NULL
    for (w in words)
        {
        # fixed = TRUE matches the word literally instead of as a regular expression
        pos <- as.integer(regexpr(w, text, fixed = TRUE))
        if (pos != -1)
            {
            cat(w, "found at position", pos, "\n")
            }
        else
            {
            pos <- NA
            cat(w, "not found\n")
            }
        positions <- c(positions, pos)
        }
    positions
    }

# First we find the good words
goodPos <- findPositions(goodWords, mytext)

# And then the bad words
badPos <- findPositions(badWords, mytext)

# Note that we use -badPos so that we can calculate the distance with rowSums
# (positions come from regexpr, so distances are in characters, not words)
comb <- expand.grid(goodPos, -badPos) 
wordcomb <- expand.grid(goodWords, badWords) 
dst <- cbind(wordcomb, abs(rowSums(comb))) 

mn <- which.min(dst[,3]) 
cat(paste("The closest good-bad word pair is: ", dst[mn, 1],"-", dst[mn, 2],"\n"))
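
The code above measures distance in characters and only looks at one text. Since the question asks for a word-based proximity and a table with the document number, here is a minimal base-R sketch of that variant. The helper names (tokenize, phraseStart, proximityTable) and the second sample document are made up for illustration, and the tokenizer simply splits on anything that is not a letter or digit, which is an assumption that may not suit real drug names.

```r
# Hypothetical helper: split a text into lower-case word tokens.
tokenize <- function(text) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
  tokens[tokens != ""]
}

# Hypothetical helper: word index at which a (possibly multi-word) phrase
# first starts in a token vector, or NA if the phrase does not occur.
phraseStart <- function(phrase, tokens) {
  p <- tokenize(phrase)
  n <- length(p)
  if (n == 0 || n > length(tokens)) return(NA_integer_)
  for (i in seq_len(length(tokens) - n + 1)) {
    if (all(tokens[i:(i + n - 1)] == p)) return(i)
  }
  NA_integer_
}

# One row per (document, good word found in it), with the nearest bad word
# and the distance to it counted in words (NA if the document has no bad word).
proximityTable <- function(docs, goodWords, badWords) {
  rows <- list()
  for (d in seq_along(docs)) {
    tokens <- tokenize(docs[d])
    badPos <- vapply(badWords, phraseStart, integer(1), tokens = tokens)
    for (g in goodWords) {
      gp <- phraseStart(g, tokens)
      if (is.na(gp)) next
      dists <- abs(gp - badPos)
      nearest <- which.min(dists)      # integer(0) when no bad word occurs
      found <- length(nearest) == 1
      rows[[length(rows) + 1]] <- data.frame(
        doc = d,
        good = g,
        nearestBad = if (found) badWords[nearest] else NA_character_,
        wordDistance = if (found) dists[[nearest]] else NA_integer_,
        stringsAsFactors = FALSE)
    }
  }
  do.call(rbind, rows)                 # NULL if no good word was found at all
}

docs <- c(
  "No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.",
  "Warfarin was well tolerated."       # made-up second document
)
goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin",
               "warfarin", "hydrochlorothiazide")
badWords <- c("no significant")

res <- proximityTable(docs, goodWords, badWords)
print(res)
```

Only the first occurrence of each word is considered, as in the code above. If words can repeat within a document, phraseStart would need to return all start positions and the distance would be the minimum over all pairs.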


2010-06-21 15:15:53 nico

I found almost exactly what I was looking for. Thanks, nico! – 2010-06-21 15:26:38

Did you look at

the Natural Language Processing task view on CRAN, or
the text-mining package tm on CRAN?


2010-06-21 15:18:08

Nice packages, I didn't know about them! Still, I don't think R is the best tool for this kind of analysis. – nico 2010-06-21 15:25:01

Yes, I use the tm package regularly! – 2010-06-21 15:30:28
