Arabic text mining using R

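I'm a new user and I'd like some help with my work in R. I'm doing Arabic text mining and would appreciate help from anyone with experience in this area. So far I have not managed to normalize the Arabic text, and R does not even print Arabic characters in the console. I'm stuck, and I don't know whether I should switch to another language or tool, such as Weka. Can anyone tell me whether someone has had any success mining Arabic text with R?
By the way, I'm working on the analysis of an Arabic tweets dataset. It took me a month to collect the data, and I don't know how long it will take to preprocess the text.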

Source

2014-09-03 cecilia

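Stack Overflow is for specific programming questions, not general networking. Your question is too broad at this point. Please try editing it to focus on a single programming task. Include a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that shows the problem you are having. If you are mostly concerned about the values displayed in R, please state which operating system you are using and which R version and GUI you are running. – MrFlick 2014-09-04 03:15:15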
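The details requested in this comment can be collected from within R itself. The following is only a sketch using base R functions; the name `env_report` is made up for illustration.

```{r}
# Gather the OS, R version, GUI, and locale details worth including in a question.
# `env_report` is a hypothetical name, not something from the original thread.
env_report <- c(
  r_version = R.version.string,                    # R version string
  os        = paste(Sys.info()[["sysname"]],       # operating system name
                    Sys.info()[["release"]]),      # and release
  gui       = .Platform$GUI,                       # e.g. the console or IDE front end
  locale    = Sys.getlocale("LC_CTYPE"),           # character-type locale
  utf8      = as.character(l10n_info()[["UTF-8"]]) # is the session locale UTF-8?
)
print(env_report)
```

A non-UTF-8 locale is a common reason for Arabic text being displayed as escape codes in the console, so the last two entries are usually the most informative ones here.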

I don't have much experience with this, but I have no problems with Arabic characters when I try this:

```{r}
require(tm)
require(tm.plugin.webmining)
require(SnowballC)

corpus <- WebCorpus(GoogleNewsSource("سلام"))
corpus
inspect(corpus)

tdm <- TermDocumentMatrix(corpus)
```

Make sure the appropriate fonts are installed for your operating system and IDE.

```{r}
y <- dget("file")   # read the file extracted from MongoDB with the rmongodb package
a <- y$tweet_text   # keep only the text of the tweets in the dataset
text_df <- data.frame(a, stringsAsFactors = FALSE)   # store it as a data frame
myCorpus_df <- Corpus(DataframeSource(text_df))      # build a corpus from the data frame
```
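One caveat for anyone repeating this step today: as far as I know, recent releases of tm expect the data frame passed to `DataframeSource()` to have a first column named `doc_id` and a second column named `text`. The sketch below builds such a data frame from two made-up sample tweets (`tweets` and `docs` are illustrative names) and only calls tm if it is installed.

```{r}
# Hypothetical sample data standing in for y$tweet_text.
tweets <- c(
  "Ahrar al#Sham is clearly fighting #ISIS",
  "\u062c\u0628\u0647\u0629 \u0627\u0644\u0646\u0635\u0631\u0629"  # Arabic text via escapes
)

# Column layout assumed to be required by newer tm versions: doc_id first, text second.
docs <- data.frame(
  doc_id = as.character(seq_along(tweets)),
  text   = tweets,
  stringsAsFactors = FALSE
)

# Build the corpus only when tm is available, so the sketch runs either way.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus_from_df <- tm::VCorpus(tm::DataframeSource(docs))
  print(corpus_from_df)
}
```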

On OS X the Arabic characters are represented properly:

```{r} 
str(myCorpus_df[1:2]) 
``` 

List of 2 
$ 1:List of 2 
    ..$ content: chr "The CHRONICLE EYE Ahrar al#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings #Aleppo " 
    ..$ meta :List of 7 
    .. ..$ author  : chr(0) 
    .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18" 
    .. ..$ description : chr(0) 
    .. ..$ heading  : chr(0) 
    .. ..$ id   : chr "1" 
    .. ..$ language  : chr "en" 
    .. ..$ origin  : chr(0) 
    .. ..- attr(*, "class")= chr "TextDocumentMeta" 
    ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument" 


$ 2:List of 2 
    ..$ content: chr "RT @######## جبهة النصرة مهاجرينها وأنصارها مقراتها مكان آمن لكل من يخشى على نفسه الآذى " 
    ..$ meta :List of 7 
    .. ..$ author  : chr(0) 
    .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18" 
    .. ..$ description : chr(0) 
    .. ..$ heading  : chr(0) 
    .. ..$ id   : chr "2" 
    .. ..$ language  : chr "en" 
    .. ..$ origin  : chr(0) 
    .. ..- attr(*, "class")= chr "TextDocumentMeta" 
    ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument" 
- attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
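The documents above still carry Twitter markup (`RT`, `@` mentions, `#` signs). A possible first preprocessing pass, sketched in base R only, could look like this. `clean_tweet` is a made-up helper, and which tokens to drop is an assumption that depends on the analysis.

```{r}
# Hypothetical helper: strip common Twitter markup before any further processing.
clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)  # drop URLs
  x <- gsub("\\bRT\\b", " ", x, perl = TRUE)       # drop the retweet marker
  x <- gsub("@\\S+", " ", x, perl = TRUE)          # drop @mentions
  x <- gsub("#", " ", x, fixed = TRUE)             # keep hashtag words, drop the symbol
  x <- gsub("\\s+", " ", x, perl = TRUE)           # collapse runs of whitespace
  trimws(x)                                        # trim leading/trailing spaces
}

# Made-up sample in the style of the tweets shown above.
sample_tweet <- "RT @######## \u062c\u0628\u0647\u0629 #Aleppo http://t.co/abc"
cleaned <- clean_tweet(sample_tweet)
print(cleaned)
```

With tm, a function like this could be applied to a corpus through `tm_map()` wrapped in `content_transformer()`, but that part is not shown here.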

When I check the encoding of an Arabic word on both operating systems (OS X and Win 7), it appears to be well encoded:

```{r} 
Encoding("لمياه_و_الإصحا") 
``` 

[1] "UTF-8"
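`Encoding()` only reports how the string is marked, so it can be worth checking the actual content as well. A small sketch with base R follows; the word is written with `\u` escapes so that the encoding of the script file does not matter, and the variable names are illustrative.

```{r}
# An Arabic word spelled with Unicode escapes (independent of the file encoding).
word <- "\u0627\u0644\u0645\u064a\u0627\u0647"

enc_mark <- Encoding(word)    # declared encoding mark of the string
is_valid <- validUTF8(word)   # are the bytes valid UTF-8? (needs R >= 3.3.0)
cps      <- utf8ToInt(word)   # integer Unicode code points, one per character

# Check that every character falls in the basic Arabic block U+0600..U+06FF.
in_arabic_block <- all(cps >= 0x0600 & cps <= 0x06FF)

print(list(mark = enc_mark, valid = is_valid, arabic = in_arabic_block))
```

For text read from files on Windows, declaring the encoding at read time (for example `readLines(con, encoding = "UTF-8")`) or converting with `enc2utf8()` is usually the first thing to try.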

This may also help: Reading arabic data text in R and plot()
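On the normalization part of the question, here is a minimal sketch of what such a step might look like in base R. It is not a tested pipeline: `normalize_arabic` is a made-up function, and the rules (strip diacritics and tatweel, unify alef forms, map alef maqsura to ya and ta marbuta to ha) are one common convention that loses information, so they should be adapted to the task.

```{r}
# Hypothetical normalizer; the rule set is an assumption, not a standard.
normalize_arabic <- function(x) {
  x <- enc2utf8(x)                                              # work on UTF-8 strings
  x <- gsub("[\u064B-\u065F\u0670]", "", x, perl = TRUE)        # strip diacritics (tashkeel)
  x <- gsub("\u0640", "", x, fixed = TRUE)                      # strip tatweel (kashida)
  x <- gsub("[\u0622\u0623\u0625]", "\u0627", x, perl = TRUE)   # alef variants -> bare alef
  x <- gsub("\u0649", "\u064A", x, fixed = TRUE)                # alef maqsura -> ya
  x <- gsub("\u0629", "\u0647", x, fixed = TRUE)                # ta marbuta -> ha
  x
}

# A vocalized word with a hamza-below alef, and a word with tatweel and ta marbuta.
raw_words  <- c("\u0627\u0644\u0625\u0650\u0635\u0652\u062D\u064E\u0627\u062D",
                "\u062C\u0628\u0647\u0640\u0629")
norm_words <- normalize_arabic(raw_words)
print(norm_words)
```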

Source

2014-09-04 03:27:57

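Thank you so much for your help. I'm going to use my MacBook (Mac OS) today and see the results. I use the tm and SnowballC packages, but I have not used tm.plugin.webmining; I hope it helps. Is there anything else you would suggest trying for normalizing Arabic text? Have you used R for this successfully? This is for my thesis and my time is limited, so I just need to know whether anyone has done this kind of mining in R. I'll give it another week, and if I don't succeed I may switch to another language to be safe and finish my work before the deadline. Your reply is much appreciated. – cecilia 2014-09-04 11:59:53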

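Glad I could help a little :) Unfortunately, no, I don't have any experience normalizing Arabic text. I think it's a very interesting problem, and since this is your thesis I'd encourage you to recruit help from different fields. For example, you could visit the linguistics, machine learning, and Arabic chat rooms on Freenode IRC and tell people about the project you are working on, perhaps with a link to this question. You could also seek more help from Arabic language forums such as www.proz.com/forum/arabic-45.html – 2014-09-04 13:06:04

`corpus <- WebCorpus(GoogleNewsSource("سلام"))` – 2016-11-24 16:48:40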
