Scala: convert [Seq[String]] to [String]? (TF-IDF after lemmatization)

I am trying to learn Scala, and text mining in particular (lemmatization, TF-IDF matrix and LSA).

I have some texts that I want to lemmatize and then classify (LSA). I am using Spark on Cloudera.

So I used this Stanford CoreNLP function:

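import java.util.Properties
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

def plainTextToLemmas(text: String, stopWords: Set[String]): Seq[String] = {
    val props = new Properties()
    props.put("annotators", "tokenize, ssplit, pos, lemma")
    val pipeline = new StanfordCoreNLP(props)
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val lemmas = new ArrayBuffer[String]()
    // CoreNLP returns java.util.List values, so convert them with asScala before iterating
    val sentences = doc.get(classOf[SentencesAnnotation]).asScala
    for (sentence <- sentences; token <- sentence.get(classOf[TokensAnnotation]).asScala) {
        val lemma = token.get(classOf[LemmaAnnotation])
        // keep lemmas longer than two characters that are not stop words
        if (lemma.length > 2 && !stopWords.contains(lemma)) {
            lemmas += lemma.toLowerCase
        }
    }
    lemmas
}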

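After that, I tried to build the TF-IDF matrix, but here is my problem: the Stanford function gives me an RDD[Seq[String]], and the next step fails with an error. I need an RDD of [String], not of [Seq[String]].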

val (termDocMatrix, termIds, docIds, idfs) = termDocumentMatrix(lemmatizedText, stopWords, numTerms, sc)

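Does anyone know how to convert [Seq[String]] to [String]?

Or do I need to change one of my steps?

Thanks for your help. Sorry if this is a dumb question, and sorry for my English.

Bye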

Source

2017-07-16 So ode

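Sorry, I need to clarify my question. The lemmatization function produces an RDD of [Seq[String]], but I only need [String] for the TF-IDF. Do you know of a lemmatization function that returns [String]?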

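I don't know what this lemmatization thing is, but to make a string out of a sequence you can just do seq.mkString("\n") (or replace "\n" with whatever other separator you want), or just seq.mkString if you want the elements merged without any separator.

Also, don't use mutable structures; it is considered bad taste in Scala: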
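As a small illustration of the mkString variants mentioned above, here is a minimal sketch on plain Scala collections. The object name `MkStringDemo`, the sample lemmas and the helper `joinDocs` are made up for the example and are not part of the question's code.

```scala
object MkStringDemo {
  // One document's lemmas, as a lemmatizer returning Seq[String] might produce them (made-up data).
  val lemmas: Seq[String] = Seq("cat", "sit", "mat")

  // Join with an explicit separator.
  val withSpaces: String = lemmas.mkString(" ")     // "cat sit mat"
  val withNewlines: String = lemmas.mkString("\n")

  // Join with no separator at all: the words run together.
  val noSeparator: String = lemmas.mkString         // "catsitmat"

  // Hypothetical helper: turn each document's Seq[String] into a single String.
  def joinDocs(docs: Seq[Seq[String]]): Seq[String] =
    docs.map(_.mkString(" "))
}
```

On a Spark RDD the same call would go inside map, for example rdd.map(_.mkString(" ")) where rdd stands for the RDD[Seq[String]], which should give an RDD[String]. A space separator is probably the safer choice for TF-IDF, since the words can still be told apart afterwards.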

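// needs: import scala.collection.JavaConverters._
val lemmas = doc.get(classOf[SentencesAnnotation]).asScala
    .flatMap(_.get(classOf[TokensAnnotation]).asScala) // flatMap: one flat sequence of tokens
    .map(_.get(classOf[LemmaAnnotation]))
    .filter(_.length > 2)
    .filterNot(stopWords)
    .map(_.toLowerCase)
    .mkString(" ") // a space keeps the lemmas separable for TF-IDF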
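The chained, immutable style can be tried out without CoreNLP on plain collections. The following is only a sketch: `Token`, `Sentence` and `lemmasOf` are hypothetical stand-ins for the CoreNLP annotation objects, not library types. It shows that the per-sentence token lists have to be flattened with flatMap before the lemmas can be mapped and filtered. Unlike the question's code, it lower-cases before the stop-word check, which is a deliberate choice of this sketch.

```scala
object LemmaPipelineDemo {
  // Hypothetical stand-ins for the CoreNLP token and sentence annotations.
  final case class Token(lemma: String)
  final case class Sentence(tokens: Seq[Token])

  def lemmasOf(sentences: Seq[Sentence], stopWords: Set[String]): Seq[String] =
    sentences
      .flatMap(_.tokens)          // flatten the per-sentence token lists into one sequence
      .map(_.lemma.toLowerCase)   // lower-case first so the stop-word check ignores case
      .filter(_.length > 2)       // drop very short lemmas
      .filterNot(stopWords)       // a Set[String] can be used as a String => Boolean predicate
}
```

Calling mkString(" ") on the result would then give the single String per document that the question asks for.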

Source

2017-07-16 13:51:27 Dima
