Changing the delimiter in Apache Spark

I'm new to Apache Spark, and I'd like to read an XML file and count the number of words under each title. The XML file looks like this:

<title>first title</title> 
<words>there are seven words in this example</words> 
<title>second title</title> 
<words>there are more words here, ten words to be precise</words>

I'm writing my Spark job in Python, but when I call

sc.textFile("file://...")

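it automatically splits my file using the newline character (\n) as the delimiter. I'd like it to keep reading across lines instead, and start a new record only when it finds "<title>" again.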

The result I'd like to get looks like this:

first title: 7 
second title: 10

How can I do this?

Thanks in advance.

2017-09-26 disjunctive

Could you take a look at this: https://stackoverflow.com/questions/46408558/how-to-handle-multi-line-rows-in-spark/46410029#46410029

If you're working with XML files, I'd suggest giving spark-xml a try.

2017-09-26 13:38:10 Zouzias
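
If adding a package isn't an option, one way to get the behaviour described in the question is to change Hadoop's record delimiter instead of stitching lines back together afterwards. Below is a minimal sketch, assuming the file really is a flat sequence of <title>/<words> pairs like the sample and that `sc` is an existing SparkContext. `count_words` and `title_word_counts` are hypothetical helper names made up for this example; `textinputformat.record.delimiter` is the property read by Hadoop's TextInputFormat.

```python
import re

# Assumption: every record starts with this tag, as in the sample file.
DELIMITER = "<title>"

# Matches the remainder of one record once the leading "<title>" has been consumed.
_RECORD = re.compile(r"(.*?)</title>\s*<words>(.*?)</words>", re.DOTALL)


def count_words(record):
    """Turn one '<title>'-delimited chunk into (title, word_count), or None."""
    match = _RECORD.search(record)
    if match is None:
        return None  # e.g. an empty chunk before the first <title>
    title, words = match.groups()
    return title.strip(), len(words.split())


def title_word_counts(sc, path):
    """Read `path` using '<title>' as the record delimiter instead of a newline."""
    records = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"textinputformat.record.delimiter": DELIMITER},
    )
    # Each element is (byte offset, chunk); keep only the chunks that parse.
    return records.map(lambda kv: count_words(kv[1])).filter(lambda r: r is not None)
```

With the sample file, `title_word_counts(sc, "file://...").collect()` should give the pairs ("first title", 7) and ("second title", 10), which can then be formatted as "title: count". A regex is only reasonable here because the input is so regular; for real, nested XML a proper parser or spark-xml is the safer choice.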
