pyspark: find the number of tweets containing a word/hashtag

I'm analyzing a JSON file containing Twitter API data. I want to find out how many times a hashtag or a specific word appears in my dataset. I can get a list of the most common tweets with:

from pyspark.sql.functions import desc
df.groupby('text').count().sort(desc('count')).show()  # show() prints the table itself and returns None

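So I know that, for example, "Liverpool" is definitely a word in the data.

I just want to find how many times the word "Liverpool" appears in my dataset. Is this possible? Thanks.

I'm using Spark version 1.6.0.

The columns are named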

['_corrupt_record', 'contributors', 'coordinates', 'created_at', 'delete', 
'entities', 'favorite_count', 'favorited', 'filter_level', 'geo', 'id', 
'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 
'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 
'lang', 'place', 'possibly_sensitive', 'retweet_count', 'retweeted', 
'retweeted_status', 'scopes', 'source', 'text', 'truncated', 'user', 
'withheld_in_countries']


2017-05-09 MelesMeles

Can you give more details? Are you using Spark 2.0+? Do you already have the data in a DataFrame? What are your columns? – flyingmeatball

@flyingmeatball Yes, sorry. I'm using Spark version 1.6.0. The columns are ['_corrupt_record', 'contributors', 'coordinates', 'created_at', 'delete', 'entities', 'favorite_count', 'favorited', 'filter_level', 'geo', 'id', 'id_str', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'lang', 'place', 'source', 'text', 'truncated', 'user', 'withheld_in_countries'] – MelesMeles

I'm not sure whether this works in 1.6 (I'm using 2.1), but I would do something like:

from pyspark.sql.functions import col 

df.where(col('text').like("%Liverpool%")).count()


2017-05-09 17:59:11 flyingmeatball

Thanks! I needed to use like rather than isin, but you pointed me in the right direction: df.where(col('text').like("%Liverpool%")).count() – MelesMeles

@flyingmeatball Could you adjust the answer so it can be marked as accepted? – titipata

Updated above – flyingmeatball
