Time series trend detection using Spark and R

I am new to both R and Spark, but I am trying to build a scalable R application that detects which user queries are increasing or decreasing in frequency.

I have a Spark DataFrame containing data in the following format:

+-------+-----------------+-------------------------+
| user  | query           | query_time              |
+-------+-----------------+-------------------------+
| user1 | Hp tablet       | 2011-08-21T11:07:57.346 |
| user2 | Hp tablet       | 2011-08-21T22:22:32.599 |
| user3 | Hp tablet       | 2011-08-22T19:08:57.412 |
| user4 | hp laptop       | 2011-09-05T15:33:31.489 |
| user5 | Samsung LCD 550 | 2011-09-01T10:28:33.547 |
| user6 | memory stick    | 2011-09-06T17:15:42.852 |
| user7 | Castle          | 2011-08-28T22:06:37.618 |
+-------+-----------------+-------------------------+

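The dataset has millions of rows. I need some way to visualise that, for example, "hp tablet" is trending.

I have looked at some libraries (e.g. Breakout Detection, Anomaly Detection and this question) that could help me achieve this, but I don't know whether they play well with Spark. If they do, I could not find any examples of how to code it.

I am using R version 3.4.0 and SparkR version 2.1.0, running in a Zeppelin notebook.

Does anyone have any ideas? I am also open to any other approach. Thanks!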

%sql 
select * from temp_query

[Screenshot 1]

[Screenshot 2: created from the above]

Source

2017-06-05 Hannon César

%r
# Create a SparkR DataFrame (in SparkR 2.x, createDataFrame no longer needs sqlContext)
df_query <- createDataFrame(data.frame(
  query = c("Hp tablet", "Hp tablet", "Hp tablet", "hp laptop", "Samsung LCD 550 "),
  query_time = c("2011-08-21T11:07:57.346", "2011-08-21T22:22:32.599", "2011-08-22T19:08:57.412",
                 "2011-09-05T15:33:31.489", "2011-09-01T10:28:33.547")))

# Replace the "T" with a space so the string matches the timestamp format "yyyy-MM-dd HH:mm:ss"
df_query_1 <- select(df_query, df_query$query, regexp_replace(df_query$query_time, '(T)', ' '))
showDF(df_query_1)
# +----------------+--------------------------------+
# |           query|regexp_replace(query_time,(T),)|
# +----------------+--------------------------------+
# |       Hp tablet|            2011-08-21 11:07:...|
# |       Hp tablet|            2011-08-21 22:22:...|
# |       Hp tablet|            2011-08-22 19:08:...|
# |       hp laptop|            2011-09-05 15:33:...|
# |Samsung LCD 550 |            2011-09-01 10:28:...|
# +----------------+--------------------------------+

# Give the second column a readable name
df_query_1 <- rename(df_query_1, query_time = df_query_1[[2]])

# Register a temporary view (registerTempTable is deprecated since Spark 2.0)
createOrReplaceTempView(df_query_1, "temp_query")

Visualising the temp table as a bar chart

Source

2017-06-06 07:38:19

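Hi @Arun, thank you very much for your detailed answer. As I am new to Spark, it has already helped me understand a few things, but I don't think it answers my original question. I need to know which queries are becoming more popular over time, so the X axis should be a time series. I was thinking that maybe [this breakout detection library](https://github.com/twitter/BreakoutDetection) would do the trick, but I don't know how to use it with SparkR.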

To use the AnomalyDetection library, the data should be in this format:

head(raw_data)
                timestamp   count
14393 1980-10-05 13:53:00 149.801
14394 1980-10-05 13:54:00 151.492
14395 1980-10-05 13:55:00 151.724
14396 1980-10-05 13:56:00 153.776
14397 1980-10-05 13:57:00 150.481
14398 1980-10-05 13:58:00 146.638
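
As a rough sketch of how query logs could be brought into a timestamp/count shape like the one above, the base R snippet below counts queries per day. It uses the sample rows from the question; the column names `timestamp` and `count`, the daily granularity, and the lower-casing of query text are assumptions made for illustration, not something the library or the original poster prescribes. The "T" in the timestamps is the ISO 8601 date/time separator, so it can be parsed directly.

```r
# Sample rows copied from the question; everything else is illustrative.
queries <- data.frame(
  query = c("Hp tablet", "Hp tablet", "Hp tablet", "hp laptop", "Samsung LCD 550 "),
  query_time = c("2011-08-21T11:07:57.346", "2011-08-21T22:22:32.599",
                 "2011-08-22T19:08:57.412", "2011-09-05T15:33:31.489",
                 "2011-09-01T10:28:33.547"),
  stringsAsFactors = FALSE
)

# Assumption: queries that differ only in case or surrounding spaces are the same query.
queries$query <- tolower(trimws(queries$query))

# Parse the ISO 8601 string; "T" is matched literally, %OS reads fractional seconds.
ts <- as.POSIXct(queries$query_time, format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

# Count queries per (query, day); each row contributes 1 to its group.
daily <- aggregate(
  list(count = rep(1L, nrow(queries))),
  by = list(query = queries$query, timestamp = as.Date(ts)),
  FUN = sum
)

# Series for one query, ordered by time.
hp <- daily[daily$query == "hp tablet", c("timestamp", "count")]
hp <- hp[order(hp$timestamp), ]
print(hp)
```

On millions of rows the same grouping would presumably be done on the Spark side first, with only the much smaller aggregated table collected into R.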

If query_time is your X axis, how would you define the Y axis as a number? And in 2011-08-21T11:07:57.346, what does the T mean? Is the time 11:07:57.346? More clarification is needed.

Source
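
One possible answer to the Y-axis question, offered only as a sketch: take the number of times each query was issued per day as Y, and read the trend off the slope of a least-squares line through those counts. The daily counts below are made up for illustration (they are not from the question's data), `trend_slope` is a hypothetical helper, and a plain linear fit is a simple stand-in for the breakout or anomaly detection libraries mentioned in the thread.

```r
# Made-up daily counts for two queries (illustrative only).
daily <- data.frame(
  query = rep(c("hp tablet", "memory stick"), each = 6),
  day   = rep(as.Date("2011-08-21") + 0:5, times = 2),
  count = c(2, 3, 5, 6, 9, 11,   8, 7, 7, 5, 4, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: slope of count ~ day index, i.e. queries gained or lost per day.
trend_slope <- function(d) {
  d <- d[order(d$day), ]
  x <- as.numeric(d$day - min(d$day))   # days since the first observation
  unname(coef(lm(d$count ~ x))[2])
}

# One slope per query; a positive slope suggests growing popularity.
slopes <- sapply(split(daily, daily$query), trend_slope)
trending_up <- names(slopes)[slopes > 0]

print(round(slopes, 3))
print(trending_up)
```

A slope near zero says little on its own, so in practice some threshold or significance check would be needed before calling a query "trending".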

2017-06-07 05:26:51
