Apache Spark SQL從Cassandra中獲取數十億行數據？

我有以下代碼Apache Spark SQL從Cassandra中獲取數十億行數據？

我調用火花殼如下

./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864

代碼

scala> val df = spark.sql("SELECT test from hello") // Billion rows in hello and test column is 1KB 

df: org.apache.spark.sql.DataFrame = [test: binary] 

scala> df.count 

[Stage 0:> (0 + 2)/13] // I dont know what these numbers mean precisely.

如果我調用火花殼如下

./spark-shell --conf spark.cassandra.connection.host=170.99.99.134

代碼

val df = spark.sql("SELECT test from hello") // This has about billion rows 

scala> df.count 


[Stage 0:=> (686 + 2)/24686] // What are these numbers precisely?

這兩個版本都不起作用Spark永遠在運行，我一直在等待超過15分鐘而沒有響應。關於什麼可能是錯誤的以及如何解決這個問題的任何想法？

我使用星火2.0.2 和火花卡桑德拉 - connector_2.11-2.0.0-M3.jar

來源

2016-11-24 user1870400

Dataset.count是緩慢的，因爲它時，它涉及到外部數據源是不是很聰明。它改寫查詢作爲（這是好事）：

SELECT COUNT(1) FROM table

但不是推COUNT下它只是執行：

SELECT 1 FROM table

對源（它會獲取一個十億的人在你的情況）和然後在本地聚合以獲得最終結果。你看到的數字是任務計數器。

上有CassandraRDD一個優化cassandraCount操作：

sc.cassandraTable(keyspace, table).cassandraCount

更多關於服務器側操作可以在the documentation找到。

來源

2016-11-24 08:34:50 user6910411

Apache Spark SQL從Cassandra中獲取數十億行數據？

回答

相關問題