獲取列中唯一行數的最優化方法火花數據幀

我正在做這個MOOC以進行火花刷新，並遇到了這個問題「在先前創建的數據框中找到唯一主機的號碼」 Apache日誌分析）獲取列中唯一行數的最優化方法火花數據幀

數據幀看起來像這樣

--------------------+--------------------+------+------------+--------------------+ 
|    host|    path|status|content_size|    time| 
+--------------------+--------------------+------+------------+--------------------+ 
| in24.inetnebr.com |/shuttle/missions...| 200|  1839|1995-08-01 00:00:...| 
| uplherc.upl.com |     /| 304|   0|1995-08-01 00:00:...| 
| uplherc.upl.com |/images/ksclogo-m...| 304|   0|1995-08-01 00:00:...| 
| uplherc.upl.com |/images/MOSAIC-lo...| 304|   0|1995-08-01 00:00:...| 
| uplherc.upl.com |/images/USA-logos...| 304|   0|1995-08-01 00:00:...| 
|ix-esc-ca2-07.ix....|/images/launch-lo...| 200|  1713|1995-08-01 00:00:...| 
| uplherc.upl.com |/images/WORLD-log...| 304|   0|1995-08-01 00:00:...| 
|slppp6.intermind....|/history/skylab/s...| 200|  1687|1995-08-01 00:00:...| 
|piweba4y.prodigy....|/images/launchmed...| 200|  11853|1995-08-01 00:00:...| 
|slppp6.intermind....|/history/skylab/s...| 200|  9202|1995-08-01 00:00:...| 
|slppp6.intermind....|/images/ksclogosm...| 200|  3635|1995-08-01 00:00:...| 
|ix-esc-ca2-07.ix....|/history/apollo/i...| 200|  1173|1995-08-01 00:00:...| 
|slppp6.intermind....|/history/apollo/i...| 200|  3047|1995-08-01 00:00:...| 
| uplherc.upl.com |/images/NASA-logo...| 304|   0|1995-08-01 00:00:...| 
|  133.43.96.45 |/shuttle/missions...| 200|  10566|1995-08-01 00:00:...| 
|kgtyk4.kj.yamagat...|     /| 200|  7280|1995-08-01 00:00:...| 
|kgtyk4.kj.yamagat...|/images/ksclogo-m...| 200|  5866|1995-08-01 00:00:...| 
| d0ucr6.fnal.gov |/history/apollo/a...| 200|  2743|1995-08-01 00:00:...|

現在，我已經試過3種方法找到沒有唯一的主機

from pyspark.sql import functions as func 
unique_host_count = logs_df.agg(func.countDistinct(col("host"))).head()[0]

這個運行在約0.72秒

unique_host_count = logs_df.select("host").distinct().count()

這個運行在0.57秒

unique_host = logs_df.groupBy("host").count() 
unique_host_count = unique_host.count()

這個運行在0.62秒

所以我的問題是有沒有什麼比第二個更好的選擇，我認爲distinct是一個昂貴的操作，但它原來是最快的

The data frame I am using have 

1043177 rows 

spark version - 1.6.1 
cluster -6 gb memory

來源

2016-07-24 Bg1850

這不是低效的。因爲它確定每個分區的所有唯一值（不要忘記您的數據在多個節點中分割）。之後，它會比較這些值並在所有節點中選取所有不同的值，並且這很容易並行化。另一方面，當您對數據進行分組時，Spark會對數據進行混洗，而且這樣更加昂貴，因爲在大多數情況下，您必須密集使用網絡。

來源

2016-07-24 22:26:25

我懷疑這將導致洗牌也，但它是一個選項

logs_df.dropDuplicates("host").count

來源

2016-07-25 16:40:51 mark

您使用使用地圖&減少操作完成它：

unique_host_count = logs_df.select("host")\ 
.map(lambda x: (x, 1))\ 
.reduceByKey(lambda x, y: x+y)\ 
.count()

來源

2016-08-15 18:53:31 KiranM

獲取列中唯一行數的最優化方法火花數據幀

回答

相關問題