1

How can I get an RDD of distinct dicts in PySpark?

I have an RDD of dictionaries, and I would like to reduce it to only its distinct elements. However, when I try to call

rdd.distinct() 

PySpark gives me the following error:

TypeError: unhashable type: 'dict' 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) 
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
16/02/19 16:55:56 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func 
    File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func 
    File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func 
    File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/rdd.py", line 1776, in combineLocally 
    File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/pyspark.zip/pyspark/shuffle.py", line 238, in mergeValues 
    d[k] = comb(d[k], v) if k in d else creator(v) 
TypeError: unhashable type: 'dict' 

There is a key inside the dicts that I could use as the distinct element, but the documentation doesn't give any clues on how to solve this problem.

EDIT: The contents of the dicts consist of strings, arrays of strings, and numbers.

EDIT 2: An example of the dictionaries. I would like dicts whose "data_fingerprint" values are equal to be treated as equal:

{"id":"4eece341","data_fingerprint":"1707db7bddf011ad884d132bf80baf3c"} 

Thanks

+1

What exactly are the contents of the dictionaries? How would you like to hash them? – zero323

+0

Answered in the question – noli

+1

That is not enough. You need an exact strategy for comparing dicts. They are not hashable for two reasons: mutability and undefined order. In your case it is even worse, because they can contain unhashable values as well. So the question is: what makes two dicts the same for you? – zero323

Answers

2

As @zero323 pointed out in his comment, you have to decide how to compare dictionaries, since they are not hashable. One way is to sort the keys (as they are not in any particular order), e.g. lexicographically, and then build a string of the form:

def dict_to_string(d): 
    # sort the keys so equal dicts yield equal strings, then join as
    # 'key1|value1|key2|value2|...|keyn|valuen'
    return '|'.join('%s|%s' % (k, d[k]) for k in sorted(d)) 

If you have nested unhashable objects, you have to do this recursively, as in the sketch below.
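A minimal sketch of such a recursive version, assuming values are limited to strings, numbers, and lists, as described in the question's edits (the name dict_to_string_rec is illustrative):

    def dict_to_string_rec(obj): 
        # Serialize nested structures deterministically: dict keys are
        # sorted so that key order cannot affect the resulting string.
        if isinstance(obj, dict): 
            return '{' + '|'.join( 
                '%s|%s' % (k, dict_to_string_rec(obj[k])) for k in sorted(obj) 
            ) + '}' 
        if isinstance(obj, list): 
            return '[' + '|'.join(dict_to_string_rec(v) for v in obj) + ']' 
        return str(obj) 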

Now you can transform your RDD into key-value pairs, with the string (or some hash of it) as the key:

pairs = dictRDD.map(lambda d: (dict_to_string(d), d)) 

To get what you want, you then just reduce by key as follows:

distinctDicts = pairs.reduceByKey(lambda val1, val2: val1).values() 
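Put together, a hypothetical end-to-end run (the sample records are invented; the two dicts with the same content but different key order collapse into one):

    dictRDD = sc.parallelize([ 
        {"id": "a", "n": 1}, 
        {"n": 1, "id": "a"},   # same content, different key order 
        {"id": "b", "n": 2}, 
    ]) 
    pairs = dictRDD.map(lambda d: (dict_to_string(d), d)) 
    distinctDicts = pairs.reduceByKey(lambda val1, val2: val1).values() 
    print(distinctDicts.collect())  # two dicts survive, one per distinct content 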
1

Since your data provides a unique key, you can simply do something like this:

(rdd 
    .keyBy(lambda d: d.get("data_fingerprint")) 
    .reduceByKey(lambda x, y: x) 
    .values()) 
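A quick hypothetical check of this approach; the second record is invented to share the same fingerprint, so only one of the two survives:

    rdd = sc.parallelize([ 
        {"id": "4eece341", "data_fingerprint": "1707db7bddf011ad884d132bf80baf3c"}, 
        {"id": "9f01ab22", "data_fingerprint": "1707db7bddf011ad884d132bf80baf3c"}, 
    ]) 
    unique = (rdd 
        .keyBy(lambda d: d.get("data_fingerprint")) 
        .reduceByKey(lambda x, y: x) 
        .values()) 
    print(unique.count())  # 1 - both records share the same fingerprint 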

There are at least two problems with Python dictionaries that make them bad candidates for hashing:

  • mutability, which makes any hashing tricky

  • arbitrary key order

A while ago there was a PEP proposing frozendicts (PEP 0416), but it was ultimately rejected.
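A short plain-Python illustration of why this matters:

    d = {"a": 1} 
    try: 
        {d}  # dicts define no __hash__, so they cannot be set members 
    except TypeError as e: 
        print(e)  # unhashable type: 'dict' 

    # For flat dicts with hashable values, an immutable view works: 
    print({frozenset(d.items())}) 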
