2017-07-17 25 views
0

我想在Pyspark中定義一個UseDefinedFunction來處理數據幀的實例值,以便基於這些值生成新的屬性。如何通過數據框使用UserDefinedFunction來解決「Method __getnewargs __([])不存在」的錯誤?

我有這樣的代碼:

# I have a global attribute named 'global_dataframe' which is a dataframe containing some interesant instances. 

from pyspark.sql.functions import UserDefinedFunction 


def method1(instance, list_attribute_names): 

    if(instance['Att1'] != '-'): 
     return instance['Att1'] 
    else: 
     i = 0 
     result = "-" 
     query_SQL = "" 
     while(i < len(list_attribute_names)): 
      atr_imp = lista_2[i] 
      query_SQL = query_SQL + atr_imp + " = '" + instance[atr_imp] + "'," 
      i = i + 1 
     query_SQL = query_SQL[:-1] 

     # Here I filter the global_dataframe to get the results which are interesting according to the query generated before with the values of the instance passed to the method as a parameter 
     result_df = global_dataframe.filter(query_SQL) 
     if(result_df.head() != None):# If dataframe is empty 
      result = "None" 
     else: 
      result = query_SQL 
     return result 

def method0(df, important_attributes): 


    udf_func = UserDefinedFunction(lambda instance: method1(instance, important_attributes), StringType()) 
    column = udf_func(df) 


    df = df.withColumnRenamed("Att1", column) 
    return df 

當我執行這一行:

example = method0(dataframe_example, attribute_list_example) 

我得到了一個錯誤:

y4JError: An error occurred while calling o710.__getnewargs__. Trace: 
py4j.Py4JException: Method __getnewargs__([]) does not exist 
at 
py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) 
at 
py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) 
    at py4j.Gateway.invoke(Gateway.java:272) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:214) 
    at java.lang.Thread.run(Thread.java:748) 

這段代碼的想法是通過數據框執行method0並根據其屬性值獲取另一列。

我該如何解決這個錯誤?

回答

0

我認爲問題在於你正試圖序列化全局變量global_dataframe。執行者不應該嘗試調用DataFrame上的操作 - 這沒有任何意義。執行者的範圍僅在DataFrame(或RDD)的個別行上運行。

您可以通過評估global_dataframe是否爲空,並將參數is_empty傳遞給方法method1來解決此問題。

+0

所以,你建議傳遞一個參數is_empty來確保數據幀是否爲空? – jartymcfly

+0

那麼,是否有機會將lambda函數中的數據框實例傳遞給UDF,並在該函數中檢查該實例的不同屬性的值? – jartymcfly

+0

不,沒有辦法將'DataFrame'傳遞給UDF。 UDF只能在單個數據點上運行,而不能在整個數據集上運行。關鍵是你不能在執行程序執行的函數中調用'DataFrame'或'RDD'上的函數。 – timchap

相關問題