PySpark 1.5 & MSSQL JDBC

I'm using PySpark on Spark 1.5 on Cloudera YARN, with Python 3.3 on CentOS 6 machines. The SQL Server instance is SQL Server Enterprise 64-bit. The SQL Server driver is sqljdbc4.jar, and I have added the following to my .bashrc:

export SPARK_CLASSPATH="/var/lib/spark/sqljdbc4.jar" 
export PYSPARK_SUBMIT_ARGS="--conf spark.executor.extraClassPath="/var/lib/spark/sqljdbc4.jar" --driver-class-path="/var/lib/spark/sqljdbc4.jar" --jars="/var/lib/spark/sqljdbc4.jar" --master yarn --deploy-mode client"

I can see the confirmation when I start Spark:

SPARK_CLASSPATH was detected (set to '/var/lib/spark/sqljdbc4.jar')

I have a DataFrame whose schema looks like this:

root 
|-- daytetime: timestamp (nullable = true) 
|-- ip: string (nullable = true) 
|-- tech: string (nullable = true) 
|-- th: string (nullable = true) 
|-- car: string (nullable = true) 
|-- min_dayte: timestamp (nullable = true) 
|-- max_dayte: timestamp (nullable = true)

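I have created an empty table named 'dbo.shaping' in my MS SQL Server, in which the 3 timestamp columns are datetime2(7) and the other columns are nvarchar(50).

I try to export the DataFrame from PySpark using this: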

properties = {"user": "<username>", "password": "<password>"} 

df.write.format('jdbc').options(url='<IP>:1433/<dbname>', dbtable='dbo.shaping',driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",properties=properties)

I get the following error trace:

Py4JError: An error occurred while calling o250.option. Trace: 
py4j.Py4JException: Method option([class java.lang.String, class java.util.HashMap]) does not exist 
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) 
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) 
at py4j.Gateway.invoke(Gateway.java:252) 
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
at py4j.commands.CallCommand.execute(CallCommand.java:79) 
at py4j.GatewayConnection.run(GatewayConnection.java:207) 
at java.lang.Thread.run(Thread.java:744)

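Is my approach at least correct, and is this error perhaps related to exporting a specific type of data, i.e. do I have a problem with the data structure rather than with my code?

Source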

2016-02-26 PR102012

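You are reviving a question that is more than a year old. Have you verified that it is still a relevant problem (in the face of things like newer software versions)?

Software updates are not possible in this environment. It has to be a PySpark 1.5 solution. – PR102012

PySpark 1.5 is one thing, but the Microsoft JDBC driver for SQL Server has also gone through updates. Your error has all the hallmarks of a version mismatch between components, but it is not clear which components are at fault. I suggest explicitly listing the version numbers of everything you use (Python, PySpark, JDBC driver, SQL Server, OS); otherwise there is little hope of anyone reproducing it. (This is also why I doubt this is "widely applicable to a large audience", but I have no experience with PySpark.)

You cannot use a dict as a value for options. The options method expects only str arguments (Scala docs and PySpark annotations) and is expanded into separate calls to the Java option method.

In current Spark versions the value is automatically converted to a string, so your code would fail silently, but that isn't the case in 1.5.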
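The type restriction can be illustrated in plain Python, without Spark. This is only a sketch: `non_string_options` is a hypothetical helper, not part of PySpark, and it merely mimics the rule that every option value must be a string. It also shows how unpacking a dict with `**` turns its entries into top-level string options.

```python
def non_string_options(**options):
    """Return the names of options whose values are not plain strings.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    return sorted(name for name, value in options.items()
                  if not isinstance(value, str))


properties = {"user": "<username>", "password": "<password>"}

# Passing the dict as one option value: 'properties' is flagged. This
# corresponds to the option(String, HashMap) call that Py4J rejects.
nested = non_string_options(
    url="<IP>:1433/<dbname>",
    dbtable="dbo.shaping",
    properties=properties)

# Unpacking the dict: every value is now a top-level string option.
flat = non_string_options(
    url="<IP>:1433/<dbname>",
    dbtable="dbo.shaping",
    **properties)

print(nested)
print(flat)
```

In the first call the dict travels as a single value, which is the shape that triggers the `Method option([class java.lang.String, class java.util.HashMap]) does not exist` error in the trace.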

Since properties are specific to the JDBC driver anyway, you should use the jdbc method:

properties = {
    "user": "<username>", "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

df.write.jdbc(
    url='<IP>:1433/<dbname>',
    table='dbo.shaping',
    properties=properties)

though unpacking the properties should work as well:

.options(
    url='<IP>:1433/<dbname>',
    dbtable='dbo.shaping',
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    **properties)

In general, when you see:

py4j.Py4JException: Method ... does not exist

it usually signals a mismatch between the local Python types and the types expected by the JVM method in use.
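As a rough sketch, the jdbc-based write can be wrapped as follows. Two things here are assumptions that go beyond the thread. First, the URL in the posts is treated as a simplified placeholder; the Microsoft driver expects URLs of the form `jdbc:sqlserver://host:port;databaseName=db`. Second, `mode="append"` is used because the target table already exists, and the default save mode raises an error in that case. The helper names `sqlserver_jdbc_url` and `write_shaping` are hypothetical.

```python
def sqlserver_jdbc_url(host, database, port=1433):
    """Build a URL in the form the Microsoft JDBC driver expects.

    Hypothetical helper; host and database are placeholders.
    """
    return "jdbc:sqlserver://{0}:{1};databaseName={2}".format(
        host, port, database)


def write_shaping(df, host, database, user, password):
    """Append df to the existing dbo.shaping table via DataFrameWriter.jdbc."""
    properties = {
        "user": user,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }
    # Assumption: append, since dbo.shaping was created beforehand and the
    # default mode fails when the table is already present.
    df.write.jdbc(
        url=sqlserver_jdbc_url(host, database),
        table="dbo.shaping",
        mode="append",
        properties=properties)


print(sqlserver_jdbc_url("<IP>", "<dbname>"))
```

`write_shaping` needs a live DataFrame and a reachable SQL Server, so it is only defined here, not called.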

See also: How to use JDBC source to write and read data in (Py)Spark?

Source

2017-07-08 09:51:10 zero323

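I included 'user', 'password' and 'driver' in the properties, just as you have here. However, I now get the error 'Py4JJavaError: An error occurred while calling o230.jdbc. : java.sql.SQLException: No suitable driver found'. Could it be that, because I'm on YARN, the driver .jar referenced in the .bashrc of my Mgmt/Execution node is not in the same directory on every other non-master node? So when I use multiple nodes, some of them don't have this jar? – PR102012

The JDBC driver must be present on every worker node. Personally I would use the '--packages' option; in client mode, I think, you should be able to pass a local jar via '--jars'. – zero323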
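One way to act on the '--jars' suggestion is to assemble the submit arguments in Python before the SparkContext is created. This is a sketch under assumptions, not a verified fix for this cluster. `build_submit_args` is a hypothetical helper, the jar path is the one from the question, and the trailing `pyspark-shell` token is an assumption about how Spark 1.4+ launches the JVM from a plain Python process.

```python
import os

JAR = "/var/lib/spark/sqljdbc4.jar"  # path taken from the question


def build_submit_args(jar, master="yarn", deploy_mode="client"):
    """Assemble a PYSPARK_SUBMIT_ARGS value that ships one local jar.

    Hypothetical helper. --jars asks Spark to distribute the jar to the
    executors; --driver-class-path puts it on the driver's classpath.
    """
    parts = [
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--jars", jar,
        "--driver-class-path", jar,
        # Assumption: arguments must end with 'pyspark-shell' when PySpark
        # is started from a plain Python interpreter.
        "pyspark-shell",
    ]
    return " ".join(parts)


# Must be set before the SparkContext (and its JVM) is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(JAR)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

With '--jars', the jar only has to exist on the machine that launches the job, which addresses the concern about the file being missing from the other nodes.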
