
I have a problem with my Python script. It is supposed to convert my CSV file into a Parquet file, but when I run it I get this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o59.csv. : java.io.IOException: No FileSystem for scheme: null 

What is o59.csv? It is not my file...

Here is my script:

from pyspark import SparkContext 
from pyspark.sql import SQLContext 
from pyspark.sql import SparkSession 
from pyspark.sql.types import * 

spark = SparkSession.builder \ 
    .appName ("Convert into Parquet") \ 
    .config("spark.some.config.option", "some-value") \ 
    .getOrCreate() 


schema = StructType([ 
    StructField("date", DateType(),True), 
    StructField("semaine",IntegerType(),True), 
    StructField("annee",IntegerType(),True), 
    StructField("mois",IntegerType(),True)]) 

# read csv 
df = spark.read.csv('//home/sshuser3/calendrier.csv', header=True) 

# Displays the content of the DataFrame to stdout 
df = sqlContext.createDataFrame(rdd,schema) 


df.write.parquet('//home/sshuser3/outputParquet/calendrier.parquet') 

Do you have any suggestions for me?

The full output is:

spark-submit Convert.py 
SPARK_MAJOR_VERSION is set to 2, using Spark2 
17/08/11 15:01:58 INFO SparkContext: Running Spark version 2.1.0.2.6.0.10-29 
17/08/11 15:01:59 INFO SecurityManager: Changing view acls to: sshuser3 
17/08/11 15:01:59 INFO SecurityManager: Changing modify acls to: sshuser3 
17/08/11 15:01:59 INFO SecurityManager: Changing view acls groups to: 
17/08/11 15:01:59 INFO SecurityManager: Changing modify acls groups to: 
17/08/11 15:01:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sshuser3); groups with view permissions: Set(); users with modify permissions: Set(sshuser3); groups with modify permissions: Set() 
17/08/11 15:01:59 INFO Utils: Successfully started service 'sparkDriver' on port 44872. 
17/08/11 15:01:59 INFO SparkEnv: Registering MapOutputTracker 
17/08/11 15:01:59 INFO SparkEnv: Registering BlockManagerMaster 
17/08/11 15:01:59 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information 
17/08/11 15:01:59 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up 
17/08/11 15:01:59 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-cf215ce8-b19e-4270-9c62-57a37dea703b 
17/08/11 15:01:59 INFO MemoryStore: MemoryStore started with capacity 366.3 MB 
17/08/11 15:01:59 INFO SparkEnv: Registering OutputCommitCoordinator 
17/08/11 15:01:59 INFO log: Logging initialized @2107ms 
17/08/11 15:01:59 INFO Server: jetty-9.2.z-SNAPSHOT 
17/08/11 15:01:59 INFO Server: Started @2176ms 
17/08/11 15:01:59 INFO ServerConnector: Started [email protected]{HTTP/1.1}{0.0.0.0:4040} 
17/08/11 15:01:59 INFO Utils: Successfully started service 'SparkUI' on port 4040. 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/jobs,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/jobs/json,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/jobs/job,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/jobs/job/json,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/stages,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/stages/json,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/stages/stage,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/stages/stage/json,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/stages/pool,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/stages/pool/json,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/storage,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/storage/json,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/storage/rdd,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/storage/rdd/json,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/environment,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/environment/json,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/executors,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/executors/json,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/executors/threadDump,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/executors/threadDump/json,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/static,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/api,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/jobs/job/kill,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO ContextHandler: Started [email protected]{/stages/stage/kill,null,AVAILABLE,@Spark} 
17/08/11 15:01:59 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.18:4040 
17/08/11 15:02:00 INFO RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]... 
17/08/11 15:02:00 INFO RequestHedgingRMFailoverProxyProvider: Found active RM [rm2] 
17/08/11 15:02:00 INFO Client: Requesting a new application from cluster with 2 NodeManagers 
17/08/11 15:02:00 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (25600 MB per container) 
17/08/11 15:02:00 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 
17/08/11 15:02:00 INFO Client: Setting up container launch context for our AM 
17/08/11 15:02:00 INFO Client: Setting up the launch environment for our AM container 
17/08/11 15:02:00 INFO Client: Preparing resources for our AM container 
17/08/11 15:02:02 INFO Client: Uploading resource file:/usr/hdp/current/spark2-client/python/lib/pyspark.zip -> adl://home/user/sshuser3/.sparkStaging/application_1501619838490_0047/pyspark.zip 
17/08/11 15:02:03 INFO Client: Uploading resource file:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip -> adl://home/user/sshuser3/.sparkStaging/application_1501619838490_0047/py4j-0.10.4-src.zip 
17/08/11 15:02:03 INFO Client: Uploading resource file:/tmp/spark-813e8548-f7f4-4ff7-80f4-236e0e3b5ee2/__spark_conf__630655753844577263.zip -> adl://home/user/sshuser3/.sparkStaging/application_1501619838490_0047/__spark_conf__.zip 
17/08/11 15:02:03 INFO SecurityManager: Changing view acls to: sshuser3 
17/08/11 15:02:03 INFO SecurityManager: Changing modify acls to: sshuser3 
17/08/11 15:02:03 INFO SecurityManager: Changing view acls groups to: 
17/08/11 15:02:03 INFO SecurityManager: Changing modify acls groups to: 
17/08/11 15:02:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sshuser3); groups with view permissions: Set(); users with modify permissions: Set(sshuser3); groups with modify permissions: Set() 
17/08/11 15:02:03 INFO Client: Submitting application application_1501619838490_0047 to ResourceManager 
17/08/11 15:02:03 INFO YarnClientImpl: Submitted application application_1501619838490_0047 
17/08/11 15:02:03 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1501619838490_0047 and attemptId None 
17/08/11 15:02:04 INFO Client: Application report for application_1501619838490_0047 (state: ACCEPTED) 
17/08/11 15:02:04 INFO Client: 
    client token: N/A 
    diagnostics: AM container is launched, waiting for AM container to Register with RM 
    ApplicationMaster host: N/A 
    ApplicationMaster RPC port: -1 
    queue: default 
    start time: 1502463723834 
    final status: UNDEFINED 
    tracking URL: http://hn1-pzhdla.stikrkashgqefoqg3xnv2chu0d.fx.internal.cloudapp.net:8088/proxy/application_1501619838490_0047/ 
    user: sshuser3 
17/08/11 15:02:05 INFO Client: Application report for application_1501619838490_0047 (state: ACCEPTED) 
17/08/11 15:02:06 INFO Client: Application report for application_1501619838490_0047 (state: ACCEPTED) 
17/08/11 15:02:07 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null) 
17/08/11 15:02:07 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> hn0-pzhdla.stikrkashgqefoqg3xnv2chu0d.fx.internal.cloudapp.net,hn1-pzhdla.stikrkashgqefoqg3xnv2chu0d.fx.internal.cloudapp.net, PROXY_URI_BASES -> http://hn0-pzhdla.stikrkashgqefoqg3xnv2chu0d.fx.internal.cloudapp.net:8088/proxy/application_1501619838490_0047,http://hn1-pzhdla.stikrkashgqefoqg3xnv2chu0d.fx.internal.cloudapp.net:8088/proxy/application_1501619838490_0047), /proxy/application_1501619838490_0047 
17/08/11 15:02:07 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter 
17/08/11 15:02:07 INFO Client: Application report for application_1501619838490_0047 (state: RUNNING) 
17/08/11 15:02:07 INFO Client: 
    client token: N/A 
    diagnostics: N/A 
    ApplicationMaster host: 10.0.0.7 
    ApplicationMaster RPC port: 0 
    queue: default 
    start time: 1502463723834 
    final status: UNDEFINED 
    tracking URL: http://hn1-pzhdla.stikrkashgqefoqg3xnv2chu0d.fx.internal.cloudapp.net:8088/proxy/application_1501619838490_0047/ 
    user: sshuser3 
17/08/11 15:02:07 INFO YarnClientSchedulerBackend: Application application_1501619838490_0047 has started running. 
17/08/11 15:02:07 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37070. 
17/08/11 15:02:07 INFO NettyBlockTransferService: Server created on 10.0.0.18:37070 
17/08/11 15:02:07 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 
17/08/11 15:02:07 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.18, 37070, None) 
17/08/11 15:02:07 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.18:37070 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.18, 37070, None) 
17/08/11 15:02:07 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.18, 37070, None) 
17/08/11 15:02:07 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.18, 37070, None) 
17/08/11 15:02:08 INFO ContextHandler: Started [email protected]{/metrics/json,null,AVAILABLE,@Spark} 
17/08/11 15:02:08 INFO EventLoggingListener: Logging events to adl:///hdp/spark2-events/application_1501619838490_0047 
17/08/11 15:02:10 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.0.0.7:51484) with ID 1 
17/08/11 15:02:10 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.7:40909 with 3.0 GB RAM, BlockManagerId(1, 10.0.0.7, 40909, None) 
17/08/11 15:02:12 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.0.0.4:42318) with ID 2 
17/08/11 15:02:12 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.4:46697 with 3.0 GB RAM, BlockManagerId(2, 10.0.0.4, 46697, None) 
17/08/11 15:02:12 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8 
17/08/11 15:02:12 INFO SharedState: Warehouse path is 'file:/home/sshuser3/spark-warehouse/'. 
17/08/11 15:02:12 INFO ContextHandler: Started [email protected]{/SQL,null,AVAILABLE,@Spark} 
17/08/11 15:02:12 INFO ContextHandler: Started [email protected]{/SQL/json,null,AVAILABLE,@Spark} 
17/08/11 15:02:12 INFO ContextHandler: Started [email protected]{/SQL/execution,null,AVAILABLE,@Spark} 
17/08/11 15:02:12 INFO ContextHandler: Started [email protected]{/SQL/execution/json,null,AVAILABLE,@Spark} 
17/08/11 15:02:12 INFO ContextHandler: Started [email protected]{/static/sql,null,AVAILABLE,@Spark} 
17/08/11 15:02:12 WARN DataSource: Error while looking for metadata directory. 
Traceback (most recent call last): 
    File "/home/sshuser3/Convert.py", line 22, in <module> 
    df = spark.read.csv('//home/sshuser3/calendrier.csv',header = True) 
    File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 380, in csv 
    File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ 
    File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco 
    File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o59.csv. 
: java.io.IOException: No FileSystem for scheme: null 
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2786) 
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2793) 
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) 
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2829) 
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2811) 
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390) 
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) 
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372) 
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370) 
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) 
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) 
    at scala.collection.immutable.List.foreach(List.scala:381) 
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) 
    at scala.collection.immutable.List.flatMap(List.scala:344) 
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) 
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 
    at py4j.Gateway.invoke(Gateway.java:280) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:214) 
    at java.lang.Thread.run(Thread.java:748) 

There are more logs after that, but I have not included them here.


FaigB suggested that the problem might be the call to the createDataFrame function, but I don't think that matters: the problem appears before that function is reached. Here is the log, where we can see that the error is raised right after read.csv:

Traceback (most recent call last): 
    File "/home/sshuser3/Convert.py", line 21, in <module> 
    df = spark.read.csv('//home/sshuser3/calendrier.csv', header=True) 
    File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 380, in csv 
    File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ 
    File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco 
    File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o59.csv. 
: java.io.IOException: No FileSystem for scheme: null 
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2786) 
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2793) 
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) 
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2829) 
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2811) 
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390) 
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) 
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372) 
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370) 
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) 
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) 
    at scala.collection.immutable.List.foreach(List.scala:381) 
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) 
    at scala.collection.immutable.List.flatMap(List.scala:344) 
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370) 
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152) 
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 
    at py4j.Gateway.invoke(Gateway.java:280) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:214) 
    at java.lang.Thread.run(Thread.java:748) 
+0

Is it possible that some node in the Spark cluster has a filesystem problem (on its local filesystem or at the HDFS layer)? Maybe o59.csv is a partition/block of calendrier.csv that was distributed to some node, and the process running on that node hit an I/O error? –

+0

I must admit I am shooting from the hip here, but I would start by updating Java, and perhaps by making sure the script is run with the right Python version (2.7 vs 3.x) and that that Python version is compatible with your current Java version. Just some thoughts... – Zeth

+0

Try 'df.write.parquet("file:///home/sshuser....")' if it is a path on your local FS, or 'hdfs:///home/...' if it is on HDFS – philantrovert

Answers

0

After some research, I found my problem: my path was not written correctly. I have to use file:/<my path> instead of //<my path>.

So this post can be closed. Thank you for your answers.
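
For reference, here is a minimal sketch of what the corrected read and write calls could look like, assuming calendrier.csv really lives on the local filesystem of the node (if the file were on HDFS, an hdfs:// path would be used instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Convert into Parquet") \
    .getOrCreate()

# read the CSV with an explicit file: scheme so Hadoop can resolve the filesystem
df = spark.read.csv('file:///home/sshuser3/calendrier.csv', header=True)

# write the result as Parquet, again with an explicit scheme
df.write.parquet('file:///home/sshuser3/outputParquet/calendrier.parquet')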

0

I would suggest double-checking this snippet of your code:

# Displays the content of the DataFrame to stdout 
df = sqlContext.createDataFrame(rdd,schema) 

It looks like you should first convert your DataFrame into an RDD and then map it onto the schema you built, using

rdd = df.rdd 

I did a small experiment:

# read the csv file
df = spark.read.csv('/<path_to_csv>', header=True)

# cast specific columns to the right types, because the loaded data is all strings (and has a unicode prefix)
df = df.select(df.<column_name>.cast('timestamp'), df.<column_name>.cast('int'), df.<column_name>.cast('int'), df.<column_name>.cast('int'))

# create a dataframe using the schema
dt = spark.createDataFrame(df.rdd, schema)

# write as parquet
dt.write.parquet('/path_to_parquet_file')
+0

Does this really work? Because, as I said, the problem seems to come from my df = spark.read.csv call. After that I can't do anything. I tried to print something to see it appear in the logs, but it doesn't work... –

+0

Yes, it works. The tedious part is casting the columns to the required types. – FaigB

+0

OK, I tried to do that but it doesn't work for me... I don't know where the problem in the script is.... –