2017-09-13 38 views

Empty txt files saved by Apache Spark in Scala

I have a dataframe as follows (I have posted part of it) and I need to save it to a txt file. However, when I do, Spark saves a large number of empty files and shows the message below in the log file. I should mention that I am using macOS and IntelliJ IDEA. Can you help me find my mistake? Thanks.

+---------------+-----------+-------------+-----+----+---+--------------------+------------------+---+---+---+---+--------------------+--------------------+-------------------+-----------------+------+
|   time_stamp_0|sender_ip_1|receiver_ip_2|count|rank| xi|                  pi|                 r|ip5|ip4|ip3|ip2|            variance|             entropy|     pre_chi_square| total_chi_square|attack|
+---------------+-----------+-------------+-----+----+---+--------------------+------------------+---+---+---+---+--------------------+--------------------+-------------------+-----------------+------+
|09:06:41.053816|   10.0.0.5|     10.0.0.1|  297|   1| 20|0.003367003367003367|0.8855218855218855| 20| 13|  1|263|4.412538280964408E-5| 0.01917081528216397| 16.055555555555557|64.22222222222223|     1|
|09:06:41.565362|   10.0.0.5|     10.0.0.1|  297|   2| 20|0.006734006734006734|0.8855218855218855| 20| 13|  1|263|0.004182025143605029| 0.03367397278277949| 14.222222222222221|64.22222222222223|     1|
|09:06:41.570799|   10.0.0.5|     10.0.0.1|  297|   3| 20|0.010101010101010102|0.8855218855218855| 20| 13|  1|263|0.015053931638407148|0.046415352021561516|               12.5|64.22222222222223|     1|
|09:06:42.093127|   10.0.0.5|     10.0.0.1|  297|   4| 20|0.013468013468013467|0.8855218855218855| 20| 13|  1|263|   0.032659844867216|0.058012630002462075|  10.88888888888889|64.22222222222223|     1|
|09:06:42.617228|   10.0.0.5|     10.0.0.1|  297|   5| 20|0.016835016835016835|0.8855218855218855| 20| 13|  1|263| 0.05699976483003157| 0.06875916206007743|   9.38888888888889|64.22222222222223|     1|
|09:06:43.141217|   10.0.0.5|     10.0.0.1|  297|   6| 20|0.020202020202020204|0.8855218855218855| 20| 13|  1|263| 0.08807369152685389| 0.07882773069847768|                8.0|64.22222222222223|     1|
|09:06:43.665672|   10.0.0.5|     10.0.0.1|  297|   7| 20| 0.02356902356902357|0.8855218855218855| 20| 13|  1|263| 0.12588162495768296| 0.08833250480886096|  6.722222222222222|64.22222222222223|     1|
|09:06:44.189268|   10.0.0.5|     10.0.0.1|  297|   8| 20|0.026936026936026935|0.8855218855218855| 20| 13|  1|263| 0.17042356512251874| 0.09735462887873032|  5.555555555555555|64.22222222222223|     1|
|09:06:44.192995| 

The output is shown below (the _SUCCESS file is created, but the other files are empty). [Screenshot of the output directory.] The message in the log file:

17/09/13 10:42:50 INFO ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema: 
{ 
    "type" : "struct", 
    "fields" : [ { 
    "name" : "time_stamp_0", 
    "type" : "string", 
    "nullable" : true, 
    "metadata" : { } 
    }, { 
    "name" : "sender_ip_1", 
    "type" : "string", 
    "nullable" : true, 
    "metadata" : { } 
    }, { 
    "name" : "receiver_ip_2", 
    "type" : "string", 
    "nullable" : true, 
    "metadata" : { } 
    }, { 
    "name" : "count", 
    "type" : "long", 
    "nullable" : false, 
    "metadata" : { } 
    }, { 
    "name" : "rank", 
    "type" : "integer", 
    "nullable" : true, 
    "metadata" : { } 
    }, { 
    "name" : "xi", 
    "type" : "long", 
    "nullable" : false, 
    "metadata" : { } 
    }, { 
    "name" : "pi", 
    "type" : "double", 
    "nullable" : true, 
    "metadata" : { } 
    }, { 
    "name" : "r", 
    "type" : "double", 
    "nullable" : false, 
    "metadata" : { } 
    }, { 
    "name" : "ip5", 
    "type" : "long", 
    "nullable" : false, 
    "metadata" : { } 
    }, { 
    "name" : "ip4", 
    "type" : "long", 
    "nullable" : false, 
    "metadata" : { } 
    }, { 
    "name" : "ip3", 
    "type" : "long", 
    "nullable" : false, 
    "metadata" : { } 
    }, { 
    "name" : "ip2", 
    "type" : "long", 
    "nullable" : false, 
    "metadata" : { } 
    }, { 
    "name" : "variance", 
    "type" : "double", 
    "nullable" : true, 
    "metadata" : { } 
    }, { 
    "name" : "entropy", 
    "type" : "double", 
    "nullable" : true, 
    "metadata" : { } 
    }, { 
    "name" : "pre_chi_square", 
    "type" : "double", 
    "nullable" : true, 
    "metadata" : { } 
    }, { 
    "name" : "total_chi_square", 
    "type" : "double", 
    "nullable" : false, 
    "metadata" : { } 
    }, { 
    "name" : "attack", 
    "type" : "integer", 
    "nullable" : false, 
    "metadata" : { } 
    } ] 
} 
and corresponding Parquet message type: 
message spark_schema { 
    optional binary time_stamp_0 (UTF8); 
    optional binary sender_ip_1 (UTF8); 
    optional binary receiver_ip_2 (UTF8); 
    required int64 count; 
    optional int32 rank; 
    required int64 xi; 
    optional double pi; 
    required double r; 
    required int64 ip5; 
    required int64 ip4; 
    required int64 ip3; 
    required int64 ip2; 
    optional double variance; 
    optional double entropy; 
    optional double pre_chi_square; 
    required double total_chi_square; 
    required int32 attack; 
} 

Here is my code:

final_dataframe.write.save("/Users/saeedtkh/Desktop/Testoutput") 
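For context on the log output above: `DataFrameWriter.save` with no explicit `format(...)` writes Parquet by default (hence the `ParquetWriteSupport` lines), and it writes one `part-*` file per partition of the dataframe, so empty partitions come out as empty files. A minimal sketch of writing plain text instead, collapsing to one partition first (the session setup and the sample dataframe are illustrative, not from the original code; Spark 2.x API):

```scala
import org.apache.spark.sql.SparkSession

object WriteTextSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")              // run locally for illustration
      .appName("write-text-sketch")
      .getOrCreate()

    // Stand-in for final_dataframe
    val df = spark.range(10).toDF("n")

    // text output requires a single string column, so cast/concatenate first.
    // coalesce(1) collapses the data to one partition, producing one part file.
    df.selectExpr("cast(n as string)")
      .coalesce(1)
      .write
      .mode("overwrite")
      .text("/Users/saeedtkh/Desktop/Testoutput")

    spark.stop()
  }
}
```

The output path will still be a directory containing a `_SUCCESS` marker and a single `part-*` file; that part file holds the text.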
Have you tried using `coalesce` when writing the file? – Ramesh Maharjan

@RameshMaharjan: Thank you for answering, my friend. What is that?? I don't think so... – Queen

If you need only one output file, you can add `.coalesce(1)`, as in `final_add_count_rank_xi_pi_r_attack.coalesce(1).write.save("/Users/saeedtkh/Desktop/Testoutput")`. If you want more files, you can increase that number. :) – Ramesh Maharjan

Answers

Following answer number one, the output was still saved as empty files, but the number of files could be reduced. I used the Databricks CSV library, which writes the output as CSV files; since the file was saved in CSV format, I changed the extension to txt. Thanks to the first answer, I reduced the number of output files. Here is the code I used:

sqlContext.setConf("spark.sql.shuffle.partitions", "1")
final_add_count_rank_xi_pi_r_attack.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .save("/Users/saeedtkh/Desktop/Testoutput")

Is your total number of output files close to 200?

Try setting the default shuffle.partitions to a smaller number, for example:

sqlContext.setConf("spark.sql.shuffle.partitions", "5")
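The 200 in this answer is not arbitrary: `spark.sql.shuffle.partitions` defaults to 200, and any shuffle in the query (a `groupBy`, a join, or the window used for `rank`) repartitions the data into that many tasks. With a small dataset most of those partitions are empty, yet each one still produces a `part-*` file on write. A sketch combining both answers, reusing the names and path from the question (values are illustrative):

```scala
// Lower the shuffle parallelism before the aggregations run, so the
// shuffled dataframe has only a few partitions and few output files.
sqlContext.setConf("spark.sql.shuffle.partitions", "5")

final_add_count_rank_xi_pi_r_attack
  .coalesce(1)                          // optional: force exactly one part file
  .write
  .format("com.databricks.spark.csv")   // spark-csv options are string-valued
  .option("header", "true")
  .save("/Users/saeedtkh/Desktop/Testoutput")
```

Note that `setConf` only affects shuffles executed after it is set, so it must run before the transformations that build the dataframe are evaluated, not just before `write`.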