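Here are a few options I can think of, since the Databricks spark-csv module does not seem to provide an option for skipping lines:
Option one: add a "#" character in front of the first line. The line is then treated as a comment and ignored by the Databricks CSV module.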
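As a rough illustration of option one, the first line can be prefixed with the comment marker before the file is handed to the CSV reader. This is only a sketch: the helper names `commentOutFirstLine` and `commentOutFirstLineInFile` are hypothetical, and it assumes the file is small enough to read into memory on the driver.

```scala
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Hypothetical helper: prefix the first line with the comment marker.
def commentOutFirstLine(lines: Seq[String], marker: String = "#"): Seq[String] =
  lines match {
    case first +: rest => (marker + first) +: rest
    case _             => lines // empty input: nothing to change
  }

// Hypothetical helper: rewrite a small local file in place.
// Assumption: the whole file fits in memory.
def commentOutFirstLineInFile(path: String): Unit = {
  val p = Paths.get(path)
  val updated = commentOutFirstLine(Files.readAllLines(p).asScala.toList)
  Files.write(p, updated.asJava)
}
```

For a large or distributed file, rewriting it this way is probably not practical, and one of the other options is likely a better fit.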
Option two: create a custom schema and set the mode option to DROPMALFORMED. This drops the first line, because it contains fewer tokens than customSchema expects:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
// Schema that every data row is expected to match.
val customSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))
// DROPMALFORMED discards lines that cannot be parsed against the schema.
val df = sqlContext.read.format("com.databricks.spark.csv").
  option("header", "true").
  option("mode", "DROPMALFORMED").
  schema(customSchema).load("test.txt")
df.show
16/06/12 21:24:05 WARN CsvRelation$: Number format exception. Dropping malformed line: id,name,age
+---+----+---+
| id|name|age|
+---+----+---+
|  1| abc| 12|
|  2| bcd| 33|
+---+----+---+
Note the warning message above, which reports that a malformed line was dropped.
Option three: write your own parser that drops any line that does not have exactly three fields:
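import sqlContext.implicits._ // needed for toDF outside spark-shell
val file = sc.textFile("pathToYourCsvFile")
// Keep only three-field lines, and skip the header line itself.
val df = file.map(line => line.split(",")).
  filter(fields => fields.length == 3 && fields(0) != "id").
  map(row => (row(0), row(1), row(2))).
  toDF("id", "name", "age")
df.show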
+---+----+---+
| id|name|age|
+---+----+---+
|  1| abc| 12|
|  2| bcd| 33|
+---+----+---+
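The filter used in option three can be tried without a cluster by applying the same predicate to a plain Scala collection. This is a minimal sketch: the sample lines are made up, and `isDataRow` is a hypothetical name for the predicate.

```scala
// Made-up sample input: one junk line, the header, and two data rows.
val sampleLines = Seq("some junk line", "id,name,age", "1,abc,12", "2,bcd,33")

// Same predicate as the Spark version: exactly three fields, and not the header.
def isDataRow(fields: Array[String]): Boolean =
  fields.length == 3 && fields(0) != "id"

val rows = sampleLines
  .map(_.split(","))
  .filter(isDataRow)
  .map(f => (f(0), f(1), f(2)))

rows.foreach(println)
```

One thing to watch for: `split(",")` discards trailing empty fields, so a row such as `3,cde,` splits into two fields and would also be filtered out. Using `split(",", -1)` keeps the empty field if such rows should survive. Quoted fields that contain commas are not handled by this approach either.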