2013-03-19 135 views
1

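Optimizing the denormalization of a CSV file

I am reading a CSV file to store it in an immutable data structure. Each row is an entrance. Each entrance belongs to one station. Each station can have multiple entrances. Is there a way to do this in a single pass, instead of the double pass you see below?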

object NYCSubwayEntrances {
  def main(args: Array[String]) = {
    import com.github.tototoshi.csv.CSVReader
    //http://www.mta.info/developers/data/nyct/subway/StationEntrances.csv
    val file = new java.io.File("StationEntrances.csv")
    val reader = CSVReader.open(file)
    reader.readNext //consume headers
    val entranceMap = list2multimap(
      reader.all map {
        case fields: List[String] =>
          // println(fields)
          (
            fields(2),
            Entrance(
              fields(14).toBoolean,
              Option(fields(15)),
              fields(16).toBoolean,
              fields(17),
              fields(18) match {case "YES" => true case _ => false},
              fields(19) match {case "YES" => true case _ => false},
              fields(20),
              fields(21),
              fields(22),
              fields(23),
              fields(24).toInt,
              fields(25).toInt
            )
          )
      }
    )
    reader.close
    val reader2 = CSVReader.open(file)
    reader2.readNext //consume headers
    val stations = reader2.all map { case fields: List[String] =>
      Station(
        fields(2),
        fields(0),
        fields(1),
        colate(scala.collection.immutable.ListSet[String](
          fields(3),
          fields(4),
          fields(5),
          fields(6),
          fields(7),
          fields(8),
          fields(9),
          fields(10),
          fields(11),
          fields(12),
          fields(13)
        )),
        entranceMap(fields(2)).toList
      )
    }
    reader2.close

    import net.liftweb.json._
    import net.liftweb.json.Serialization.write
    implicit val formats = Serialization.formats(NoTypeHints)
    println(pretty(render(parse(write(stations.toSet)))))
  }

  import scala.collection.mutable.{HashMap, Set, MultiMap}

  def list2multimap[A, B](list: List[(A, B)]) =
    list.foldLeft(new HashMap[A, Set[B]] with MultiMap[A, B]){(acc, pair) => acc.addBinding(pair._1, pair._2)}

  def colate(set: scala.collection.immutable.ListSet[String]): List[String] =
    ((List[String]() ++ set) diff List("")).reverse
}

case class Station(name: String, division: String, line: String, routes: List[String], entrances: List[Entrance]) {} 
case class Entrance(ada: Boolean, adaNotes: Option[String], freeCrossover: Boolean, entranceType: String, entry: Boolean, exitOnly: Boolean, entranceStaffing: String, northSouthStreet: String, eastWestStreet: String, corner: String, latitude: Integer, longitude: Integer) {} 

An SBT project with all the right dependencies can be found at https://github.com/AEtherSurfer/NYCSubwayEntrances

The CSV file was found via http://www.mta.info/developers/sbwy_entrance.html

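The input file appears to be sorted by `Division, Line, Station_Name`. Is that an assumption you are willing to rely on? It can speed up processing and reduce memory requirements (although the file does not look very large). – huynhjl 2013-03-19 14:41:35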

Yes, I am willing to assume the CSV is sorted by `Division, Line, Station_Name`. – Gabriel 2013-03-19 16:29:14

Answers

1

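I have the following snippets for StationEntrances.csv. This first solution uses groupBy to group the entrances that belong to the same station. It does not assume the rows are sorted. Although it reads the file only once, it does make three passes over the data (one to read everything into memory, one for the groupBy, and one to create the stations). See the code for the Row extractor at the end.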

val stations = {
  val file = new java.io.File("StationEntrances.csv")
  val reader = com.github.tototoshi.csv.CSVReader.open(file)
  val byStation = reader
    .all     // read all rows into memory
    .drop(1) // drop header
    .groupBy {
      case List(division, line, station, _*) => (division, line, station)
    }
  reader.close
  byStation.values.toList map { rows =>
    val entrances = rows map { case Row(_, _, _, _, entrance) => entrance }
    rows.head match {
      case Row(division, line, station, routes, _) =>
        // argument order follows the question's Station(name, division, line, ...)
        Station(
          station, division, line,
          routes.toList.filter(_ != ""),
          entrances)
    }
  }
}
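To see the grouping step in isolation, here is a minimal sketch that runs on in-memory rows instead of the real file. The names `GroupBySketch` and `SimpleStation` and the four-column row layout are assumptions made for illustration; the real file has 26 columns.

```scala
object GroupBySketch {
  // Hypothetical, simplified station: only the entrance corner is kept per entrance.
  final case class SimpleStation(
    division: String, line: String, name: String, corners: List[String])

  // Each row is assumed to be List(division, line, station, corner).
  def group(rows: List[List[String]]): Set[SimpleStation] =
    rows
      .groupBy(r => (r(0), r(1), r(2))) // key on division, line and station name
      .map { case ((division, line, name), rs) =>
        // rs keeps the rows of one station in their original relative order
        SimpleStation(division, line, name, rs.map(_(3)))
      }
      .toSet
}
```

The rows do not need to be sorted for this to work. Note that the Map returned by groupBy has no defined iteration order, so the stations will generally not come out in file order.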

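This solution assumes the rows are sorted and should be faster, because it makes only one pass and builds the result list while the file is being read.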

val stations2 = {
  import collection.mutable.ListBuffer
  def processByChunk(iter: Iterator[Seq[String]], acc: ListBuffer[Station])
      : List[Station] = {
    if (!iter.hasNext) acc.toList
    else {
      val head = iter.next
      val marker = head.take(3) // division, line, station
      // rows: the remaining entrances of this station; rest: all later stations
      val (rows, rest) = iter.span(_ startsWith marker)
      val entrances = (head :: rows.toList) map {
        case Row(_, _, _, _, entrance) => entrance
      }
      val station = head match {
        case Row(division, line, station, routes, _) =>
          // argument order follows the question's Station(name, division, line, ...)
          Station(
            station, division, line,
            routes.toList.filter(_ != ""),
            entrances)
      }
      processByChunk(rest, acc += station)
    }
  }
  val file = new java.io.File("StationEntrances.csv")
  val reader = com.github.tototoshi.csv.CSVReader.open(file)
  val stations = processByChunk(reader.iterator.drop(1), ListBuffer())
  reader.close
  stations
}
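The same single-pass idea can also be written by peeking at the next row with a BufferedIterator instead of calling span. The sketch below is only an illustration under the same sorted-input assumption; `ChunkSketch` and `chunks` are hypothetical names, and it works on plain `Seq[String]` rows so it does not depend on the CSV library.

```scala
object ChunkSketch {
  // Groups consecutive rows that share the same first three fields.
  // Assumes the input is sorted by those fields.
  def chunks(rows: Iterator[Seq[String]]): Iterator[List[Seq[String]]] = {
    val it = rows.buffered // lets us look at the next row without consuming it
    new Iterator[List[Seq[String]]] {
      def hasNext: Boolean = it.hasNext
      def next(): List[Seq[String]] = {
        val first = it.next()
        val key = first.take(3) // division, line, station
        val buf = List.newBuilder[Seq[String]]
        buf += first
        // keep consuming rows while they belong to the same station
        while (it.hasNext && it.head.startsWith(key)) buf += it.next()
        buf.result()
      }
    }
  }
}
```

Each chunk could then be turned into one Station, for example by feeding `chunks` with `reader.iterator.drop(1)` as in the snippet above and mapping every chunk through the Row extractor.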

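I created a dedicated extractor to get the routes and the entrance for a given row. I think it makes the code more readable, and it also helps performance: if you are working with a List, calling fields(0) through fields(25) is not optimal, because each call has to traverse the list. The extractor avoids that. With most Java CSV parsers you usually get an Array[String], so this is generally not a problem. Finally, CSV parsers usually return an empty string rather than null for an empty field, so you may want to use if (adaNotes == "") None else Some(adaNotes) instead of Option(adaNotes).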

object Row {
  def unapply(s: Seq[String]) = s match {
    case List(division, line, station, rest @ _*) =>
      val (routes,
        List(ada, adaNotes, freeCrossover, entranceType,
          entry, exitOnly, entranceStaffing, northSouthStreet, eastWestStreet,
          corner, latitude, longitude)) = rest splitAt 11 // 11 routes
      Some((
        division, line, station,
        routes,
        Entrance(
          ada.toBoolean, Option(adaNotes),
          freeCrossover.toBoolean, entranceType,
          entry == "YES", exitOnly == "YES",
          entranceStaffing, northSouthStreet, eastWestStreet, corner,
          latitude.toInt, longitude.toInt)))
    case _ => None
  }
}
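The point about Option(adaNotes) can be checked with a small sketch. `FieldSketch` and `nonEmpty` are hypothetical names, not part of any library; the helper simply treats both null and the empty string as a missing value.

```scala
object FieldSketch {
  // Option(s) only maps null to None, so an empty CSV field would become Some("").
  // Filtering on nonEmpty turns "" into None as well.
  def nonEmpty(s: String): Option[String] =
    Option(s).filter(_.nonEmpty)
}
```

With such a helper, `Option(adaNotes)` in the extractor could be replaced by `FieldSketch.nonEmpty(adaNotes)`, which behaves like the if/else expression suggested above while also guarding against null.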