I am trying to read a large CSV file in Haskell and produce a word count for each column. How should I read a large CSV file?
The file has more than 4M rows.
So I chose to read one block at a time (5k rows per block), compute the word counts for that block, and then merge the block results together.
When I tested the function with 12,000 rows and with 120,000 rows, the running time increased almost linearly. But when reading 180,000 rows, the running time grew more than fourfold.
I think the reason is that there is not enough memory, and swapping to disk makes the function much slower.
I wrote my code in a map/reduce style, but how do I keep Haskell from holding all of the data in memory?
Below are my code and the profiling results.
import Data.Ord
import Text.CSV.Lazy.String
import Data.List
import System.IO
import Data.Function (on)
import System.Environment

splitLength = 5000

-- split the rows into blocks of splitLength rows each
mySplit' [] = []
mySplit' xs = [x] ++ mySplit' t
  where
    x = take splitLength xs
    t = drop splitLength xs

-- count the occurrences of each value, per column, within one block
getBlockCount :: Ord a => [[a]] -> [[(a, Int)]]
getBlockCount t =
  map (map (\x -> (head x, length x))) $ map group $ map sort $ transpose t

-- merge the counts of two blocks for one column
foldData :: Ord a => [(a, Int)] -> [(a, Int)] -> [(a, Int)]
foldData lxs rxs = map combind wlist
  where
    wlist = groupBy ((==) `on` fst) $ sortBy (comparing fst) $ lxs ++ rxs
    -- each group has one or two entries, since both inputs are already merged
    combind xs
      | 1 == length xs = head xs
      | 2 == length xs = (fst (head xs), snd (head xs) + snd (last xs))

loadTestData datalen = do
  testFile <- readFile "data/test_csv"
  let cfile = fromCSVTable $ csvTable $ parseCSV testFile
  let column = head cfile
  let body = take datalen $ tail cfile
  let countData = foldl1' (zipWith foldData) $ map getBlockCount $ mySplit' body
  let output = zip column $ map (reverse . sortBy (comparing snd)) countData
  appendFile "testdata" $ foldl1 (\x y -> x ++ "\n" ++ y) $ map show $ tail output

main = do
  s <- getArgs
  loadTestData $ read $ last s
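
A note on the fold above: foldl1' only evaluates its accumulator to weak head normal form, and zipWith foldData produces a lazy list of lazy pairs, so the merged counts can accumulate as unevaluated thunks across all blocks. A minimal sketch of one possible fix (an assumption, not part of the original code): deeply force each merged block with Control.DeepSeq before the next block is consumed.

import Control.DeepSeq (force)
import Data.List (foldl1')

-- merge per-block, per-column counts, forcing each intermediate
-- result so no thunk chain builds up between blocks
mergeBlocks :: [[[(String, Int)]]] -> [[(String, Int)]]
mergeBlocks = foldl1' (\acc blk -> force (zipWith foldData acc blk))

In loadTestData, countData would then become mergeBlocks $ map getBlockCount $ mySplit' body; whether this removes the slowdown would have to be confirmed with the same profiling runs.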
Profiling results
loadData +RTS -p -RTS 12000
total time = 1.02 secs (1025 ticks @ 1000 us, 1 processor)
total alloc = 991,266,560 bytes (excludes profiling overheads)
loadData +RTS -p -RTS 120000
total time = 17.28 secs (17284 ticks @ 1000 us, 1 processor)
total alloc = 9,202,259,064 bytes (excludes profiling overheads)
loadData +RTS -p -RTS 180000
total time = 85.06 secs (85059 ticks @ 1000 us, 1 processor)
total alloc = 13,760,818,848 bytes (excludes profiling overheads)
You need to use a streaming library, for example 'csv-conduit' or 'pipes-csv' – ErikR 2014-11-03 02:43:54
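
In the spirit of that comment, the same streaming idea can be sketched even without csv-conduit or pipes-csv, using lazy I/O plus a strict accumulator: rows are read lazily and each one is folded into a strict Map, so consumed input can be garbage-collected and the accumulator never holds thunks. This is a minimal sketch, not the question's code; the comma split below is a naive assumption that ignores quoting, splitOn is a hypothetical helper, and the file name is taken from the question.

import Data.List (foldl')
import qualified Data.Map.Strict as M

-- accumulate counts keyed by (column index, field value); Data.Map.Strict
-- forces each updated count as it is inserted, so the accumulator stays
-- fully evaluated while the lazily read rows are consumed and collected
countColumns :: [[String]] -> M.Map (Int, String) Int
countColumns = foldl' step M.empty
  where
    step acc row =
      foldl' (\m (i, v) -> M.insertWith (+) (i, v) 1 m) acc (zip [0 ..] row)

main :: IO ()
main = do
  contents <- readFile "data/test_csv"          -- lazy I/O streams the file
  let rows = map (splitOn ',') (lines contents) -- naive split, no quote handling
  print $ M.toList $ countColumns (drop 1 rows) -- drop the header row

-- hypothetical helper: split one line on a separator character
splitOn :: Char -> String -> [String]
splitOn sep = foldr step [[]]
  where
    step c acc@(cur : rest)
      | c == sep  = [] : acc
      | otherwise = (c : cur) : rest
    step _ [] = [[]]

Compiled with optimizations, this should run in roughly constant space regardless of row count; csv-conduit or pipes-csv would replace the naive splitOn with a real CSV parser on top of the same streaming pattern.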