So, how fast can we actually be? Let's do some tests with a file generated by @JA's code:
cat data.txt > /dev/null
--> 0.17 seconds
And the same in Haskell?
import qualified Data.ByteString as B
f = id
main = B.readFile "data.txt" >>= return . f >>= B.putStr
The time?
time ./Test > /dev/null
--> 0.32 seconds
Twice as long, but I suppose that's not too bad. Using strict ByteStrings, because we want to finish within a second.
Next: can we use a Vector, or is it too slow? Let's build a Vector of chunks and put them back together again. I use the blaze-builder package for optimized output.
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid

recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix

numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx * (recordLen + lineEndingLen)) recordLen

mkVector :: B.ByteString -> V.Vector B.ByteString
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)

mkBS :: V.Vector B.ByteString -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
  where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
        foldToBS = mappend . BB.fromWrite . BB.writeByteString

main = B.readFile "data.txt" >>= return . mkBS . mkVector >>= L.putStr
How long does it take?
time ./Test2 > /dev/null
--> 1.06 seconds
Not bad at all! Even if you were to use a regular expression to read the lines instead of fixed chunk positions, we can still conclude that you can put your chunks into a Vector with no serious performance penalty.
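As a rough illustration of that line-based alternative, here is a minimal sketch that builds the Vector by splitting on newlines (assuming Unix line endings and plain B.lines rather than a full regex; mkVectorByLines is a name made up for this sketch):

```haskell
-- A sketch of the line-based alternative to fixed-offset indexing:
-- split the input on '\n' with Data.ByteString.Char8 instead of
-- computing record offsets. Assumes Unix line endings.
import qualified Data.ByteString.Char8 as C
import qualified Data.Vector as V

mkVectorByLines :: C.ByteString -> V.Vector C.ByteString
mkVectorByLines = V.fromList . C.lines

main :: IO ()
main = print (V.toList (mkVectorByLines (C.pack "abc\ndef\n")))
-- prints ["abc","def"]
```

The slices produced by C.lines share the underlying buffer just like the substr-based version, so the memory behavior should be comparable; only the offset computation differs.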
What's left? Sorting. In theory, a bucket sort should be an ideal algorithm for this kind of problem. I tried implementing one myself:
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Data.Vector.Generic.Mutable as MV
import qualified Blaze.ByteString.Builder as BB
import Data.Monoid
import Control.Monad.ST
import Control.Monad.Primitive

recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix

numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx * (recordLen + lineEndingLen)) (recordLen + lineEndingLen)

mkVector :: B.ByteString -> V.Vector B.ByteString
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)

mkBS :: V.Vector B.ByteString -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
  where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
        foldToBS = mappend . BB.fromWrite . BB.writeByteString

bucketSort :: Int -> V.Vector B.ByteString -> V.Vector B.ByteString
bucketSort chunkSize v = runST $ emptyBuckets >>= \bs ->
                         go v bs lastIdx (chunkSize - 1)
  where lastIdx = V.length v - 1

        emptyBuckets :: ST s (V.MVector (PrimState (ST s)) [B.ByteString])
        emptyBuckets = V.thaw $ V.generate 256 (const [])

        go :: V.Vector B.ByteString ->
              V.MVector (PrimState (ST s)) [B.ByteString] ->
              Int -> Int -> ST s (V.Vector B.ByteString)
        go v _ _ (-1) = return v
        go _ buckets (-1) testIdx = do
            v' <- unbucket buckets
            bs <- emptyBuckets
            go v' bs lastIdx (testIdx - 1)
        go v buckets idx testIdx = do
            let testChunk = v V.! idx
                testByte = fromIntegral $ testChunk `B.index` testIdx
            b <- MV.read buckets testByte
            MV.write buckets testByte (testChunk : b)
            go v buckets (idx - 1) testIdx

        unbucket :: V.MVector (PrimState (ST s)) [B.ByteString] ->
                    ST s (V.Vector B.ByteString)
        unbucket v = do
            v' <- V.freeze v
            return . V.fromList . concat . V.toList $ v'

main = B.readFile "data.txt" >>= return . process >>= L.putStr
  where process = mkBS .
                  bucketSort recordLen .
                  mkVector
Testing it gave a time of about 1 minute 50 seconds, which is probably acceptable. We are talking about an O(c*n) algorithm where n is in the range of a few million and c is a constant of 36 times something. But I'm sure you can optimize it further. Or you can simply use the vector-algorithms package. Testing with a heap sort:
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Vector as V
import qualified Blaze.ByteString.Builder as BB
import Data.Vector.Algorithms.Heap (sort)
import Data.Monoid
import Control.Monad.ST

recordLen = 36
lineEndingLen = 2 -- Windows! change to 1 for Unix

numRecords = (`div` (recordLen + lineEndingLen)) . B.length
substr idx len = B.take len . B.drop idx
recordByIdx idx = substr (idx * (recordLen + lineEndingLen)) (recordLen + lineEndingLen)

mkVector :: B.ByteString -> V.Vector B.ByteString
mkVector bs = V.generate (numRecords bs) (\i -> recordByIdx i bs)

mkBS :: V.Vector B.ByteString -> L.ByteString
mkBS = BB.toLazyByteString . V.foldr foldToBS mempty
  where foldToBS :: B.ByteString -> BB.Builder -> BB.Builder
        foldToBS = mappend . BB.fromWrite . BB.writeByteString

sortIt v = runST $ do
    mv <- V.thaw v
    sort mv
    V.freeze mv

main = B.readFile "data.txt" >>= return . process >>= L.putStr
  where process = mkBS .
                  sortIt .
                  mkVector
This does the job in about 1:20 minutes on my machine, so right now it's faster than my bucket sort. Both final solutions consume somewhere in the range of 1-1.2 GB of RAM.
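Where does that memory go? A back-of-the-envelope sketch (the record count and per-field sizes below are assumptions for illustration, not measurements):

```haskell
-- Rough accounting for the observed residency, on 64-bit GHC with a
-- boxed Vector of ByteString slices. All slices share one input
-- buffer, but each record still pays fixed overhead:
--   * one pointer slot in the boxed Vector
--   * a ByteString constructor holding a ForeignPtr, offset and length
-- Both sizes below are approximations, not measured values.
bytesPerRecord :: Int
bytesPerRecord = vectorSlot + bsOverhead
  where vectorSlot = 8       -- one boxed pointer per Vector element
        bsOverhead = 6 * 8   -- constructor header + fields, approx.

main :: IO ()
main = print (assumedRecords * bytesPerRecord)
  where assumedRecords = 5 * 10 ^ 6  -- "a few million" records (assumed)
```

That is only a few hundred megabytes of bookkeeping; add the input buffer, the builder's output chunks, and the roughly 2x residency cost of GHC's copying collector, and the observed 1-1.2 GB is in a plausible range.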
Good enough?
I did indeed start out with bytestring-trie, but it took very long on this problem: it had been running for about 12 hours before I killed it (with no parallelization, though). I could try switching the hashmap solution back on and see how it does. As for the second part: I read one line at a time with hGetLine, then use 'match' from the regex packages to extract a few substrings from the line.
It seems the hashmap version also accumulates GC badly; after each thread had finished 100k lines it reported "Productivity 32.0% of total user, 198.0% of total elapsed", and after 200k "Productivity 11.4% of total user, 132.1% of total elapsed". By contrast, if I only parse (to check that it isn't the parsing causing the GC), then after 200k lines it's "70.2% of total user, 217.9% of total elapsed".
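(The extraction step described in these comments might be sketched like this; the regex pattern and the extractFields name are made-up placeholders, since the actual pattern isn't shown. In the real program, lines would come from hGetLine on the data file.)

```haskell
-- A sketch of the commenter's described workflow: each line, read
-- with hGetLine, is matched against a POSIX regex and the capture
-- groups are extracted. The pattern "([0-9]+),([a-z]+)" is a
-- hypothetical placeholder.
import Text.Regex.Posix ((=~))

extractFields :: String -> [String]
extractFields line =
  case line =~ "([0-9]+),([a-z]+)" :: [[String]] of
    ((_ : groups) : _) -> groups   -- drop the full-match element
    _                  -> []

main :: IO ()
main = print (extractFields "123,abc")  -- prints ["123","abc"]
```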