查找對內存來說太大的列表的大小？

這裏是全新的Haskell程序員。剛剛完成「瞭解你一個Haskell」...我對一個有多少特定屬性的集合感興趣。我有一些小的參數值的工作代碼，但我想知道如何處理更大的結構。我知道Haskell可以做「無限的數據結構」，但我只是沒有看到如何從那裏走到那裏，並學習你一個Haskell /谷歌並沒有讓我這樣做。查找對內存來說太大的列表的大小？

下面是我的eSet給「小」參數的工作代碼r和t

import Control.Monad 
import System.Environment 
import System.Exit 

myPred :: [Int] -> Bool 
myPred a = myPred' [] a 
    where 
     myPred' [] []   = False 
     myPred' [] [0]  = True 
     myPred' _ []   = True 
     myPred' acc (0:aTail) = myPred' acc aTail 
     myPred' acc (a:aTail) 
      | a `elem` acc = False 
      | otherwise  = myPred' (a:acc) aTail 

superSet :: Int -> Int -> [[Int]] 
superSet r t = replicateM r [0..t] 

eSet :: Int -> Int -> [[Int]] 
eSet r t = filter myPred $ superSet r t 

main :: IO() 
main = do 
    args <- getArgs 
    case args of 
     [rArg, tArg] -> 
      print $ length $ eSet (read rArg) (read tArg) 
     [rArg, tArg, "set"] -> 
      print $   eSet (read rArg) (read tArg) 
     _ -> 
      die "Usage: eSet r r set <set optional for printing set itself otherwise just print the size

當編譯/運行，我得到

$ ghc eSet.hs -rtsopts 
[1 of 1] Compiling Main    (eSet.hs, eSet.o) 
Linking eSet ... 
$ # Here's is a tiny eSet to illustrate. It is the set of lists of r integers from zero to t with no repeated nonzero list entries 
$ ./eSet 4 2 set 
[[0,0,0,0],[0,0,0,1],[0,0,0,2],[0,0,1,0],[0,0,1,2],[0,0,2,0],[0,0,2,1],[0,1,0,0],[0,1,0,2],[0,1,2,0],[0,2,0,0],[0,2,0,1],[0,2,1,0],[1,0,0,0],[1,0,0,2],[1,0,2,0],[1,2,0,0],[2,0,0,0],[2,0,0,1],[2,0,1,0],[2,1,0,0]] 
$ ./eSet 8 4 +RTS -sstderr 
3393 
    174,406,136 bytes allocated in the heap 
     29,061,152 bytes copied during GC 
     4,382,568 bytes maximum residency (7 sample(s)) 
     148,664 bytes maximum slop 
       14 MB total memory in use (0 MB lost due to fragmentation) 

            Tot time (elapsed) Avg pause Max pause 
    Gen 0  328 colls,  0 par 0.047s 0.047s  0.0001s 0.0009s 
    Gen 1   7 colls,  0 par 0.055s 0.055s  0.0079s 0.0147s 

    INIT time 0.000s ( 0.000s elapsed) 
    MUT  time 0.298s ( 0.301s elapsed) 
    GC  time 0.102s ( 0.102s elapsed) 
    EXIT time 0.001s ( 0.001s elapsed) 
    Total time 0.406s ( 0.405s elapsed) 

    %GC  time  25.1% (25.2% elapsed) 

    Alloc rate 585,308,888 bytes per MUT second 

    Productivity 74.8% of total user, 75.0% of total elapsed 

$ ./eSet 10 5 +RTS -sstderr 
63591 
    27,478,010,744 bytes allocated in the heap 
    4,638,903,384 bytes copied during GC 
    532,163,096 bytes maximum residency (15 sample(s)) 
     16,500,072 bytes maximum slop 
      1556 MB total memory in use (0 MB lost due to fragmentation) 

            Tot time (elapsed) Avg pause Max pause 
    Gen 0  52656 colls,  0 par 6.865s 6.864s  0.0001s 0.0055s 
    Gen 1  15 colls,  0 par 8.853s 8.997s  0.5998s 1.8617s 

    INIT time 0.000s ( 0.000s elapsed) 
    MUT  time 52.652s (52.796s elapsed) 
    GC  time 15.717s (15.861s elapsed) 
    EXIT time 0.193s ( 0.211s elapsed) 
    Total time 68.564s (68.868s elapsed) 

    %GC  time  22.9% (23.0% elapsed) 

    Alloc rate 521,883,277 bytes per MUT second 

    Productivity 77.1% of total user, 76.7% of total elapsed

我看到我的內存使用率是非常高的，有一個很多垃圾收集。運行時eSet 12 6我得到一個分段錯誤。

我覺得filter myPred $ superSet r t讓我不再懶惰地讓子集一次一部分，所以我可以處理更大（但有限）的集合，但我不知道如何改變爲另一種方法來做到這一點。我認爲這是我的問題的根源。另外，因爲這是我的第一個Haskell程序，所以關於樣式以及如何實現「pythonic」的Haskell模擬的非常感謝！

來源

2017-09-20 Kevin Johnson

也許在這裏並不重要，但在測量任何性能之前，請記住使用'-O2'或類似的東西開啓優化編譯。有時它會產生巨大的差異。 – chi

如果你只關心有多少，那麼一點組合應該讓你想要更便宜。如果聽起來很有意思，我很樂意再寫一個關於這個問題的答案。 –

@DanielWagner有點組合是最受歡迎的，也是一個很好的方法來幫助確認程序的正確性！請寫下來。 –

一個完全不同的方法是隻做一點組合分析。下面是eSet r t做，儘可能接近我可以做出來的過程：無需更換

選擇最多r元素從一組大小t的。
通過交錯標記值將序列填充到長度爲r。

因此，讓我們只是計算每個步驟的方法，而不是實際執行它們。我們將介紹一個新參數s，它是步驟（1）產生的序列的長度（因此我們知道它具有s <= r和s <= t）。當從t大小的集合中繪製元素時，有多少個大小爲s的序列？那麼，有t選擇的第一要素，t-1選擇第二個元素，t-2選擇了第三個元素，...

-- sequencesWithoutReplacement is a very long name! 
seqWORepSize :: Integer -> Integer -> Integer 
seqWORepSize s t = product [t-s+1 .. t]

然後，我們需要填充序列出來的r的長度。我們將在r中選擇s個位置 - 從我們的序列中抽取長序列，剩下的將是哨兵。有多少種方法可以做到這一點？這是一個着名的組合運算符choose。

choose :: Integer -> Integer -> Integer 
choose r s = product [r-s+1 .. r] `div` product [2 .. s]

的方式來產生一個給定長度的填充序列號是這兩個數字的不僅僅是產品，因爲「插入什麼樣的價值觀」和「在何處插入值」的選擇可以完全由獨立。

paddedSeqSize :: Integer -> Integer -> Integer -> Integer 
paddedSeqSize r s t = seqWORepSize s t * (r `choose` s)

現在我們已經完成了很多工作。只需遍歷所有可能的序列長度並加起來就可以得到paddedSeqSize。

eSetSize :: Integer -> Integer -> Integer 
eSetSize r t = sum $ map (\s -> paddedSeqSize r s t) [0..r]

我們可以試一下在ghci中：

> :set +s 
> map length $ [eSet 1 1, eSet 4 4, eSet 6 4, eSet 4 6] 
[2,209,1045,1045] 
(0.13 secs, 26,924,264 bytes) 
> [eSetSize 1 1, eSetSize 4 4, eSetSize 6 4, eSetSize 4 6] 
[2,209,1045,1045] 
(0.01 secs, 120,272 bytes)

這種方式是顯著更快，顯著更多的內存友好。事實上，我們可以查詢並獲得有關eSet的答案，我們將永遠無法統計一個接一個的大小，例如，

> length . show $ eSetSize 1000 1000 
2594 
(0.26 secs, 909,746,448 bytes)

祝你好運數10^2594一次一個。 = P

編輯
這種思想也可以適用於生產填充的序列本身，而不是僅僅計算有多少。首先，一個方便的工具，我發現自己定義一遍又一遍地對挑選出的列表的各個元素和報告的剩菜：

zippers :: [a] -> [([a], a, [a])] 
zippers = go [] where 
    go ls [] = [] 
    go ls (h:rs) = (ls, h, rs) : go (h:ls) rs

現在，無需更換序列可以通過反覆從選擇單個元素來完成剩菜。

seqWORep :: Int -> [a] -> [[a]] 
seqWORep 0 _ = [[]] 
seqWORep n xs = do 
    (ls, y, rs) <- zippers xs 
    ys <- seqWORep (n-1) (ls++rs) 
    return (y:ys)

一旦我們有一個順序，我們可以通過生產定點值的所有的交錯如下它墊至所需尺寸：

interleavings :: Int -> a -> [a] -> [[a]] 
interleavings 0 _ xs = [xs] 
interleavings n z [] = [replicate n z] 
interleavings n z [email protected](x:xt) = map (z:) (interleavings (n-1) z xs) 
          ++ map (x:) (interleavings n z xt)

最後，頂級的功能只是代表到seqWORep和interleavings。

eSet :: Int -> Int -> [[Int]] 
eSet r t = do 
    s <- [0..r] 
    xs <- seqWORep s [1..t] 
    interleavings (r-s) 0 xs

在我的測試這個修改eSet具有既沒有優化好的平面內存使用量;不會生成任何需要稍後filter的虛假元素，因此比您的原始提案更快;對於依賴誘惑GHC的答案來說，對我而言，這看起來相當自然。一個很好的屬性集合！

來源

2017-09-21 14:44:57

令人印象深刻。這個eSet 10 5運行時間爲0.527s，而我的解決方案的運行時間爲0.766s： eSet rt = return [] >> = foldr（<= <）return（replicate r（branch t））它具有非常好的屬性（沒有過濾，與-O2一起工作，並且速度很快）。優秀的領域特定技術應用。謝謝！做得好！ –

我懷疑這裏的罪魁禍首是replicateM，其中有this implementation：

replicateM cnt0 f = 
    loop cnt0 
    where 
    loop cnt 
     | cnt <= 0 = pure [] 
     | otherwise = liftA2 (:) f (loop (cnt - 1))

問題行是liftA2 (:) f (loop (cnt - 1));可能loop (cnt - 1)正在部分應用於f元素的所有(:)調用中共享，因此loop (cnt - 1)必須保留在內存中。不幸loop (cnt - 1)是一個相當長的名單...

它可以有點fiddly說服GHC 不是分享一些東西。以下重新定義superSet給了我一個很好的平面內存使用情況;當然，對於適合內存的例子來說，它可能會稍微慢一些。關鍵的想法是讓它看起來像未經訓練的眼睛（即GHC），就像遞歸單子動作取決於先前做出的選擇一樣，即使它沒有。

superSet :: Int -> Int -> [[Int]] 
superSet r t = go r 0 where 
    go 0 ignored = if ignored == 0 then [[]] else [[]] 
    go r ignored = do 
     x <- [0..t] 
     xs <- go (r-1) (ignored+x) 
     return (x:xs)

如果你不介意的話，避免優化，更自然的定義也適用：

superSet 0 t = [[]] 
superSet r t = do 
    x <- [0..t] 
    xs <- superSet (r-1) t 
    return (x:xs)

...但-O2 GHC是太聰明，注意到它可以共享的遞歸調用。

來源

2017-09-20 22:50:19

被忽略的+ h'應該被忽略+1嗎？ –

@ K.A.Buhr哎呀，它應該是'ignored + x'。我的測試文件使用'h'和't'來代替'x'和'xs'，但是我發現't'的重載在進行寫操作時有點混亂，並且沒有完全改變名字。修復傳入;感謝您的注意！ –

重新閱讀LYaH的部分內容後，思考@ daniel-wagners的單調組合答案聽起來像是值得再試一次。新的代碼總內存是平坦的，並且可以使用和不使用-O2優化。

來源：

import Control.Monad 
import System.Environment 
import System.Exit 

allowed :: [Int] -> Bool 
allowed a = allowed' [] a 
    where 
     allowed' [ ] [ ]  = False 
     allowed' [ ] [0]  = True 
     allowed' _ [ ]  = True 
     allowed' acc (0:aTail) = allowed' acc aTail 
     allowed' acc (a:aTail) 
      | a `elem` acc = False 
      | otherwise  = allowed' (a:acc) aTail 

branch :: Int -> [Int] -> [[Int]] 
branch t x = filter allowed [n:x | n <- [0..t]] 

eSet :: Int -> Int -> [[Int]] 
eSet r t = return [] >>= foldr (<=<) return (replicate r (branch t)) 

main :: IO() 
main = do 
    args <- getArgs 
    case args of 
     [rArg, tArg] -> 
      print $ length $ eSet (read rArg) (read tArg) 
     [rArg, tArg, "set"] -> 
      print $   eSet (read rArg) (read tArg) 
     _ -> die "Usage: eSet r r set <set optional>"

與一元函數組合測試的版本速度更快，沒有內存問題...

$ ./eSetMonad 10 5 +RTS -sstderr 
63591 
289,726,000 bytes allocated in the heap 
    997,968 bytes copied during GC 
     63,360 bytes maximum residency (2 sample(s)) 
     24,704 bytes maximum slop 
      2 MB total memory in use (0 MB lost due to fragmentation) 

           Tot time (elapsed) Avg pause Max pause 
Gen 0  553 colls,  0 par 0.008s 0.008s  0.0000s 0.0002s 
Gen 1   2 colls,  0 par 0.000s 0.000s  0.0002s 0.0003s 

INIT time 0.000s ( 0.000s elapsed) 
MUT  time 0.426s ( 0.429s elapsed) 
GC  time 0.009s ( 0.009s elapsed) 
EXIT time 0.000s ( 0.000s elapsed) 
Total time 0.439s ( 0.438s elapsed) 

%GC  time  2.0% (2.0% elapsed) 

Alloc rate 680,079,893 bytes per MUT second 

Productivity 98.0% of total user, 98.3% of total elapsed

來源

2017-09-21 04:21:07

查找對內存來說太大的列表的大小？

回答

相關問題