2012-02-10 56 views
4

我想知道解析多個解析器可以成功的輸入的最佳方式。我已經概述了我第一次失敗的嘗試和一個我希望可以變得更加習慣的不雅的解決方案。在單個輸入中選擇多個正確的解析器

比如我想萊克斯「號的」,「快速」和「狐狸」從下面的句子翻譯成他們自己的數據構造:

"the quick brown fox jumps over the lazy dog". 

因此,考慮以下類型的構造函數:

data InterestingWord = Quick | The | Fox deriving Show 
data Snippet = Word InterestingWord | Rest String deriving Show 

我想解析的輸出是:

[Word The, 
Rest " ", Word Quick, 
Rest " brown ", Word Fox, 
Rest " jumped over ", Word The, 
Rest " lazy dog"] 

這裏有兩種解決方案:

import Text.Parsec 
import Data.Maybe 
import Data.Ord  
import Data.List 

data InterestingWord = Quick | The | Fox deriving Show 
data Snippet = Word InterestingWord | Rest String deriving Show 

testCase = "the quick brown fox jumped over the lazy dog" 
-- Expected output: 
-- [Word The, 
-- Rest " ", Word Quick, 
-- Rest " brown ", Word Fox, 
-- Rest " jumped over ", Word The, 
-- Rest " lazy dog"] 

toString Quick = "quick" 
toString The = "the" 
toString Fox = "fox" 

-- First attempt 

-- Return characters upto the intended word along 
-- with the word itself 
upto word = do 
    pre <- manyTill anyChar $ lookAhead $ string (toString word) 
    word' <- try $ string (toString word) 
    return [Rest pre, Word word] 

-- Parsers for the interesting words 
parsers = [upto Quick, 
      upto The, 
      upto Fox] 

-- Try each parser and return its results with the 
-- rest of the input. 
-- An incorrect result is produced because "choice" 
-- picks the first successful parse result. 
wordParser = do 
    snippets <- many $ try $ choice parsers 
    leftOver <- many anyChar 
    return $ concat $ snippets ++ [[Rest leftOver]] 

-- [Rest "the ",Word Quick,Rest " brown fox jumped over the lazy dog"]   
test1 = parseTest wordParser testCase 

-- Correct 

-- In addition to the characters leading upto the 
-- word and the word, the position is also returned 
upto' word = do 
    result <- upto word 
    pos <- getPosition 
    return (pos, result) 

-- The new parsers   
parsers' = [upto' Quick, 
      upto' The, 
      upto' Fox] 

-- Try each of the given parsers and 
-- possibly returning the results and 
-- the parser but don't consume 
-- input. 
tryAll = mapM (\p -> do 
       r <- optionMaybe $ try (lookAhead p) 
       case r of 
        Just result -> return $ Just (p, result) 
        Nothing -> return $ Nothing 
      ) 

-- Pick the parser that has consumed the least.    
firstSuccess ps = do 
    successes <- tryAll ps >>= return . catMaybes 
    if not (null successes) then 
     return $ Just (fst $ head (sortBy (comparing (\(_,(pos,_)) -> pos)) successes)) 
    else return $ Nothing 

-- Return the parse results for the parser that 
-- has consumed the least 
wordParser' = do 
    parser <- firstSuccess parsers' 
    case parser of 
    Just p -> do 
     (_,snippet) <- p 
     return snippet 
    Nothing -> parserZero 

-- Returns the right result 
test2 = parseTest (many wordParser' >>= return . concat) testCase 

第一次嘗試「測試1」,因爲「選擇」返回成功,當我真正想要的是同時消耗最少的字符成功的第一個解析器第一分析器不產生所需的輸出。這是我接下來的嘗試,通過保持輸入被解析後的源位置,並使用源位置最低的解析器。

這種情況似乎很普遍,我覺得我錯過了一些明顯的combinator咒語。誰能提供更好的建議?

謝謝!

-deech

+0

作爲一般觀點 - 我不會急於使用Parsec進行NLP解析,它實際上是解析編程語言和結構化文本格式的工具。正在進行的Haskell NLP書似乎直接使用Prelude的「單詞」和列表功能 - http://nlpwp.org/book/ – 2012-02-10 18:41:53

回答

1

這是不是一個特別普遍的需求,但在這裏是一個實現:

import Control.Monad 
import "parsec3" Text.Parsec 
import Data.Maybe 
import Data.List 
import Data.Ord 

longestParse :: [Parsec String() a] -> Parsec String() a 
longestParse parsers = do 
    allParses <- sequence [lookAhead $ optionMaybe $ try $ 
    liftM2 (,) parse getPosition | parse <- parsers] 
    -- allParses :: [Maybe (a, SourcePos)] 
    (bestParse, bestPos) <- case catMaybes allParses of 
    [] -> fail "No valid parse" -- maybe we can do something better? 
    successfulParses -> return $ minimumBy (comparing snd) successfulParses 
    setPosition bestPos 
    return bestParse 
+0

有趣的是,這不是一個常見的用例,並且給出您的答案,我想不是太遙遠了。我非常喜歡你如何使用列表理解來過濾出成功的分析。謝謝! – Deech 2012-02-11 14:56:24

0

據我瞭解,你要反覆分析給你看的第一個有趣的詞。目前,您正在解析每個有趣的單詞,並檢查您找到的哪個有趣單詞更接近。

我的建議是定義一個解析器來檢查你現在是否在一個有趣的單詞(只有一個選擇可以成功,所以沒有必要做比任何簡單的選擇更有用的東西)。然後你繼續前進,直到第一個解析器成功,當碰到任何有趣的單詞時會發生這種情況。這給了你第一個有趣的單詞,因爲在它包含任何有趣的單詞之前你什麼都不知道。

是的,這並沒有回答確定哪個解析器匹配最短的問題。這避免了這個問題,通過解決實際問題而不關心哪個解析器匹配是最短的。

+0

感謝您的回答。問題是我有多個解析器可以成功。 Parsec有「嘗試(...)<|>(...)...」的成語,但像「選擇」一樣,它只挑選第一個成功的。我需要知道哪些解析器成功地使用了最少的輸入。 – Deech 2012-02-11 14:54:11

+0

@Deech見編輯 - 我想更好地解釋我自己 – Retief 2012-02-11 18:33:50