2010-10-09 116 views
5

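Haskell HXT: extracting a list of values

I'm trying to work my way through XPath and arrows at the same time with HXT, and I'm completely stuck on how to think about this problem. I have the following HTML: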

<div> 
<div class="c1">a</div> 
<div class="c2">b</div> 
<div class="c3">123</div> 
<div class="c4">234</div> 
</div> 

I have already extracted this into an HXT XmlTree. What I want to do (I think) is define a function:

getValues :: [String] -> IOSArrow XmlTree [(String, String)]

which, when used as getValues ["c1", "c2", "c3", "c4"], would give me:

[("c1", "a"), ("c2", "b"), ("c3", "123"), ("c4", "234")] 

Any help?

Answers

2

Here's one approach (my types are a bit more general, and I don't use XPath):

{-# LANGUAGE Arrows #-} 
module Main where 

import qualified Data.Map as M 
import Text.XML.HXT.Core -- HXT 9 and later; older releases exposed this as Text.XML.HXT.Arrow

classes :: (ArrowXml a) => a XmlTree (M.Map String String) 
classes = listA (divs >>> divs >>> pairs) >>> arr M.fromList 
    where 
    divs = getChildren >>> hasName "div" 
    pairs = proc div -> do 
     cls <- getAttrValue "class" -< div 
     val <- deep getText   -< div 
     returnA -< (cls, val) 

getValues :: (ArrowXml a) => [String] -> a XmlTree [(String, Maybe String)] 
getValues cs = classes >>> arr (zip cs . lookupValues cs) 
    where lookupValues cs m = map (flip M.lookup m) cs 

main = do 
    let xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\ 
      \<div class='c3'>123</div><div class='c4'>234</div></div>" 

    print =<< runX (readString [] xml >>> getValues ["c1", "c2", "c3", "c4"]) 

I would probably run an arrow to get the map and then do the lookups afterwards, but this way works too.
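The alternative mentioned above (run an arrow once to build the map, then look values up in ordinary code) might look roughly like the sketch below. The names classMap and lookupClasses are hypothetical, and the sketch assumes the pure LA arrow with runLA and xread from Text.XML.HXT.Core instead of runX and readString, so treat it as an illustration, not as the method used above.

```haskell
import Control.Arrow ((&&&), (>>>))
import qualified Data.Map as M
import Text.XML.HXT.Core

-- Hypothetical helper: parse a fragment and map each inner <div>'s
-- class attribute to its text. xread parses the string into content
-- nodes without a document root, so a single getChildren step reaches
-- the inner <div> elements.
classMap :: String -> M.Map String String
classMap = M.fromList . runLA (xread >>> getChildren >>> hasName "div" >>> pair)
  where
    pair = getAttrValue "class" &&& deep getText

-- Hypothetical helper: plain lookups, performed after the arrow has run.
lookupClasses :: [String] -> M.Map String String -> [(String, Maybe String)]
lookupClasses cs m = [(c, M.lookup c m) | c <- cs]
```

In GHCi, lookupClasses ["c1", "c4"] (classMap xml) pairs each requested class with its text, or with Nothing when the class is missing.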


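To answer your question about listA: divs >>> divs >>> pairs is a list arrow of type a XmlTree (String, String), i.e. a non-deterministic computation that takes an XML tree and returns pairs of strings.

arr M.fromList has type a [(String, String)] (M.Map String String). This means we can't simply compose it with divs >>> divs >>> pairs, because the types don't match.

listA solves this problem: it collapses divs >>> divs >>> pairs into a deterministic version of type a XmlTree [(String, String)], which is exactly what we need.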
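To make the effect of listA concrete, here is a small sketch that leaves XML out entirely. It assumes HXT's pure list arrow LA together with constL and runLA from Text.XML.HXT.Core; the names three, separate, collected and total are made up for this illustration.

```haskell
import Control.Arrow (arr, (>>>))
import Text.XML.HXT.Core

-- A list arrow that yields three separate results for any input.
three :: LA () Int
three = constL [1, 2, 3]

-- Without listA, the results arrive one at a time.
separate :: [Int]
separate = runLA three ()

-- With listA, all results are collected into one list-valued result.
collected :: [[Int]]
collected = runLA (listA three) ()

-- Only after listA can a function on whole lists be composed in,
-- just as arr M.fromList needs the complete list of pairs.
total :: [Int]
total = runLA (listA three >>> arr sum) ()
```

Here separate has three results, while collected and total each have exactly one. The same thing happens with listA (divs >>> divs >>> pairs): it turns many (String, String) results into a single [(String, String)] result.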

+0

What does listA do? – Muchin 2010-10-10 00:37:55

0

Here is one way to do it using HandsomeSoup:

-- For the join function. 
import Data.String.Utils 
import Text.HandsomeSoup 
import Text.XML.HXT.Core 

-- Of each element, get class attribute and text. 
getItem = (this ! "class" &&& (this /> getText)) 
getItems selectors = css (join "," selectors) >>> getItem 

main = do 
    let selectors = [".c1", ".c2", ".c3", ".c4"] 
    items <- runX (readDocument [] "data.html" >>> getItems selectors) 
    print items 

Here data.html is the HTML file.