2013-04-26 86 views
2

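Parsing XML based on the attributes and text values of related nodes

I have used the XML package before to parse HTML and XML, and I have a basic grasp of XPath. However, I have been asked to work with XML data in which the important bits are determined by a combination of an element's own text and attributes and those of related nodes. I have never done this before. For example: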

[Updated: the example has been expanded slightly]

<Catalogue> 
<Bookstore id="ID910705541"> 
    <location>foo bar</location> 
    <books> 
    <book category="A" id="1"> 
     <title>Alpha</title> 
     <author ref="1">Matthew</author> 
     <author>Mark</author> 
     <author>Luke</author> 
     <author ref="2">John</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="B" id="10"> 
     <title>Beta</title> 
     <author ref="1">Huey</author> 
     <author>Duey</author> 
     <author>Louie</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="D" id="100"> 
     <title>Gamma</title> 
     <author ref="1">Tweedle Dee</author> 
     <author ref="2">Tweedle Dum</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    </books> 
    </Bookstore> 
<Bookstore id="ID910700051"> 
    <location>foo</location> 
    <books> 
    <book category="A" id="1"> 
     <title>Happy</title> 
     <author>Dopey</author> 
     <author>Bashful</author> 
     <author>Doc</author> 
     <author ref="1">Grumpy</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="B" id="10"> 
     <title>Ni</title> 
     <author ref="1">John</author> 
     <author ref="2">Paul</author> 
     <author ref="3">George</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="D" id="100"> 
     <title>San</title> 
     <author ref="1">Ringo</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    </books> 
</Bookstore> 
<Bookstore id="ID910715717"> 
    <location>bar</location> 
    <books> 
    <book category="A" id="1"> 
     <title>Un</title> 
     <author ref="1">Winkin</author> 
     <author>Blinkin</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="B" id="10"> 
     <title>Deux</title> 
     <author>Nod</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="D" id="100"> 
     <title>Trois</title> 
     <author>Manny</author> 
     <author>Moe</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    </books> 
</Bookstore> 
</Catalogue> 

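I want to extract all author names where:
1) the location element has a text value containing "NY"
2) the author element does not have a "ref" attribute, that is, ref is absent from the author tag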

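Ultimately I need to concatenate the extracted authors within a given bookstore, so that my resulting data frame has one row per store. I want to keep the bookstore ID as an additional field in the data frame so that I can refer to each store individually. Since only the first bookstore is in NY in this simple example, the result would look like this: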

1 Jane Smith John Doe Karl Pearson William Gosset 

If another bookstore had "NY" in its location, it would appear in a second row, and so on.

Given these complicated conditions, am I asking too much of R as a parser?

Answer

3
require(XML) 

xdata <- xmlParse(apptext) 
xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]') 
#[[1]] 
#<author>Jane Smith</author> 

#[[2]] 
#<author>John Doe</author> 

#[[3]] 
#<author>Karl Pearson</author> 

#[[4]] 
#<author>William Gosset</author> 

Breakdown:

Get the location nodes whose text contains 'NY':

//*/location[text()[contains(.,"NY")]] 

Get the books nodes that are siblings of those nodes:

/following-sibling::books 

From these nodes, get every author that has no ref attribute:

/.//author[not(@ref)] 
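
The three steps above can be checked one at a time. The following is only a sketch: the miniature document and its names (Ann, Bob, Cat, the ids) are made up for illustration and are not the asker's data.

```r
library(XML)

# Hypothetical miniature document, invented for this illustration
txt <- '<catalogue>
  <bookstore id="1">
    <location>NY, USA</location>
    <books>
      <book><author ref="1">Ann</author><author>Bob</author></book>
    </books>
  </bookstore>
  <bookstore id="2">
    <location>London</location>
    <books>
      <book><author>Cat</author></book>
    </books>
  </bookstore>
</catalogue>'
doc <- xmlParse(txt, asText = TRUE)

# Step 1: location elements whose text contains "NY"
step1 <- '//*/location[text()[contains(., "NY")]]'
locs <- xpathSApply(doc, step1, xmlValue)

# Step 2: the books elements that are following siblings of those locations
step2 <- paste0(step1, '/following-sibling::books')
n_books <- length(getNodeSet(doc, step2))

# Step 3: author descendants of those books that carry no ref attribute
step3 <- paste0(step2, '/.//author[not(@ref)]')
authors <- xpathSApply(doc, step3, xmlValue)

print(locs)
print(n_books)
print(authors)
```

Building the path up with paste0 makes it easy to see which step drops a node when the final result is not what you expect.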

Use xmlValue if you want the text:

xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]',xmlValue) 
#[1] "Jane Smith"  "John Doe"  "Karl Pearson" "William Gosset" 

UPDATE:

child.nodes <- xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]') 

ans.func<-function(x){ 
    xpathSApply(x,'.//ancestor::bookstore[@id]/@id') 
} 

sapply(child.nodes,ans.func) 
# id id id id 
#"1" "1" "1" "1" 

Update 2:

With your changed data:

xdata <- '<Catalogue> 
<Bookstore id="ID910705541"> 
    <location>foo bar</location> 
    <books> 
    <book category="A" id="1"> 
     <title>Alpha</title> 
     <author ref="1">Matthew</author> 
     <author>Mark</author> 
     <author>Luke</author> 
     <author ref="2">John</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="B" id="10"> 
     <title>Beta</title> 
     <author ref="1">Huey</author> 
     <author>Duey</author> 
     <author>Louie</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="D" id="100"> 
     <title>Gamma</title> 
     <author ref="1">Tweedle Dee</author> 
     <author ref="2">Tweedle Dum</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    </books> 
    </Bookstore> 
<Bookstore id="ID910700051"> 
    <location>foo</location> 
    <books> 
    <book category="A" id="1"> 
     <title>Happy</title> 
     <author>Dopey</author> 
     <author>Bashful</author> 
     <author>Doc</author> 
     <author ref="1">Grumpy</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="B" id="10"> 
     <title>Ni</title> 
     <author ref="1">John</author> 
     <author ref="2">Paul</author> 
     <author ref="3">George</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="D" id="100"> 
     <title>San</title> 
     <author ref="1">Ringo</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    </books> 
</Bookstore> 
<Bookstore id="ID910715717"> 
    <location>bar</location> 
    <books> 
    <book category="A" id="1"> 
     <title>Un</title> 
     <author ref="1">Winkin</author> 
     <author>Blinkin</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="B" id="10"> 
     <title>Deux</title> 
     <author>Nod</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    <book category="D" id="100"> 
     <title>Trois</title> 
     <author>Manny</author> 
     <author>Moe</author> 
     <year>2005</year> 
     <price>29.99</price> 
    </book> 
    </books> 
</Bookstore> 
</Catalogue>' 

Note that you previously had bookstore and now have Bookstore. NY is gone from the data, so I used foo instead.

require(XML) 
xdata <- xmlParse(xdata) 
child.nodes <- getNodeSet(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]') 

ans.func<-function(x){ 
    xpathSApply(x,'.//ancestor::Bookstore[@id]/@id') 
} 

sapply(child.nodes,ans.func) 
#   id   id   id   id   id 
#"ID910705541" "ID910705541" "ID910705541" "ID910705541" "ID910700051" 
#   id   id 
#"ID910700051" "ID910700051" 

xpathSApply(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]',xmlValue) 
# [1] "Mark" "Luke" "Duey" "Louie" "Dopey" "Bashful" "Doc"  
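
To get from here to the one-row-per-store data frame the question asks for, one possible approach (a sketch, not part of the original answer) is to select the matching Bookstore nodes first and then query the authors relative to each store. The document below is a trimmed copy of the posted data, and the names stores, rows and result are just illustrative.

```r
library(XML)

# Trimmed copy of the posted data: two stores, only the first contains "foo"
txt <- '<Catalogue>
  <Bookstore id="ID910705541">
    <location>foo bar</location>
    <books>
      <book><author ref="1">Matthew</author><author>Mark</author><author>Luke</author></book>
      <book><author ref="1">Huey</author><author>Duey</author><author>Louie</author></book>
    </books>
  </Bookstore>
  <Bookstore id="ID910715717">
    <location>bar</location>
    <books>
      <book><author>Nod</author></book>
    </books>
  </Bookstore>
</Catalogue>'
doc <- xmlParse(txt, asText = TRUE)

# Select the stores themselves, filtered on the text of their location child
stores <- getNodeSet(doc, '//Bookstore[location[contains(., "foo")]]')

# For each store, collect the authors without a ref attribute and
# collapse them into a single string next to the store id
rows <- lapply(stores, function(store) {
  authors <- xpathSApply(store, './/author[not(@ref)]', xmlValue)
  data.frame(id = xmlGetAttr(store, "id"),
             authors = paste(authors, collapse = " "),
             stringsAsFactors = FALSE)
})

# One row per matching store; this is NULL if no store matches
result <- do.call(rbind, rows)
print(result)
```

Because each query runs relative to a single store node, the authors and the id stay paired without having to walk back up through ancestor::Bookstore afterwards.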
+0

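Oh, this is great, and I have adapted it to my data. I get a vector of author names that I can then concatenate. The problem is that I need to concatenate only the names associated with their respective bookstore ID. I thought I could walk the XPath back up to ancestor::bookstore and return @id into another vector. But doing so returns one ID per bookstore, rather than one ID per qualifying book per bookstore. I expected the latter, which would give a vector the same length as the vector of author names. Any suggestions? – 2013-04-26 15:21:39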

+0

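It is a bit hard to comment without a concrete example. You could perhaps take each node that satisfies your conditions, then go back up to its ancestor (the bookstore) and retrieve the ID. I have added an example; perhaps it will work with your full data. – user1609452 2013-04-26 15:38:28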

+0

I have included an example using your updated data. – user1609452 2013-04-26 19:40:20