This is how I'd go about it:
require 'nokogiri'

# Read the XML from disk and parse it.
doc = Nokogiri::XML(File.read('/Users/gferguson/smithsonian-events.xml'))

# Hash of every namespace declared in the document, used for namespaced lookups.
namespaces = doc.collect_namespaces

entries = doc.search('entry').map { |entry|
  entry_title = entry.at('title').text

  # Both times are attributes of the same <gd:when> node.
  entry_time_start, entry_time_end = ['startTime', 'endTime'].map { |p|
    entry.at('gd|when', namespaces)[p]
  }

  entry_notes = entry.at('gc|notes', namespaces).text

  {
    title: entry_title,
    start_time: entry_time_start,
    end_time: entry_time_end,
    notes: entry_notes
  }
}
When run, that leaves entries holding an array of hashes:
require 'awesome_print'
ap entries[0, 3]
# >> [
# >> [0] {
# >> :title => "Conservation Clinics",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T17:00:00Z",
# >> :notes => "Have questions about the condition of a painting, frame, drawing,\n print, or object that you own? Our conservators are available by\n appointment to consult with you about the preservation of your art.\n \n To request an appointment or to learn more,\n e-mail [email protected] and specify CLINIC in the subject line."
# >> },
# >> [1] {
# >> :title => "Castle Highlights Tour",
# >> :start_time => "2016-11-09T14:00:00Z",
# >> :end_time => "2016-11-09T14:45:00Z",
# >> :notes => "Did you know that the Castle is the Smithsonian’s first and oldest building? Join us as one of our dynamic volunteer docents takes you on a tour to explore the highlights of the Smithsonian Castle. Come learn about the founding and early history of the Smithsonian; its original benefactor, James Smithson; and the incredible history and architecture of the Castle. Here is your opportunity to discover the treasured stories revealed within James Smithson's crypt, the Gre...
# >> },
# >> [2] {
# >> :title => "Exhibition Interpreters/Navigators (throughout the day)",
# >> :start_time => "2016-11-09T15:00:00Z",
# >> :end_time => "2016-11-09T15:00:00Z",
# >> :notes => "Museum volunteer interpreters welcome visitors, answer questions, and help visitors navigate exhibitions. Interpreters may be stationed in several of the following exhibitions at various times throughout the day, subject to volunteer interpreter availability. <ul> \t<li><em>The David H. Koch Hall of Human Origins: What Does it Mean to be Human?</em></li> \t<li><em>The Sant Ocean Hall</em></li> </ul>"
# >> }
# >> ]
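I didn't try to gather the specific information you asked for because event_name doesn't exist, and what you're doing is very generic and easily done once you understand a few rules.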
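XML is generally very repetitive because it represents tables of data. The "cells" of the table might vary, but there is repetition you can use to help you. In this code, doc.search('entry') loops over the <entry> nodes. From there it's easy to look inside each one to find the information needed.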
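XML uses namespaces to help avoid tag-name collisions. At first these seem really hard, but Nokogiri provides the collect_namespaces method for documents, which returns a hash of all namespaces in the document. If you're looking for a namespaced tag, pass that hash as the second parameter.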
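Nokogiri allows us to use XPath and CSS for selectors. I almost always go with CSS for readability. ns|tag is the format that tells Nokogiri to use a CSS-based namespaced tag. Again, pass it the hash of namespaces in the document and Nokogiri will do the rest.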
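If you're familiar with working with Nokogiri, you'll see that the code above is very similar to the normal code used to pull the contents of <td> cells out of the <tr> rows in an HTML <table>.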
You should be able to modify that code to gather the data you need without risking namespace collisions.
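The URL you provided returns pure XML, but you're trying to find HTML in it. There is no HTML in the document. – Aleksey
Then how do I extract the content using Nokogiri? @Aleksey – Ajay
Don't use full selectors like "/html/body/div[2]/div[2]/div[1]/h3/a/span". They're very fragile. Instead, find the shortest path to the node you want and use that. That way, the selector has a much better chance of still working if the document layout changes. As it stands, your code will break if the page changes even slightly. –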