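Parsing an HTML tree with nested loops using Nokogiri

Hi, I'm new to Nokogiri and am trying to parse an HTML document with a varied tree structure. Any suggestions on how to parse it would be great. I want to capture all of the text on this page.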

<div class = "main"> Title</div> 
<div class = "subTopic"> 
    <span class = "highlight">Sub Topic</span>Stuff
</div> 

<div class = "main"> Another Title</div> 
<div class = "subTopic"> 
    <span class = "highlight">Sub Topic Title I</span>Stuff<br> 
    <span class = "highlight">Sub Topic Title II</span>Stuff<br> 
    <span class = "highlight">Sub Topic Title III</span>Stuff<br> 
</div>

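I tried the following, but it just prints out the full set of matches on every pass, and I don't even know how to get at the "Stuff" part.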

content = Nokogiri::HTML(open(@url)) 
content.css('div.main').each do |m| 
    puts m.text
    content.css('div.subTopic').each do |s| 
     puts s.text 
     content.css('span.highlight').each do |h| 
      puts h.text 
     end 
    end 
end

Any help would be appreciated.

Source

2013-03-14 haley

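Is there a particular reason you're using Nokogiri for this? – dezman 2013-03-14 04:32:13

I'm doing this in Rails/Ruby. Is there another tool you would suggest? – haley 2013-03-14 04:35:37

Depending on your situation, it might be better to do this client-side with JS. – 2013-03-14 04:45:34

Something like this will parse your given document structure:

Data: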

<div class="main"> Title</div> 
<div class="subTopic"> 
    <span class="highlight">Sub Topic</span>Stuff 
</div> 

<div class = "main"> Another Title</div> 
<div class = "subTopic"> 
    <span class = "highlight">Sub Topic Title I</span>Stuff<br> 
    <span class = "highlight">Sub Topic Title II</span>Stuff<br> 
    <span class = "highlight">Sub Topic Title III</span>Stuff<br> 
</div>

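Code:

require 'nokogiri'
require 'pp'

# Parse the sample markup saved in text.txt.
content = Nokogiri::HTML(File.read('text.txt'))

# Build one hash per div.main, paired with the div.subTopic that follows it.
topics = content.css('div.main').map do |m|
    topic = {}
    topic['title'] = m.text.strip
    # [1] selects the nearest following div.subTopic sibling of this title.
    sub_topic = m.xpath("following-sibling::div[@class='subTopic'][1]")
    topic['highlights'] = sub_topic.css('span.highlight').map do |h|
        topic_highlight = {}
        topic_highlight['highlight'] = h.text.strip
        # The "Stuff" text is the first text node right after the span.
        topic_highlight['text'] = h.xpath('following-sibling::text()[1]').text.strip
        topic_highlight
    end
    topic
end

pp topics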

This will print:

[{"title"=>"Title", 
    "highlights"=>[{"highlight"=>"Sub Topic", "text"=>"Stuff"}]}, 
{"title"=>"Another Title", 
    "highlights"=> 
    [{"highlight"=>"Sub Topic Title I", "text"=>"Stuff"}, 
    {"highlight"=>"Sub Topic Title II", "text"=>"Stuff"}, 
    {"highlight"=>"Sub Topic Title III", "text"=>"Stuff"}]}]

Source

2013-03-14 05:01:49 Strelok

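Thanks @Strelok! Really helpful. I got it working, but .map is new to me. I tried researching it and found Enumerable, but I still don't get why `topic` and `topic_highlight` are used at the end of their loops. I tried leaving them out, and it looks like they act as a counter. Is that right? Or, if the answer is too long, it would be great if you could point me to a topic I can google. Thanks again. – haley 2013-03-14 07:01:28

[What does the "map" method do in Ruby?](http://stackoverflow.com/questions/12084507/what-does-the-map-method-do-in-ruby) will answer your question about the `map` method. Every method in Ruby returns a value by default. The returned value is the value of the last statement. So `topic` and `topic_highlight` are the return values of the blocks. – Strelok 2013-03-14 22:44:04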
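To make the point about block return values concrete, here is a small sketch in plain Ruby (no Nokogiri needed). The `item`, `with_return`, and `without_return` names are made up for illustration. It shows what `map` collects when the hash is the last expression in the block, and what it collects when that last line is left out:

```ruby
# With the hash as the last expression, map collects the hash itself.
with_return = %w[a b].map do |name|
  item = {}
  item['name'] = name
  item
end

# Without it, the last expression is the assignment, whose value is the
# assigned string, so map collects strings instead of hashes.
without_return = %w[a b].map do |name|
  item = {}
  item['name'] = name
end

p with_return
p without_return
```

So the trailing `topic` and `topic_highlight` lines are not counters. They make each block evaluate to the finished hash, which `map` then places in the resulting array.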
