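How do I select a chunk of HTML using Nokogiri with XPath or CSS selectors?

In my Rails application I have HTML like the following, parsed with Nokogiri.

I want to be able to select chunks of the HTML. For example, how can I use XPath or CSS to select the chunk of HTML that belongs to <sup id="21">? Assume that the ******** lines are not present in the real HTML.

I want to split the HTML on <sup id=*>, but the problem is that the nodes are siblings.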

<sup class="v" id="20"> 
1 
</sup> 
this is some random text 
<p></p>    
more random text 
<sup class="footnote" value='fn1'> 
[v] 
</sup> 

# ****************************** starting here 
<sup class="v" id="21"> 
2 
</sup> 
now this is a different section 
<p></p>    
how do we keep this separate 
<sup class="footnote" value='fn2'> 
[x] 
</sup> 
# ****************************** ending here 

<sup class="v" id="23"> 
3 
</sup> 
this is yet another different section 
<p></p>    
how do we keep this separate too 
<sup class="footnote" value='fn3'> 
[r] 
</sup>

Source

2011-12-13 Brand

Here's a simple solution that gives you NodeSets of all the nodes between the <sup … class="v"> elements, stored in a hash keyed by their id.

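require 'nokogiri'

doc = Nokogiri::HTML(your_html)

# Unknown keys start out as an empty NodeSet tied to this document.
nodes_by_vsup_id = Hash.new{ |hash, id| hash[id] = Nokogiri::XML::NodeSet.new(doc) }
last_id = nil
doc.at('body').children.each do |n|
  # A <sup class="v"> starts a new chunk; every other node joins the current one.
  last_id = n['id'] if n['class']=='v'
  nodes_by_vsup_id[last_id] << n
end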

puts nodes_by_vsup_id['21'] 
#=> <sup class="v" id="21"> 
#=> 2 
#=> </sup> 
#=> 
#=> now this is a different section 
#=> <p></p> 
#=>  
#=> how do we keep this separate 
#=> <sup class="footnote" value="fn2"> 
#=> [x] 
#=> </sup>

Or, if you really don't want the delimiting "sup" to be part of the collection, do this instead:

# Note: #elements yields only element children; use #children to keep text nodes as well.
doc.at('body').elements.each do |n|
  if n['class']=='v'
    last_id = n['id']
  else
    nodes_by_vsup_id[last_id] << n
  end
end

Here's an alternative, even more generic solution:

class Nokogiri::XML::NodeSet
  # Yields each node in the set to your block
  # Returns a hash keyed by whatever your block returns
  # Any nodes that return nil/false are grouped with the previous valid value
  def group_chunks
    Hash.new{ |k,v| k[v] = self.class.new(document) }.tap do |result|
      key = nil
      each{ |n| result[key = yield(n) || key] << n }
    end
  end
end

root_items = doc.at('body').children 
separated = root_items.group_chunks{ |node| node['class']=='v' && node['id'] } 
puts separated['21']

Source

2011-12-13 20:27:23 Phrogz

You're monkeypatching Nokogiri there, right? –

@DavidWest That's correct: the final "even more generic" code "reopens" the Nokogiri class and adds a new instance method, i.e. "monkeypatching". – Phrogz

-1

require 'open-uri' 
require 'nokogiri' 

doc = Nokogiri::HTML(URI.open("http://www.yoururl")) 
doc.xpath('//sup[id="21"]').each do |node| 
    puts node.text 
end

Source

2011-12-13 13:50:05 abcde123483

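Downvoted because a) your XPath is invalid, and b) this doesn't solve the problem the OP asked about (selecting everything up to the next similar element). – Phrogz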

It looks like you want to select everything between the sup with @id='21' and the sup with @id='23'. Use the following ad hoc expression:

//sup[@id='21']|(//sup[@id='21']/following-sibling::node()[ 
    not(self::sup[@id='23'] or preceding-sibling::sup[@id='23'])])

Or apply the Kayessian formula for node-set intersection:

//sup[@id='21']|(//sup[@id='21']/following-sibling::node()[ 
    count(.|//sup[@id='23']/preceding-sibling::node()) 
    = 
    count(//sup[@id='23']/preceding-sibling::node())])

Source

2011-12-13 16:24:29
