2011-05-22

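Capture specific text between tags

The explanation is in the code comments below. I put it there because in the post body the tags were being interpreted as bold (or something similar), which messed up the post.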

# I need to capture text that is 
# enclosed in tags that are both <b> and 
# <i>, but if there is more than one 
# text enclosed in <i> in the same <b> 
# block, then I only want the text 
# enclosed in the first <i> tag. For
# example, for the following line: 
# 
# <b> <i> Important text here </i> 
# irrelevant text everywhere else <i> 
# irrelevant text here </i> </b> <b> 
# <i> Also Important </i> not important 
# <i> not important </i> </b> 
# 
# I want to retrieve only: 
# - Important text here 
# - Also Important 
# 
# I also must not retrieve text inside an 
# <h2> block. I have been trying to 
# delete the block with nodes.delete(nodes.search('h2')),
# but it doesn't actually delete the h2 block 


require "rubygems" 
require "nokogiri" 

html = <<EOT 
    <b><i> Important text here </i> more text <i> not important text here </i> </b> 
    <b> <i> Also Important </i> more text <i> not important </i> </b> 

    <h2><b> <i> I don't want this text either</i></b></h2> 
EOT 


doc = Nokogiri::HTML(html) 

nodes = doc.search('b i') 

nodes.each { |e| puts e } 

# Expected output: 
# Important text here 
# Also Important 

Answer

require "nokogiri"
html = <<EOT 
    <b><i>Important text here</i>more text<i>not important text here</i></b> 
    <b><i>Also Important</i>more text<i>not important</i></b> 

    <h2><b><i>I don't want this text either</i></b></h2> 
EOT 


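doc = Nokogiri::HTML(html)
# Visit every <b>, skip those whose direct parent is an <h2>,
# and print the first text node found inside the <b>'s child elements.
nodes = doc.search('b')
nodes.each { |e| puts e.children.children.first unless e.parent.name == "h2" }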

Or using XPath:

nodes = doc.xpath("//../*[local-name() != 'h2']/b/i[1]") 
nodes.each { |e| puts e.children.first}