How do I find text across HTML tag boundaries (with XPath pointers as the result)?

I have HTML like this:

<div>Lorem ipsum <b>dolor sit</b> amet.</div>

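How can I find a simple text-based match for my search string ipsum dolor in this HTML? I need the XPath node pointers for the start and end of the match, plus character indexes that point inside those start and end nodes. I use Nokogiri to work with the DOM, but any Ruby solution is fine.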

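Difficulties:

I can't node.traverse {|node| … } through the DOM and run a plain-text search whenever a text node is encountered, because my search string can span tag boundaries.
I can't convert the HTML to plain text and then run a plain-text search, because I need XPath pointers as the result.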
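
The first difficulty can be illustrated without a DOM at all. This is only a minimal sketch: the array `text_nodes` is a hypothetical stand-in for the three text nodes of the example, not a Nokogiri structure. No single node contains the search string, but their concatenation does.

```ruby
# Hypothetical stand-ins for the three text nodes of
# <div>Lorem ipsum <b>dolor sit</b> amet.</div>
text_nodes = ['Lorem ipsum ', 'dolor sit', ' amet.']
quote = 'ipsum dolor'

# Searching each text node separately misses the match ...
found_in_single_node = text_nodes.any? { |t| t.include?(quote) }

# ... while the concatenated text contains it.
found_in_joined_text = text_nodes.join.include?(quote)

puts found_in_single_node
puts found_in_joined_text
```

So the search has to run on the joined text, and the match position then has to be mapped back to the individual nodes.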

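I could implement this myself with basic tree traversal, but before I do, I'm asking whether there is a Nokogiri function or trick that makes this more comfortable.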

Source

2017-09-07 tanius

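In the end, we used the code shown below. It is shown for the example given in the question, but it also works in the general case of arbitrarily deep nesting of HTML tags. (That is what we needed.)

In addition, we implemented it so that runs of excess (≥2) consecutive whitespace characters are ignored. That is why we have to search for the end of the match and cannot simply use the length of the search string / quote together with the start position of the match: the number of whitespace characters in the search string and in the matched text may differ.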
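
The whitespace handling described above can be sketched in isolation. This is an illustration under assumptions, not part of the original answer: the helper name `whitespace_tolerant_pattern` and the sample `text` are made up for the example. It shows that the end of the match has to be taken from the match itself, because start position plus quote length falls short when the text contains extra whitespace.

```ruby
# Build a case-insensitive pattern in which any run of whitespace in the quote
# matches any run of whitespace in the text.
# (Hypothetical helper name; the idea mirrors the quote_query approach.)
def whitespace_tolerant_pattern(quote)
  words = quote.split(/[[:space:]]+/).reject(&:empty?).map { |w| Regexp.quote(w) }
  Regexp.new(words.join('[[:space:]]+'), Regexp::IGNORECASE)
end

text  = "Lorem ipsum   \n dolor sit amet."  # sample text with extra whitespace
quote = 'ipsum dolor'

match = whitespace_tolerant_pattern(quote).match(text)
start_index = match.begin(0)
end_index   = match.end(0)                  # exclusive end of the actual match
naive_end   = start_index + quote.size      # wrong when whitespace differs

puts start_index
puts end_index
puts naive_end
```

Here the real match ends later than `naive_end`, since the text has five whitespace characters where the quote has one.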

 
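require 'nokogiri'

doc = Nokogiri::HTML.fragment("<div>Lorem ipsum <b>dolor sit</b> amet.</div>")
quote = 'ipsum dolor'


# Find search string in document text, "plain text in plain text".
# Any run of whitespace in the quote matches any run of whitespace in the text.

quote_query =
    quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }.join('[[:space:]]+')
match = doc.text.match(/#{quote_query}/i)
raise "Quote not found: #{quote}" unless match
start_index = match.begin(0)
end_index = match.end(0) # exclusive: index just past the last matched character


# Number of characters of the parent element's text that precede text node x
# (text of all previous siblings, including nested elements).
def chars_before(x)
  sum = 0
  e = x.previous
  while e
    sum += e.text.size
    e = e.previous
  end
  sum
end

# XPath of the element that contains text node x. Paths inside a fragment
# start with "?", which is stripped, as is the trailing "/text()" step.
def element_xpath(x)
  x.path.gsub(/^\?/, '').gsub(/#{Regexp.quote('/text()')}.*$/, '')
end


# Find XPath values and character indexes for start and stop of search match.
# For that, walk through all text nodes and count characters until reaching
# the start and end positions of the search match.

start_xpath, start_offset, end_xpath, end_offset = nil
i = 0

doc.xpath('.//text() | text()').each do |x|
  x.text.each_char.with_index do |_char, offset|
    if i == start_index
      start_xpath = element_xpath(x)
      start_offset = offset + chars_before(x)
    end
    # Separate `if` instead of `elsif`, so that a one-character match,
    # where start and end fall on the same character, also gets an end.
    if i + 1 == end_index
      end_xpath = element_xpath(x)
      end_offset = offset + 1 + chars_before(x) # exclusive end offset
    end
    i += 1
  end
end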

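At this point we can retrieve the desired XPath values for the start and end of the search match (and, in addition, the character offsets pointing to the exact characters inside the XPath-specified elements where the match starts and stops). We get: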

puts start_xpath 
    /div 
puts start_offset 
    6 
puts end_xpath 
    /div/b 
puts end_offset 
    5
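
The character-by-character walk can also be expressed with running length totals. The following is only a sketch under assumptions, not the code we used: `Segment` and `locate` are hypothetical names, the XPath strings are written by hand instead of being read from Nokogiri, and the offsets it returns are relative to the individual text node rather than to the enclosing element.

```ruby
# Hypothetical stand-in for a text node: its XPath and its text.
Segment = Struct.new(:xpath, :text)

# Map an index into the concatenated text to [xpath, offset inside that segment].
# Returns nil if the index lies beyond the end of the text.
def locate(segments, index)
  consumed = 0
  segments.each do |seg|
    return [seg.xpath, index - consumed] if index < consumed + seg.text.size
    consumed += seg.text.size
  end
  nil
end

segments = [
  Segment.new('/div/text()[1]', 'Lorem ipsum '),
  Segment.new('/div/b/text()', 'dolor sit'),
  Segment.new('/div/text()[2]', ' amet.')
]

start_index = 6    # first character of the match "ipsum dolor"
end_index   = 17   # exclusive end of the match

start_xpath, start_offset = locate(segments, start_index)
# Locate the last matched character, then add 1 for an exclusive end offset.
end_xpath, last_offset = locate(segments, end_index - 1)
end_offset = last_offset + 1

puts start_xpath, start_offset, end_xpath, end_offset
```

For this example the node-relative offsets happen to equal the element-relative ones above, because neither text node has preceding siblings.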

Source

2017-09-11 17:03:29 tanius

You can do it like this:

doc.search('div').find{|div| div.text[/ipsum dolor/]}

Source

2017-09-08 02:06:56 pguardiario
