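Finally, the code we use is shown below. It is written for the example given in the question, but it also works for the general case of HTML markup nested to arbitrary depth. (Which is what we need.)
In addition, we implemented it so that runs of extra (≥2) whitespace characters are ignored. That is why we have to search for the end of the match and cannot simply combine the length of the search string/quote with the start position of the match: the number of whitespace characters in the search string and in the matched text may differ.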
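To illustrate the point about whitespace, here is a minimal sketch in plain Ruby (no Nokogiri needed). The helper name `quote_pattern` and the sample text with extra whitespace are assumptions made up for this illustration; the pattern itself is built the same way as in the code of this answer.

```ruby
# Build a whitespace-tolerant, case-insensitive pattern from a quote.
# (quote_pattern is a hypothetical helper name used only for this sketch.)
def quote_pattern(quote)
  words = quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }
  Regexp.new(words.join('[[:space:]]+'), Regexp::IGNORECASE)
end

text  = "Lorem ipsum   \n dolor sit amet."   # several whitespace characters between the words
quote = 'ipsum dolor'                        # a single space between the words

m = text.match(quote_pattern(quote))
start_index = m.begin(0)   # index of the first matched character
end_index   = m.end(0)     # index just past the last matched character

puts start_index               # 6
puts end_index                 # 21
puts start_index + quote.size  # 17, which would cut the match short
```

The match spans 15 characters here while the quote has only 11, so the end position has to be taken from the match itself.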
require 'nokogiri'

doc = Nokogiri::HTML.fragment("<div>Lorem ipsum <b>dolor sit</b> amet.</div>")
quote = 'ipsum dolor'

# Find search string in document text, "plain text in plain text".
quote_query =
  quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }.join('[[:space:]]+')
start_index = doc.text.index(/#{quote_query}/i)
end_index = start_index + doc.text[/#{quote_query}/i].size

# Find XPath values and character indexes for start and stop of search match.
# For that, walk through all text nodes and count characters until reaching
# the start and end positions of the search match.
start_xpath, start_offset, end_xpath, end_offset = nil
i = 0
doc.xpath('.//text() | text()').each do |x|
  offset = 0
  x.text.each_char do
    if i == start_index
      # Add the text length of all preceding siblings, so that the offset
      # is relative to the parent element rather than to this text node.
      e = x.previous
      sum = 0
      while e
        sum += e.text.size
        e = e.previous
      end
      # Drop the leading "?" and the trailing "/text()..." step from the path.
      start_xpath = x.path.gsub(/^\?/, '').gsub(
        /#{Regexp.quote('/text()')}.*$/, ''
      )
      start_offset = offset + sum
    end
    # Separate `if` (not `elsif`), so that a one-character match, where the
    # first and last matched character coincide, still sets the end values.
    if i + 1 == end_index
      e = x.previous
      sum = 0
      while e
        sum += e.text.size
        e = e.previous
      end
      end_xpath = x.path.gsub(/^\?/, '').gsub(
        /#{Regexp.quote('/text()')}.*$/, ''
      )
      end_offset = offset + 1 + sum
    end
    offset += 1
    i += 1
  end
end
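At this point we can retrieve the XPath values for the start and end of the search match (and, in addition, the character offsets that point to the exact characters where the match starts and ends inside the elements those XPaths specify). We get: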
puts start_xpath
/div
puts start_offset
6
puts end_xpath
/div/b
puts end_offset
5
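The character-counting walk in the code above can also be expressed without stepping through every character. The following is only a sketch in plain Ruby, under the assumption that the text nodes are represented as an array of strings in document order; `locate` and `chunks` are hypothetical names, and the sketch returns offsets local to each text node, without adding the lengths of preceding siblings as the Nokogiri code does.

```ruby
# Map a character index in the concatenated text to
# [chunk_index, offset_inside_that_chunk]; nil if the index is out of range.
def locate(chunks, index)
  consumed = 0  # number of characters in the chunks already passed
  chunks.each_with_index do |chunk, n|
    return [n, index - consumed] if index < consumed + chunk.size
    consumed += chunk.size
  end
  nil
end

# Stand-ins for the three text nodes of the example document.
chunks = ['Lorem ipsum ', 'dolor sit', ' amet.']

start_index = 6    # first character of the match in the full text
end_index   = 17   # position just past the last matched character

start_node, start_local = locate(chunks, start_index)
end_node, end_local     = locate(chunks, end_index - 1)

puts start_node, start_local    # 0 and 6
puts end_node, end_local + 1    # 1 and 5
```

For the example document these agree with the offsets printed above, because neither text node that contains an end of the match has preceding siblings with text.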