Counting HTML tags the Ruby way (inject, blocks, each...)

I want to count the occurrences of several HTML tags in a page. I could do it the classic way, but I am trying to do it the Ruby way.

This is what I have, but instead of adding up the individual counts, it produces a string made of the elements of the list:

tags = [ '<img>', '<script>', '<applet>', '<video>', '<audio>' ] 
weight = tags.each { |tag| web.to_s.scan(/#{tag}/).length }.inject(:+)

Any hints?

Edit:

def browse startpage, depth, block
  if depth > 0
    begin
      web = open(startpage).read
      block.call startpage, web
    rescue
      return
    end
    links = URI.extract(web)
    links.each { |link| browse link, depth-1, block }
  end
end

browse("https://www.youtube.com/", 2, lambda { |page_name, web|
  tags = [ '<img>', '<script>', '<applet>', '<video>', '<audio>' ]
  web.force_encoding 'utf-8'
  parsed_string = Nokogiri::HTML(web)
  weight = tags.each_with_object(Hash.new(0)) do |tag, hash|
    occurrences = parsed_string.xpath("//#{tag.gsub(/[<>]/, '')}").length
    hash[tag] = occurrences
  end
  puts "Page weight for #{web.base_uri} = #{weight}"
})


2014-11-06 dabadaba

Just replace `each` with `map`. – 2014-11-06 12:36:44
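A minimal sketch of what that comment suggests. The sample string standing in for `web` is an assumption made for illustration. `each` returns its receiver (the `tags` array), so `inject(:+)` ends up concatenating the tag strings, whereas `map` returns the block results, so `inject(:+)` adds the counts:

```ruby
# Sample input standing in for the page body (an assumption for illustration).
web = "<audio> <audio> <video>"
tags = ['<img>', '<script>', '<applet>', '<video>', '<audio>']

# each ignores the block's return value and returns tags itself,
# so inject(:+) joins the five tag strings together.
concatenated = tags.each { |tag| web.scan(/#{tag}/).length }.inject(:+)

# map collects the block's return values (the counts),
# so inject(:+) adds them up.
weight = tags.map { |tag| web.scan(/#{tag}/).length }.inject(:+)

puts concatenated # prints <img><script><applet><video><audio>
puts weight       # prints 3
```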

Remove the rescue block together with the return; it makes debugging impossible. – daremkd 2014-11-06 13:10:11
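If dropping the rescue entirely is not an option, one alternative is to report the error before moving on. The sketch below is only an illustration: `fetch_with_logging` is a hypothetical helper and the URL is a placeholder, neither is from the question. It may also be relevant that the question's lambda calls `web.base_uri` on the String returned by `read`; that would presumably raise `NoMethodError`, which the bare rescue then hides.

```ruby
# fetch_with_logging is a hypothetical helper, not part of the question's code.
# It runs the block and reports any error instead of silently returning.
def fetch_with_logging(url)
  yield url
rescue StandardError => e
  # Write the failure to stderr so it stays visible while debugging.
  warn "Skipping #{url}: #{e.class}: #{e.message}"
  nil
end

# Simulated failure; the URL is a placeholder and no network access happens.
result = fetch_with_logging("https://example.invalid/") do |url|
  raise IOError, "could not open #{url}"
end

p result # prints nil, after the warning on stderr
```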

Here is one way to solve your problem:

web = "<audio> <audio> <video>" # I guess 'web' is other than a string in your example, so the need for to_s below 
tags = [ '<img>', '<script>', '<applet>', '<video>', '<audio>' ] 

tag_occurrences = tags.each_with_object(Hash.new(0)) do |tag, hash| 
    occurrences = web.to_s.scan(/#{tag}/).length 
    hash[tag] = occurrences 
end 

p tag_occurrences #=> {"<img>"=>0, "<script>"=>0, "<applet>"=>0, "<video>"=>1, "<audio>"=>2}

Matching tags with regular expressions is not recommended, though. A better approach is to use something like Nokogiri to count the tags:

require 'nokogiri' 
web = "<audio> <audio> <video>" 
parsed_string = Nokogiri::HTML(web.to_s) #using to_s because I'm assuming web isn't an actual string in your code 
tags = [ '<img>', '<script>', '<applet>', '<video>', '<audio>' ] 

tag_occurrences = tags.each_with_object(Hash.new(0)) do |tag, hash| 
    occurrences = parsed_string.xpath("//#{tag.gsub(/[<>]/, '')}").length 
    hash[tag] = occurrences 
end 

p tag_occurrences #=> {"<img>"=>0, "<script>"=>0, "<applet>"=>0, "<video>"=>1, "<audio>"=>2}
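One likely reason the regex version undercounts on real pages (an assumption, the answer does not spell it out) is that tags there usually carry attributes, so the literal pattern `<img>` never matches `<img src="...">`. The sample HTML below is made up for illustration:

```ruby
# Made-up sample markup: one img with an attribute, one bare img.
html = '<img src="a.png"> <img> <script type="text/javascript"></script>'

# The literal pattern only matches a bare <img> tag.
literal_count = html.scan(/<img>/).length

# A looser pattern that tolerates attributes. It is still fragile
# (comments, quoted ">" characters), which is why a parser is preferable.
loose_count = html.scan(/<img\b[^>]*>/i).length

puts literal_count # prints 1
puts loose_count   # prints 2
```

The XPath query in the Nokogiri version matches on the element name, so attributes do not affect the count.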

Regarding your comment: I ran this against YouTube (using my second snippet to process the data) and got:

require 'open-uri' 
web = open('http://youtube.com').read 
# the code above to parse web using Nokogiri 
p tag_occurrences #=> {"<img>"=>151, "<script>"=>13, "<applet>"=>0, "<video>"=>0, "<audio>"=>0}


2014-11-06 12:21:36 daremkd

I get a count of 0 tags with both methods. The page I am using is YouTube. What is wrong? – dabadaba 2014-11-06 12:50:12

What are you using to fetch the data from YouTube? – daremkd 2014-11-06 12:52:19

See my updated answer regarding your YouTube comment. – daremkd 2014-11-06 12:55:10

I would traverse the document once and count the node names:

doc = Nokogiri::HTML(open('https://www.youtube.com/')) 
tags_count = Hash.new(0) 
doc.traverse { |node| tags_count[node.name] += 1 } 
tags_count 
#=> {"html"=>2, "#cdata-section"=>12, "script"=>15, "text"=>7958, "link"=>11, "title"=>1, "meta"=>4, "comment"=>18, "head"=>1, "div"=>1152, "input"=>2, "form"=>2, "img"=>135, "span"=>2878, "a"=>397, "button"=>434, "label"=>1, "li"=>740, "ul"=>265, "hr"=>3, "h3"=>117, "p"=>48, "br"=>3, "strong"=>2, "ol"=>1, "h2"=>26, "b"=>5, "body"=>1, "document"=>1}
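From a hash like that, the counts for just the tags in the question can be picked out afterwards. This is a sketch, not part of the answer: `wanted`, `per_tag` and `weight` are names introduced here, and the sample hash copies only three entries from the output above so that it runs without Nokogiri or network access.

```ruby
# A few entries copied from the output above; Hash.new(0) keeps the
# default of 0 for node names that never appeared in the document.
tags_count = Hash.new(0)
tags_count.update("script" => 15, "img" => 135, "div" => 1152)

# Node names are bare ("img"), not bracketed ("<img>").
wanted = %w[img script applet video audio]

# Look up each wanted tag; missing ones fall back to the default 0.
per_tag = wanted.map { |name| [name, tags_count[name]] }.to_h
weight = per_tag.values.inject(:+)

p per_tag # prints {"img"=>135, "script"=>15, "applet"=>0, "video"=>0, "audio"=>0}
p weight  # prints 150
```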


2014-11-06 13:32:26 Stefan
