2012-08-01

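I have an XPath query whose results I write out as array elements using Axlsx. I need to tidy up my output under certain conditions, one of which is the 'Software included' element. How do I check whether an array element exists and, if it does, change its output?

My XPath scrapes the following URL: http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1

My code sample is below: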

clues = ['Optical drive', 'Pointing device', 'Software included']

# %s is replaced with each clue to find the cell next to its label.
selector = "//td[text()='%s']/following-sibling::td"

# doc is the Nokogiri document parsed from the URL above.
data = clues.map do |clue|
  xpath = selector % clue
  [clue, doc.at(xpath).text.strip]
end

Axlsx::Package.new do |p|
  p.workbook.add_worksheet do |sheet|
    data.each { |datum| sheet.add_row datum }
  end
  p.serialize 'output.xlsx'
end

My current output format:

[image: current output]

My desired output format:

[image: desired output]

Answer

If you can rely on the data always using ';' as the delimiter, have a go at this:

data = []
clues.each do |clue|
  xpath = selector % clue
  details = doc.at(xpath).text.split(';').map(&:strip)
  # shift (not pop) so the clue label sits next to the first detail.
  data << [clue, details.shift]
  details.each { |detail| data << ['', detail] }
end
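
To make the resulting row layout concrete, here is a minimal sketch of the same splitting step run on a plain string, so it needs neither Nokogiri nor Axlsx. The helper name rows_for and the sample text are assumptions made up for illustration; they are not part of the code above.

```ruby
# Hypothetical helper: turn one clue and its raw text into spreadsheet rows.
# The first row carries the clue label; later rows leave that column blank.
def rows_for(clue, text, separator = ';')
  details = text.split(separator).map(&:strip).reject(&:empty?)
  return [[clue, '']] if details.empty?

  first, *rest = details
  [[clue, first]] + rest.map { |detail| ['', detail] }
end

# The sample text below is invented for this example.
rows = rows_for('Software included', 'Photo editor; Media player ;Backup tool')
rows.each { |row| p row }
```

Each array returned by rows_for can be handed to sheet.add_row in the same way as the elements of data.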

Build the data this way before the Axlsx::Package.new block.

In answer to your comment/question: you would have to do something like this ;)

require 'nokogiri'
require 'open-uri'
require 'axlsx'

class Scraper
  def initialize(url, selector)
    @url = url
    @selector = selector
  end

  # Custom parsers keyed by clue name.
  def hooks
    @hooks ||= {}
  end

  def add_hook(clue, p_roc)
    hooks[clue] = p_roc
  end

  def export(file_name)
    Scraper.clues.each do |clue|
      details = parse_clue(clue)
      next unless details # clue not found on the page

      # The first row carries the clue label; the rest leave it blank.
      output << [clue, details.shift]
      details.each { |datum| output << ['', datum] }
    end
    serialize(file_name)
  end

  # Kept above `private`, which does not apply to class methods anyway.
  def self.clues
    @clues ||= ['Operating system', 'Processors', 'Chipset', 'Memory type', 'Hard drive', 'Graphics',
                'Ports', 'Webcam', 'Pointing device', 'Keyboard', 'Network interface', 'Wireless',
                'Power supply type', 'Energy efficiency', 'Weight', 'Minimum dimensions (W x D x H)',
                'Warranty', 'Software included', 'Product color']
  end

  private

  def doc
    @doc ||= begin
      # URI.open needs Ruby 2.5+; older versions used Kernel#open here.
      Nokogiri::HTML(URI.open(@url))
    rescue StandardError
      raise ArgumentError, 'Invalid URL - Nothing to parse'
    end
  end

  def output
    @output ||= []
  end

  def selector_for_clue(clue)
    @selector % clue
  end

  # Returns an array of detail strings, or nil when the element is missing.
  def parse_clue(clue)
    element = doc.at(selector_for_clue(clue))
    return unless element

    # map, not each: each would return the unstripped strings.
    call_hook(clue, element) || element.inner_html.split('<br>').map(&:strip)
  end

  # Returns nil when no hook is registered for the clue.
  def call_hook(clue, element)
    return unless hooks[clue].is_a?(Proc)

    value = hooks[clue].call(element)
    value.is_a?(Array) ? value : [value]
  end

  def package
    @package ||= Axlsx::Package.new
  end

  def serialize(file_name)
    package.workbook.add_worksheet do |sheet|
      output.each { |datum| sheet.add_row datum }
    end
    package.serialize(file_name)
  end
end

scraper = Scraper.new("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1", "//td[text()='%s']/following-sibling::td")

# Define a custom action to take against any elements found.
os_parse = proc do |element|
  element.inner_html.split('<br>').map { |line| line.strip.upcase }
end

scraper.add_hook('Operating system', os_parse)

scraper.export('foo.xlsx')
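
The hook idea does not strictly need a class. As a rough sketch, a Hash that maps clue names to lambdas covers the "if this element exists, treat it differently" case in a few lines. The names formatters and format_detail and the sample strings are hypothetical, and the text is passed in directly rather than scraped, so this runs without Nokogiri or Axlsx.

```ruby
# Hypothetical lookup table: clue name => lambda that reshapes its raw text.
formatters = {
  'Software included' => ->(text) { text.split(';').map(&:strip) },
  'Operating system'  => ->(text) { [text.strip.upcase] }
}

# Returns an array of detail strings; [] when the element was missing (nil).
# Clues without a special rule fall back to a single trimmed value.
def format_detail(clue, text, formatters)
  return [] if text.nil?

  formatter = formatters.fetch(clue) { ->(raw) { [raw.strip] } }
  formatter.call(text)
end

# Sample inputs invented for this example.
p format_detail('Software included', 'Photo editor; Media player', formatters)
p format_detail('Pointing device', '  Touchpad ', formatters)
p format_detail('Optical drive', nil, formatters)
```

With Nokogiri, the text argument could come from doc.at(xpath)&.text, since at returns nil when nothing matches the XPath.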

And the final answer is... a gem.

http://rubydoc.info/gems/ninja2k/0.0.2/frames

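Hi Randy, unfortunately the ';' was added manually by me. Is there a way to perform an action such as: if array.element == "Software includes" do function {}? – Ninja2k 2012-08-02 06:54:16

I have edited the answer to show one way this can be done. – randym 2012-08-02 09:02:47

Damn, that's almost a whole new gem :P Thank you, but it's over my head. Is there a way to make it really simple? Like a 4-line operation? – Ninja2k 2012-08-02 11:55:41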
