2011-06-08 65 views

Parsing links that set JavaScript cookies in a web page with Ruby, Nokogiri and Mechanize

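I need to parse a web page in which every link sets JavaScript cookies. I can already parse a normal search, list every product, and import the results into a MySQL database.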

I am able to scrape the fields of each product from the search results. This is what I have:

    require 'rubygems' 
    require 'logger' 
    require 'mechanize' 
    require 'mysql2' 

    # Mechanize.new: the old WWW:: prefix is deprecated 
    agent = Mechanize.new { |a| a.log = Logger.new(STDERR) } 
    #agent.set_proxy('a-proxy', '8080') 
    agent.read_timeout = 60 

    def add_cookie(agent, uri, cookie) 
     uri = URI.parse(uri) 
     # use a distinct block variable so it does not shadow the cookie string 
     Mechanize::Cookie.parse(uri, cookie) do |parsed| 
      agent.cookie_jar.add(uri, parsed) 
     end 
    end 


    # get main page 
    page = agent.get "http://www.site.com.mx" 

    # get login form 
    form = page.forms.first 
    form.correo_ingresar = "user" 
    form.password = "password" 

    # submit login form 
    page = agent.submit form 

    # parse cookies 
    myarray = page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/) 

    # set session cookies 
    myarray.each do |item| 
     add_cookie(agent, 'http://www.site.com.mx', "#{item[0]}=#{item[1]}; path=/; domain=www.site.com.mx") 
    end 
    # show 1000 search results per page 
    add_cookie(agent, 'http://www.site.com.mx', "tampag=1000; path=/; domain=www.site.com.mx") 

    # order results 
    add_cookie(agent, 'http://www.site.com.mx', "orden_articulos=existencias asc; path=/; domain=www.site.com.mx") 

    # section results 
    add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=14; path=/; domain=www.site.com.mx") 

    # get main page 
    page = agent.get "http://www.site.com.mx/tienda/index.php" 

    search_form = page.forms.first 

    search_result = agent.submit search_form 

    doc = Nokogiri::HTML(search_result.body) 

    rows = doc.css("table.articulos tr") 

    i = 0 
    details = rows.collect do |row| 
     detail = {} 
     [ 
     [:sku, 'td[3]/text()'], 
     [:desc, 'td[4]/text()'], 
     [:qty, 'td[5]/text()'], 
     [:qty2, 'td[5]/p/b/text()'], 
     [:price, 'td[6]/text()'] 
     ].collect do |name, xpath| 
     detail[name] = row.at_xpath(xpath).to_s.strip 
     end 
     i = i + 1 
     detail 
    end 

    # walk through paginator links 
    # uniq (not uniq!) because uniq! returns nil when there are no duplicates 
    links = doc.css("a.paginar").map {|l| "http://www.site.com.mx#{l['href']}"}.uniq 

    links.each do |l| 
     page = agent.get l 

     doc = Nokogiri::HTML(page.body) 

     rows = doc.css("table.articulos tr") 

     rows.each do |row| 
      detail = {} 
      [ 
        [:sku, 'td[3]/text()'], 
        [:desc, 'td[4]/text()'], 
        [:qty, 'td[5]/text()'], 
        [:qty2, 'td[5]/p/b/text()'], 
        [:price, 'td[6]/text()'] 
      ].collect do |name, xpath| 
        detail[name] = row.at_xpath(xpath).to_s.strip 
      end 
      details << detail 
     end 
    end 

    # update db 
    client = Mysql2::Client.new(:host => "localhost", :username => "myusername", :password => "mypassword", :database => "mydatabase") 

    details.each do |d| 
     if d[:sku] != "" 
      price = d[:price].split 

      if price[1] == "D" 
       currency = 144 
      else 
       currency = 168 
      end 

      cost = price[0].gsub(",", "").to_f 

      if d[:qty] == "" 
       qty = d[:qty2] 
      else 
       qty = d[:qty] 
      end 

      # escape scraped text before interpolating it into SQL 
      sku = client.escape(d[:sku]) 
      desc = client.escape(d[:desc]) 
      stock = client.escape(qty.to_s) 

      results = client.query("SELECT * FROM jos_vm_product WHERE product_sku = '#{sku}' LIMIT 1;") 
      if results.count == 1 
       product = results.first 

       client.query("UPDATE jos_vm_product SET product_sku = '#{sku}', product_name = '#{desc}', product_desc = '#{desc}', product_in_stock = '#{stock}' WHERE product_id = #{product['product_id']};") 

       client.query("UPDATE jos_vm_product_price SET product_price = '#{cost}', product_currency = '#{currency}' WHERE product_id = '#{product['product_id']}';") 
      else 
       client.query("INSERT INTO jos_vm_product(product_sku, product_name, product_desc, product_in_stock) VALUES('#{sku}', '#{desc}', '#{desc}', '#{stock}');") 
       last_id = client.last_id 

       client.query("INSERT INTO jos_vm_product_price(product_id, product_price, product_currency) VALUES('#{last_id}', '#{cost}', #{currency});") 
      end 
     end 
    end 
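
The price handling inside the update loop can be pulled into a small helper, which makes it easier to check on its own. This is only a sketch: it assumes the price cell looks like "1,234.50 D" (amount, then an optional marker), which is inferred from the `split` and `gsub` calls above, and `parse_price` is a name made up for this example. The currency ids 144 and 168 are the ones used in the script.

```ruby
# Currency ids taken from the script above: 144 when the amount is
# followed by "D", 168 in every other case.
CURRENCY_D = 144
CURRENCY_DEFAULT = 168

# Hypothetical helper: splits a scraped price cell such as "1,234.50 D"
# (assumed format) into a numeric cost and a currency id.
def parse_price(text)
  amount, marker = text.to_s.split
  cost = amount.to_s.delete(",").to_f   # drop thousands separators
  currency = marker == "D" ? CURRENCY_D : CURRENCY_DEFAULT
  [cost, currency]
end

cost, currency = parse_price("1,234.50 D")
puts "cost=#{cost} currency=#{currency}"
```

An empty cell yields a cost of 0.0 with the default currency, so the caller may still want to skip rows without a price.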

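Now, instead of searching, I want to parse the category list.
Link to the main page: http://www.site.com.mx/tienda/articulos.php?opcion=lineas&seccion_mostrar=11. It shows a table like the one below, in which everything is a link. The top name, ACCESORIOS, is the link to the ACCESORIOS category; the bold names listed under it are subcategories, and the names under each bold name are brands. If I click ACCESORIOS, it shows every brand and every subcategory mixed together, and so on.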

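ACCESORIOS
Accesorios Multimedia (6)
ACTECK DE MEXICO (5), MANHATTAN (1)
Accesorios P/impres. Punto De Venta (1)
EPSON CORPORATION (1)
Accesorios Para Cableados De Patch Panels (1)
INTELLINET NETWORK SOLUTIONS (1)
Accesorios Para Camaras Digitales (1)
MANHATTAN (1)
Accesorios Para Computadoras De Escritorio (32)
ACTECK DE MEXICO (2), GENERICA (1), MANHATTAN (28), TARGUS (1)
Accesorios Para Computadoras Portatiles (60)
ACTECK DE MEXICO (3), GENIUS (2), HP COMERCIAL (2), HP IMPRESION (1), MANHATTAN (17), PERFECT CHOICES (32), SOLIDEX (1), TARGUS (1), TECH ZONE (1)
Accesorios Para Ipod (3)
ACTECK DE MEXICO (1), PERFECT CHOICES (2)
Accesorios Para Mesas (3)
MANHATTAN (2), PERFECT CHOICES (1)
Accesorios Para Redes (13)
INTELLINET NETWORK SOLUTIONS (5), MANHATTAN (8)
Accesoriso Para Celulares (14)
BLACKBERRY (14)
Adaptador Bluetooth (6)
ACTECK DE MEXICO (1), MANHATTAN (2), PERFECT CHOICES (3)
Adaptadores Para Mouse Y Teclado (3)
MANHATTAN (2), PERFECT CHOICES (1)
Audifono / headband and microphone (49)
ACTECK DE MEXICO (14), BTO (1), GENIUS (3), LOGITECH (2), MANHATTAN (11), PERFECT CHOICES (18)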

Here is the code of the table. Every link in it sets cookies, which is why I have been having a hard time scraping it.

<table width="95%" cellspacing="0" cellpadding="3" border="0"> 
    <tbody> 
    <tr> 
    <td valign="top" align="left" style="font-family: verdana; font-size: 12px" colspan="2"><a onClick="fijar_filtro('codigoseccion_buscar','11')" href="javascript:void(0)" class="busquedas"><b>ACCESORIOS</b></a></td> 
    </tr> 
    <tr> 
    <td width="20" valign="top" align="left"></td> 
    <td valign="top" align="left" style="font-family: verdana; font-size: 12px"><a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')" href="javascript:void(0)" class="busquedas"><b>Accesorios Multimedia</b>(6)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (5)</a>, <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','540')" href="javascript:void(0)" class="busquedas"><b>Accesorios P/impres. Punto De Venta</b>(1)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','540');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','106');" href="javascript:void(0)" class="busquedas">EPSON CORPORATION (1)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','542');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (1)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','361')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Camaras Digitales</b>(1)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','361');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','277')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras De Escritorio</b>(32)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (2)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','530');" href="javascript:void(0)" class="busquedas">GENERICA (1)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (28)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','357')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras Portatiles</b>(60)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (3)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','167');" href="javascript:void(0)" class="busquedas">GENIUS (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','694');" href="javascript:void(0)" class="busquedas">HP COMERCIAL (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','107');" href="javascript:void(0)" class="busquedas">HP IMPRESION (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (17)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (32)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','212');" href="javascript:void(0)" class="busquedas">SOLIDEX (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','691');" href="javascript:void(0)" class="busquedas">TECH ZONE (1)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1302')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Ipod</b>(3)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (2)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1175')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Mesas</b>(3)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','292')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Redes</b>(13)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (5)</a>, <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (8)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1378')" href="javascript:void(0)" class="busquedas"><b>Accesoriso Para Celulares</b>(14)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','1378');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','714');" href="javascript:void(0)" class="busquedas">BLACKBERRY (14)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1313')" href="javascript:void(0)" class="busquedas"><b>Adaptador Bluetooth</b>(6)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (3)</a><br> 
    <br> 
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','555')" href="javascript:void(0)" class="busquedas"><b>Adaptadores Para Mouse Y Teclado</b>(3)</a><br> 
    <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br> 
    </td> 
    </tr> 
    </tbody> 
    </table> 

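So the question is: what should I add to my code to be able to visit every one of these links, given that they rely on JavaScript cookies?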

Cookies used:
name, value range
codigoseccion_buscar, 11-30
codigomarca_buscar, 100-736
codigolinea_buscar, 15-1385
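
One possible starting point is to read the cookie names and values straight out of each `onClick` attribute, since the links themselves go nowhere (`href="javascript:void(0)"`). The sketch below uses only a regular expression on the attribute text. It assumes that `fijar_filtro(name, value)` also stores `name=value` as a cookie before reloading the page, which the markup suggests but does not prove; `cookies_from_onclick` is a name invented for this example.

```ruby
# Matches SetCookie('name','value') and fijar_filtro('name','value').
# Assumption: fijar_filtro sets a cookie exactly like SetCookie does.
CALL_PATTERN = /(?:SetCookie|fijar_filtro)\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/

# Hypothetical helper: turns the text of an onClick attribute into a
# hash of cookie name => cookie value.
def cookies_from_onclick(onclick)
  onclick.to_s.scan(CALL_PATTERN).to_h
end

onclick = "SetCookie('codigolinea_buscar','338');" \
          "SetCookie('codigoseccion_buscar','11');" \
          "fijar_filtro('codigomarca_buscar','602');"

cookies_from_onclick(onclick).each do |name, value|
  puts "#{name}=#{value}"
end
```

With Nokogiri, the attribute text would presumably come from each `a.busquedas` element in the category table, and the resulting pairs could then be handed to `add_cookie` before requesting the listing page.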

Answers


I managed to scrape the content of one of the links by adding its cookies to my Ruby code:

# set cookies 
    add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=11; path=/; domain=www.site.com.mx") 

    add_cookie(agent, 'http://www.site.com.mx', "codigolinea_buscar=; path=/; domain=www.site.com.mx") 

    add_cookie(agent, 'http://www.site.com.mx', "codigomarca_buscar=; path=/; domain=www.site.com.mx") 

    add_cookie(agent, 'http://www.site.com.mx', "textobuscar=; path=/; domain=www.site.com.mx") 

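The odd thing is that it does not work if I add only one of the cookies. I had to add all of them, even the ones that have no value, because every link sets its own cookies; setting them all this way removes or clears the previously saved ones.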

Now I need to scrape these cookies so I can use them as variables and run a loop or something similar. Can anyone help me?

<a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
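
A loop along those lines could start from a helper that always emits the full set of filter cookies, leaving blank the ones a given link does not set. That follows the observation above that setting only one cookie does not work. This is a rough sketch, not a tested scraper: `filter_cookie_strings` and the `categories` list are made up for illustration, and the four cookie names are the ones from the snippet above.

```ruby
# The four cookies set in the snippet above, in the same order.
FILTER_COOKIES = %w[
  codigoseccion_buscar codigolinea_buscar codigomarca_buscar textobuscar
].freeze

# Hypothetical helper: returns one cookie string per filter cookie.
# Names missing from `filters` get an empty value so that values left
# over from a previous link are cleared.
def filter_cookie_strings(filters, domain = "www.site.com.mx")
  FILTER_COOKIES.map do |name|
    "#{name}=#{filters.fetch(name, '')}; path=/; domain=#{domain}"
  end
end

# Illustrative input: one hash per link, e.g. as scraped from onClick.
categories = [
  { "codigoseccion_buscar" => "11" },
  { "codigoseccion_buscar" => "11", "codigolinea_buscar" => "542" }
]

categories.each do |filters|
  filter_cookie_strings(filters).each do |cookie|
    # In the real script this string would be passed to
    # add_cookie(agent, 'http://www.site.com.mx', cookie)
    puts cookie
  end
  # ...then fetch and parse the listing page with the agent
end
```

Keeping the cookie names in one constant means a new filter only has to be added in one place.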