Difference between the source code downloaded by R and the website's source code

I am extracting information about some products from a website, but I am having trouble with the prices. My code is as follows:

> enlace<-"http://www.carulla.com/products/0000687608965009/Crema+Dental+Sensitive+Proalivio+Colgate" 
> download.file(enlace, destfile = "scrapedpage.html", quiet=TRUE) 
> doc<-read_html("scrapedpage.html") 
> # description 
> toString(xml_find_all(doc,xpath=paste0('//*[@id="pdpProduct"]/div[3]/h3'))) 
[1] "<h3 class=\"pdpInfoProductName\" itemprop=\"name\">Crema Dental Sensitive Proalivio Colgate</h3>" 
> # reference 
> toString(xml_find_all(doc,xpath=paste0('//*[@id="pdpProduct"]/div[3]/p'))) 
[1] "<p class=\"pdpInfoProductRef\">\r\n\t\t\t\t\t\t\t\t\tPresentación:C \r\n\t\t\t\t\t\t\t\t\tPLU:739983</p>" 
> # prices 
> toString(xml_find_all(doc,xpath=paste0('//*[@id="pdpProduct"]/div[3]/div[1]/div[2]/h4'))) 
[1] ""

When I open the original page in a browser and inspect its source code, I find this information:

<div class="pdpInfoProduct pull-left"> 
      <h3 class="pdpInfoProductName" itemprop="name">Crema Dental Sensitive Proalivio Colgate</h3> 
      <h2 class="pdpInfoProductBrand" itemprop="brand">COLGATE</h2> 
      <p class="pdpInfoProductRef"> 
           Presentación:C&nbsp; 
           PLU:739983</p> 
         <div class="pdpInfoProductPrices"> 
       <div class="pull-right"> 
          <div class="pro-big-Ovalo"> 
           <p>25%</p> 
          </div> 
         </div> 
        <div class="pdpInfoProductPrice" itemprop="offers" itemscope itemtype="http://schema.org/Offer"> 

       <meta itemprop="priceCurrency" content="COP" /> 
        <meta itemprop="price" content="17213.0" /> 
        <h4 class="priceOffer"> 
         $17.213</h4> 
        <h6 class="before">Antes: <span class="strikeText"> 
           $22.950</span> 
         </h6> 
        </div> 
      </div>

The information I am interested in is $17.213, but when I download the source code with R, I get the following:

> con2<-url(enlace,"r") 
> x<-readLines(con2) 
> close(con2) 
> x[1270:1285] 
[1] "\t\t\t\t\t\t\t\t\tPLU:739983</p>"                                     
[2] "\t\t\t\t\t\t\t<div class=\"pdpInfoProductPrices\">\t"                               
[3] "\t\t\t\t\t<div class=\"pdpInfoProductPrice\" itemprop=\"offers\" itemscope itemtype=\"http://schema.org/Offer\">"                
[4] "\t\t\t\t\t"                                         
[5] "\t\t\t\t\t<meta itemprop=\"priceCurrency\" content=\"COP\" />"                            
[6] "      <meta itemprop=\"price\" content=\"\" />"                          
[7] "\t\t\t\t\t\t<h4 class=\"price\">"                                    
[8] "\t\t\t\t\t\t\t</h4>"                                       
[9] "\t\t\t\t\t\t</div>"                                       
[10] "\t\t\t\t</div>"                                        
[11] "\t\t\t\t"                                         
[12] "\t\t\t\t\t\t\t\t\t"                                        
[13] "\t\t\t\t\t\t\t\t\t\t\t\t\t <div class=\"product-seller row-fluid\">"                             
[14] "\t\t\t\t  <!-- +++++ Carulla Seller +++++ -->            "                   
[15] "        <p> Vendido por: &nbsp Carulla</p>                          " 
[16] "     </div>"

That is, I get \t\t\t\t\t\t\t instead of $17.213.

I would greatly appreciate your help.

2017-04-26 fcochaux

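The site probably checks the User-Agent (UA) and cookies to stop you from doing what you are doing. I just tried downloading the page with wget and got a flat 403 Forbidden error.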

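Nowadays, plain web scraping is somewhat outdated, at least for commercial web pages. There are some workarounds (for example, check the help for download.file(), and read the man pages of wget and curl to see how to change the UA and import cookies), but if you really want to do this at scale, you will probably need to look at browser scripting and then import that data into R.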
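As a rough illustration of the UA workaround, here is a minimal sketch in R. It assumes the `HTTPUserAgent` option described in `?options` and `?download.file`. The helper names `fetch_with_ua` and `extract_price` and the UA string are made up for this example. It does not handle cookies, and the site may still refuse the request or return an empty price. The parsing step runs offline on a fragment of the markup shown in the question.

```r
library(xml2)

# Hypothetical helper: download a page while sending a browser-like User-Agent.
# The UA string is an arbitrary example, not something the site is known to accept.
fetch_with_ua <- function(url, destfile,
                          ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36") {
  old <- options(HTTPUserAgent = ua)  # temporarily override R's default UA
  on.exit(options(old), add = TRUE)   # restore the previous setting on exit
  download.file(url, destfile = destfile, quiet = TRUE)
  invisible(destfile)
}

# Hypothetical helper: read the price from the schema.org <meta> tag.
# Returns NA when the tag is missing or its content attribute is empty.
extract_price <- function(doc) {
  node <- xml_find_first(doc, '//meta[@itemprop="price"]')
  as.numeric(xml_attr(node, "content"))
}

# Offline demo on a fragment copied from the browser's view of the page.
snippet <- '<div class="pdpInfoProductPrice" itemprop="offers">
  <meta itemprop="priceCurrency" content="COP" />
  <meta itemprop="price" content="17213.0" />
  <h4 class="priceOffer">$17.213</h4>
</div>'
price <- extract_price(read_html(snippet))
print(price)
```

Against the live site, one would call something like `fetch_with_ua(enlace, "scrapedpage.html")` followed by `extract_price(read_html("scrapedpage.html"))`. If the result is still `NA`, the price is likely tied to cookies or session state, which this sketch does not cover.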

Keep in mind that you are doing something the site owner does not want you to do. In any case, this has almost nothing to do with R.

2017-04-26 15:42:47
