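rvest web scraping problem / car trading website

I want to scrape specific parts of a website (a car sales platform) with rvest.

The CSS is too confusing for me to work out on my own what is going wrong.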

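#### scraping the website www.otomoto.pl with used cars ##### 

library(rvest)    # read_html(), html_nodes(), html_text(), %>%
library(stringr)  # str_replace_all(), str_trim()

baseURL_otomoto = "https://www.otomoto.pl/osobowe/?page=" 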

i <- 1 

for (i in 1:7000) 
{ 
    link = paste0(baseURL_otomoto,i) 
    out = read_html(link) 
    print(i) 
    print(link) 

    ### building year 
    build_year = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>% 
    html_text() %>% 
    str_replace_all("\n","") %>% 
    str_replace_all("\r","") %>% 
    str_trim() 

    mileage = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[2]') %>% 
    html_text() %>% 
    str_replace_all("\n","") %>% 
    str_replace_all("\r","") %>% 
    str_trim() 

    volume = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[3]') %>% 
    html_text() %>% 
    str_replace_all("\n","") %>% 
    str_replace_all("\r","") %>% 
    str_trim() 

    fuel_type = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[4]') %>% 
    html_text() %>% 
    str_replace_all("\n","") %>% 
    str_replace_all("\r","") %>% 
    str_trim() 


    price = html_nodes(out, xpath = '//div[@class="offer-item__price"]') %>% 
    html_text() %>% 
    str_replace_all("\n","") %>% 
    str_replace_all("\r","") %>% 
    str_trim() 

    link = html_nodes(out, xpath = '//div[@class="offer-item__title"]') %>% 
    html_text() %>% 
    str_replace_all("\n","") %>% 
    str_replace_all("\r","") %>% 
    str_trim() 

    offer_details = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul') %>% 
    html_text() %>% 
    str_replace_all("\n","") %>% 
    str_replace_all("\r","") %>% 
    str_trim()
}

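Any guess what might be causing this behavior?

PS #1.

How can I get all the build_year, mileage and fuel_type data available on the analysed page at once as a data.frame? Using classes (xpath = '//div[@class=...') does not work in my case.

PS #2.

I would like to get to the details of the actual offers with, for instance: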

gear_type = html_nodes(out, xpath = '//*[@id="parameters"]/ul[1]/li[10]/div') %>% 
    html_text() %>% 
    str_replace_all("\n","") %>% 
    str_replace_all("\r","") %>% 
    str_trim()

with parameters in ul[a] and li[b], where a is in (1:2) and b is in (1:12).

Unfortunately, this concept fails because the resulting data frame is empty. Any guess why?

Source

2017-04-15 Wojciech Niemczyk

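First, learn about CSS selectors and XPath. Your selectors are very long and extremely fragile (some of them did not work for me at all two weeks later). For example, instead of: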

html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>% 
    html_text()

you can write:

html_nodes(out, css="[data-code=year]") %>% html_text()

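Second, read the documentation of the libraries you use. The str_replace_all pattern can be a regular expression, which saves you one call (use str_replace_all("[\n\r]", "") instead of str_replace_all("\n","") %>% str_replace_all("\r","")). html_text can trim the text for you, which means str_trim() is not needed at all.

Third, if you find yourself copy-pasting code, step back and consider whether a function would be a better solution; usually it would. In your case, I would personally skip the str_replace_all calls until the data cleaning step and run them once on the data.frame that holds all of the scraped data.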
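The three points above can be sketched as follows. This is only an illustration: the HTML fragment is made up to imitate the data-code attributes used in this answer (the real otomoto.pl markup may differ), and the helper name extract_param is hypothetical, not part of rvest.

```r
library(rvest)
library(stringr)

# Made-up fragment standing in for one listing page (an assumption).
page <- read_html('<ul class="offer-item__params">
    <li data-code="year">
      2015
    </li>
    <li data-code="mileage"> 120 000 km </li>
  </ul>')

# Hypothetical helper: one short attribute selector per field,
# with whitespace trimming delegated to html_text().
extract_param <- function(doc, code) {
  doc %>%
    html_nodes(css = paste0("[data-code=", code, "]")) %>%
    html_text(trim = TRUE)
}

build_year <- extract_param(page, "year")
mileage    <- extract_param(page, "mileage")

# A single regular expression removes both line-break characters in one call.
cleaned <- str_replace_all("line1\r\nline2", "[\n\r]", "")
```

With a helper like this, each additional field costs one line instead of a copy of the whole pipeline.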

To create a data.frame from your data, call the data.frame() function with column names and their contents, like this:

data.frame(build_year = build_year, 
    mileage = mileage, 
    volume = volume, 
    fuel_type = fuel_type, 
    price = price, 
    link = link, 
    offer_details = offer_details)

Or you can initialize the data.frame with a single column and then add further vectors as columns:

output_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE)) 
output_df$volume <- html_nodes(out, css="[data-code=engine_capacity]") %>% 
    html_text(TRUE)

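Finally, note that all data.frame columns must be of the same length, while some of the data you scrape is optional. At the time of writing this answer, I saw a few offers without engine capacity and a few without an offer description. You have to use two html_nodes calls in succession (because a single CSS selector will not match content that does not exist). But even then, html_nodes silently drops the missing data. This can be worked around by piping the html_nodes output into an html_node call: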

current_df$volume = out %>% html_nodes("ul.offer-item__params") %>% 
    html_node("[data-code=engine_capacity]") %>% 
    html_text(TRUE)
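To see the difference in isolation, here is a small sketch on made-up markup (an assumption, not the real page): two offers, of which the second has no engine capacity.

```r
library(rvest)

# Made-up markup: the second offer lacks the engine_capacity item.
page <- read_html('<div>
  <ul class="offer-item__params">
    <li data-code="year">2015</li>
    <li data-code="engine_capacity">1 598 cm3</li>
  </ul>
  <ul class="offer-item__params">
    <li data-code="year">2017</li>
  </ul>
</div>')

# html_nodes() returns only what matched, so the missing value vanishes
# and the result is shorter than the number of offers.
dropped <- page %>%
  html_nodes("[data-code=engine_capacity]") %>%
  html_text(TRUE)

# html_node() applied to each offer yields exactly one result per offer,
# with NA where nothing matched.
kept <- page %>%
  html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(TRUE)

length(dropped)
length(kept)
```

Here dropped should have one element while kept has two, the second being NA, so only kept can safely be assigned as a column next to the other per-offer vectors.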

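The final version of the inside of my loop is below. Just make sure that you initialize an empty data.frame before the loop and merge the output of the current iteration into the final data.frame (for example with rbind), or each iteration will overwrite the results of the previous one. Alternatively, you could use do.call(rbind, lapply()), which is idiomatic R for this kind of task.

As a side note, when scraping large amounts of fast-changing data, consider decoupling the data download and data processing steps. Imagine that some corner case you have not accounted for causes R to terminate. How will you proceed if that happens in the middle of the iterations? The longer you stay on one page, the more duplicates you introduce (because new offers appear and existing ones are pushed down to later pages) and the more offers you miss (because sales conclude and the offers disappear for good).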

current_df <- data.frame(build_year = html_nodes(out, css="[data-code=year]") %>% html_text(TRUE)) 

current_df$mileage = html_nodes(out, css="[data-code=mileage]") %>% 
    html_text(TRUE) 

current_df$volume = out %>% html_nodes("ul.offer-item__params") %>% 
    html_node("[data-code=engine_capacity]") %>% 
    html_text(TRUE) 

current_df$fuel_type = html_nodes(out, css="[data-code=fuel_type]") %>% 
    html_text(TRUE) 

current_df$price = out %>% html_nodes(xpath="//div[@class='offer-price']//span[contains(@class, 'number')]") %>% 
    html_text(TRUE) 

current_df$link = out %>% html_nodes(css = "div.offer-item__title h2 > a") %>% 
    html_text(TRUE) %>% 
    str_replace_all("[\n\r]", "") 

current_df$offer_details = out %>% html_nodes("div.offer-item__title") %>% 
    html_node("h3") %>% 
    html_text(TRUE)
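The do.call(rbind, lapply()) idiom and the decoupling of download and processing can be combined roughly as follows. This is a sketch under stated assumptions: the two pages are made up and written to disk with writeLines() as a stand-in for a real download step, and parse_page is a hypothetical helper, not part of rvest.

```r
library(rvest)

# Step 1 (download): keep raw pages on disk so that processing can be
# re-run after a crash without fetching anything again.
# Assumption: made-up pages are written here instead of real downloads.
cache_dir <- tempfile("otomoto_cache")
dir.create(cache_dir)
fake_pages <- c(
  '<ul class="offer-item__params"><li data-code="year">2015</li><li data-code="mileage">120 000 km</li></ul>',
  '<ul class="offer-item__params"><li data-code="year">2017</li></ul>'
)
for (i in seq_along(fake_pages)) {
  writeLines(fake_pages[i], file.path(cache_dir, sprintf("page_%04d.html", i)))
}

# Step 2 (process): hypothetical parser turning one saved page into a
# data.frame with one row per offer; html_node() keeps missing values as NA.
parse_page <- function(path) {
  offers <- read_html(path) %>% html_nodes("ul.offer-item__params")
  data.frame(
    build_year = offers %>% html_node("[data-code=year]") %>% html_text(TRUE),
    mileage    = offers %>% html_node("[data-code=mileage]") %>% html_text(TRUE),
    stringsAsFactors = FALSE
  )
}

# Parse every cached page and stack the per-page data.frames.
files <- list.files(cache_dir, pattern = "\\.html$", full.names = TRUE)
all_offers <- do.call(rbind, lapply(files, parse_page))
```

Because every page is parsed from its saved copy, a failure in parse_page only requires fixing the parser and re-running the second step.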

Source

2017-04-29 12:23:57

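Thanks, Mirosław. Your comments and suggestions definitely add huge value. I will get back with the final (working) result as soon as possible.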
