What is the argument error in xmlValue?

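I am new to R. I want to scrape multiple HTML pages and build a data set whose columns hold specific data selected via XPath. I found a useful scraping tutorial.

My plan is to follow the script in the link, get it working and understand it, then customize it for my own website/HTML/XPath.

However, when I run the second block of code (scraping the blog posts), I get this error: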

Error in UseMethod("xmlValue") : no applicable method for 'xmlValue' applied to an object of class "xml_node".

This is the line of code that breaks:

pages<-sapply(pages,xmlValue)

The pages variable contains a node set:

{xml_nodeset (1)} 
[1] <span class="pages">Page 1 of 25</span>

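I assume that xmlValue cannot be applied to this data type, or something of that nature.

Since the tutorial's code worked for its author, I am probably missing something obvious, or there is a problem with the library load order and related function masking (although I have played around with that).

Any suggestions or help would be greatly appreciated.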

2016-11-28 Maga Karaev

It may be an RCurl issue, since the XML package works fine: doc <- htmlParse(readLines(theURL)); xpathSApply(doc, '//*[@id="leftcontent"]/div[11]/span[1]', xmlValue); # [1] "Page 1 of 25" – Parfait

@Parfait Thanks. Here is the piece of code:

theURL <- "http://www.r-bloggers.com/search/web%20scraping"
page_data <- html(theURL)
# get the total number of pages
pages <- page_data %>% html_nodes(xpath = '//*[@id="leftcontent"]/div[11]/span[1]')
pages <- sapply(pages, xmlValue)  # the error is raised on this line

I understand perfectly where the code comes from. The error may be an RCurl issue because, as I showed above, the exact XPath works in XML. – Parfait
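
For readers wondering what the error message itself means: it is raised by R's S3 dispatch when a generic function has no method for the class of its argument. The following is a minimal sketch in base R. Note that node_value is a hypothetical generic standing in for xmlValue, and the two objects are plain lists with made-up fields, not real parsed nodes.

```r
# Hypothetical S3 generic, standing in for XML::xmlValue.
node_value <- function(x, ...) UseMethod("node_value")

# A method exists only for the class used by the XML package.
node_value.XMLInternalNode <- function(x, ...) x$text

# Two fake nodes with the same content but different classes (illustrative only).
xml_pkg_node  <- structure(list(text = "Page 1 of 25"), class = "XMLInternalNode")
xml2_pkg_node <- structure(list(text = "Page 1 of 25"), class = "xml_node")

# Dispatch succeeds for the class that has a method.
ok <- node_value(xml_pkg_node)

# Dispatch fails for the other class; capture the message instead of stopping.
msg <- tryCatch(node_value(xml2_pkg_node), error = function(e) conditionMessage(e))

print(ok)
print(msg)
```

The second call fails for the same reason as in the question: the object returned by html_nodes has class "xml_node", and xmlValue has no method for that class.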

Consider using XML as the only package you need, with xpathSApply calls:

library(XML)

theURL <- "http://www.r-bloggers.com/search/web%20scraping"
page_data <- htmlParse(readLines(theURL, warn = FALSE))

# Total number of pages, parsed from text such as "Page 1 of 25".
# Query page_data here; doc is only defined inside the function below.
pages <- xpathSApply(page_data, '//*[@id="leftcontent"]/div[11]/span[1]', xmlValue)
pages <- as.numeric(regmatches(pages, regexpr("[0-9]+$", pages)))

# Scrape the posts on one parsed results page into a data frame.
scrape_r_bloggers_page <- function(doc, page){

    titles <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlValue)
    descriptions <- xpathSApply(doc, '//div[contains(@id,"post")]/div[2]/p[1]', xmlValue)
    dates <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/div', xmlValue)
    authors <- xpathSApply(doc, '//div[contains(@id,"post")]/div[1]/a', xmlValue)
    # Take the href attribute; xmlValue here would return the title text again.
    urls <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlGetAttr, "href")

    blog_posts_df <- data.frame(title=titles,
           description=descriptions,
           author=authors,
           date=dates,
           url=urls,
           page=page)
}

# First page is already parsed.
blogsdf <- scrape_r_bloggers_page(page_data, 1)

# Remaining pages, pausing one second between requests.
blogsList <- lapply(c(2:(pages-1)), function (page) {
    Sys.sleep(1)
    theURL <- paste("http://www.r-bloggers.com/search/web%20scraping/page/",page,"/",sep="")
    page_data <- htmlParse(readLines(theURL, warn = FALSE))
    scrape_r_bloggers_page(page_data, page)
})

finaldf <- rbind(blogsdf, do.call(rbind, blogsList))

2016-11-30 04:17:16 Parfait

@Parfait, thanks, it works well. All I can suggest is that the doc argument should be replaced by the page_data variable. Also, the URL XPath is retrieving the titles. Awesome!
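
The remark about URLs can be checked on a small inline document. This is a sketch under assumptions: the HTML snippet and the example.com link are invented for illustration, and only the XPath mirrors the one in the answer. It shows that xmlValue returns the link text, while xmlGetAttr with "href" returns the link target.

```r
library(XML)

# Made-up HTML fragment shaped like one search result (illustrative only).
snippet <- '<div id="post-1"><h2><a href="http://example.com/a">Title A</a></h2></div>'
doc <- htmlParse(snippet, asText = TRUE)

# Same XPath, two different extractor functions.
titles <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlValue)
urls   <- xpathSApply(doc, '//div[contains(@id,"post")]/h2/a', xmlGetAttr, "href")

print(titles)
print(urls)
```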

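That "tutorial" is a strange mix of rvest and XML. If you use rvest, then use the functions from that package, such as html_text. The xml2 package also works with rvest, but XML does not. The warning message from html should also tell you that it is deprecated.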

page_data <- html(theURL) 
##Warning message: 'html' is deprecated. 

page_data %>% 
    html_nodes(xpath='//*[@id="leftcontent"]/div[11]/span[1]') %>% 
    html_text 
[1] "Page 1 of 25"

2016-11-29 22:26:28

S, thanks. It is indeed a tricky combination of packages. The commenter above provided a good workaround.
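
For comparison, here is a sketch of the same page-count extraction done with rvest only. Assumptions: the HTML string is made up to mimic the span shown in the question, the XPath is simplified to match it, and read_html is used in place of the deprecated html. The regular expression is the one from the XML-based answer.

```r
library(rvest)

# Made-up fragment resembling the pager element from the question (illustrative only).
snippet <- '<div id="leftcontent"><span class="pages">Page 1 of 25</span></div>'
page_data <- read_html(snippet)

# Stay inside rvest: html_nodes to select, html_text to extract.
pages_text <- page_data %>%
    html_nodes(xpath = '//span[@class="pages"]') %>%
    html_text()

# Pull the trailing number out of "Page 1 of 25".
pages <- as.numeric(regmatches(pages_text, regexpr("[0-9]+$", pages_text)))

print(pages_text)
print(pages)
```

Keeping selection and extraction in the same package avoids passing an xml2 node to a function that only has methods for XML package classes.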
