Parsing specific values from an XML node

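Using R and the XML package, I parsed a page with the XML htmlParse function, which gives an object of class ("HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" "XMLAbstractDocument"). Below are the lines of the XML object that I am interested in; they contain the two values I want to return.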

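Besides the value from class = gsc_1usr_name (which returns "Konrad Wrzecionkowski"), I need the value that follows "user=", in this case "QnVgFlYAAAAJ". I have tried several syntax variants with xpathSApply and it always returns NULL. Admittedly, I am quite ignorant when it comes to XML. Any ideas? Is there a way to coerce this to a different object class, such as a list, and then use a split on the vector? Standard coercion (e.g., as.list, as.character) does not seem to work on this object class.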

search.page <- "http://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=GVN Powell World Wildlife Fund" 
x <- XML::htmlParse(search.page, encoding="UTF-8")

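This returns an XML object. Below is a subset showing a single entry out of 10. The h3 class="gsc_1usr_name" line of each entry contains the values I want to retrieve (for all 10).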

</div> 
</div> 
<div class="gsc_1usr gs_scl"> 
<div class="gsc_1usr_photo"><a href="/citations?user=QnVgFlYAAAAJ&amp;hl=en&amp;oe=ASCII"><img src="/citations?view_op=view_photo&amp;user=QnVgFlYAAAAJ&amp;citpid=3" sizes="(max-width:599px) 75px,(max-width:1251px) 100px, 120px" srcset="/citations?view_op=view_photo&amp;user=QnVgFlYAAAAJ&amp;citpid=3 128w,/citations?view_op=medium_photo&amp;user=QnVgFlYAAAAJ&amp;citpid=3 256w" alt="Konrad Wrzecionkowski"></a></div> 
<div class="gsc_1usr_text"> 
<h3 class="gsc_1usr_name"><a href="/citations?user=QnVgFlYAAAAJ&amp;hl=en&amp;oe=ASCII">Konrad Wrzecionkowski</a></h3> 
<div class="gsc_1usr_aff">Zachodniopomorski Uniwersytet Technologiczny w Szczecinie, Błękitny Patrol <span class="gs_hlt">WWF </span>Polska</div> 
<div class="gsc_1usr_eml">Verified email at <span class="gs_hlt">wwf</span>.pl</div> 
<div class="gsc_1usr_emlb">@wwf.pl</div> 
<div class="gsc_1usr_int"> 
<a class="gsc_co_int" href="/citations?view_op=search_authors&amp;hl=en&amp;oe=ASCII&amp;mauthors=label:ichtiologia_ochrona_przyrody">ichtiologia/ochrona przyrody</a> </div> 
</div> 
</div>

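Using the xpathSApply function with the following syntax I get back "GVN Powell", but I also want the value from user=. I have tried variants of h3[@user=''], including sub-queries on the class, but cannot get anything else.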

XML::xpathSApply(x, "//h3[@class='gsc_1usr_name']", xmlValue)

The approach I have been using so far is url and readLines. I then use strsplit to pull out the required value.

auth.names <- "Konrad Wrzecionkowski WWF"  
search.page <- paste("http://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=", auth.names, sep="") 

x <- readLines(url(search.page)) 
x <- strsplit(x[[1]], split="user=")[[1]][2] 
x <- strsplit(x, split="&amp;")[[1]][1]

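The problem here is that Google Scholar does not seem to like web scraping, and the code regularly fails with a "cannot open the connection, HTTP status was '503 Service Unavailable'" error. However, this does not seem to be the case with htmlParse.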

Source

2017-08-03 Jeffrey Evans

You could probably grab the `href=` attribute from the `<a>` tags - does `xpathSApply(x, "//a", xmlGetAttr, "href")` work? – thelatemail

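Unfortunately, that syntax does not work. I have 10 entries in the XML that I want to retrieve values for. I have revised my post to give more detail on the XML. –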

Scraping Google is against their ToS. – hrbrmstr

library(rvest) 
library(magrittr) 

url <- "http://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=GVN Powell World Wildlife Fund" 
xpath = "//*[@id=\"gsc_ccl\"]/div[1]/div[2]/h3/a/span" 

gvn.powell <- url %>% 
    read_html %>% 
    html_nodes(xpath = xpath) %>% 
    html_text 

gvn.powell 
#[1] "GVN Powell"
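
To also get the value after user=, here is a minimal sketch using the XML package from the question. It assumes the id only appears inside the href of the <a> under each h3 (as in the snippet posted above), so it reads that attribute and trims it with a regular expression. The inline HTML string and the names doc, auth.names, hrefs and user.ids are stand-ins for illustration, not the live page.

```r
library(XML)

# Inline sample mirroring one search-result entry from the question
# (assumption: the live page keeps this h3 > a structure)
html <- '<div class="gsc_1usr gs_scl">
  <div class="gsc_1usr_text">
    <h3 class="gsc_1usr_name"><a href="/citations?user=QnVgFlYAAAAJ&amp;hl=en&amp;oe=ASCII">Konrad Wrzecionkowski</a></h3>
  </div>
</div>'

doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Author names: text content of the <a> inside each name heading
auth.names <- xpathSApply(doc, "//h3[@class='gsc_1usr_name']/a", xmlValue)

# "user" is not an attribute of <h3>; it is part of the href on the child <a>
hrefs <- xpathSApply(doc, "//h3[@class='gsc_1usr_name']/a", xmlGetAttr, "href")

# Keep only what follows "user=" up to the next "&"
user.ids <- sub(".*[?&]user=([^&]+).*", "\\1", hrefs)

data.frame(name = auth.names, user = user.ids, stringsAsFactors = FALSE)
```

With the parsed page from the question, the same two xpathSApply calls should work on x in place of doc, returning one name and one id per entry.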

Source

2017-08-04 00:30:40
