Extracting titles from links in R

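I'm practicing web scraping with the rvest package in R. So far this page has been a good guide: http://zevross.com/blog/2015/05/19/scrape-website-data-with-the-new-r-package-rvest/. Using the Selector Gadget tool, I can identify the class or div reference of the items I want (as far as I understand it).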

So I went to Wikipedia and tried to extract the list of US presidents. The page is https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States. Selector Gadget tells me the element class/div/? (I'm not sure what to call it) is "big".

Here is my code so far:

library(rvest)
# Download and parse the page, then select every <a> inside a <big> element
site = read_html("https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States")
fnames = html_nodes(site, "big a")

And part of the output is:

{xml_nodeset (44)} 
[1] <a href="/wiki/George_Washington" title="George Washington">George Washington</a> 
[2] <a href="/wiki/John_Adams" title="John Adams">John Adams</a> 
[3] <a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a> 
[4] <a href="/wiki/James_Madison" title="James Madison">James Madison</a> 
[5] <a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a> 
[6] <a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">John Quincy Adams</a> 
[7] <a href="/wiki/Andrew_Jackson" title="Andrew Jackson">Andrew Jackson</a> 
[8] <a href="/wiki/Martin_Van_Buren" title="Martin Van Buren">Martin Van Buren</a>

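Great! So I've extracted the link elements that contain the names. I only want the names themselves, though, and I'm not sure how to proceed from here. Is there an easy way to get the name that sits between the link's HTML tags? Or should I use html_nodes to grab a different element? I feel like I'm close!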

Thanks for your help.

2016-06-07 user137698

'html_text(fnames)' should do it. – cory

Mind blown. That works! Thank you so much!!! – user137698

Or... 'html_attr(fnames, "title")' – cory

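There are two sources for the name: the title attribute and the link text. They might be formatted slightly differently, or one might include a middle initial or something similar. Use whichever you like best.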

html_attr(fnames, "title")

html_text(fnames)
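To see the difference between the two calls without hitting Wikipedia, here is a minimal sketch that runs on a small inline HTML snippet. The snippet is made up for illustration (in particular, the third link's shortened text "J. Q. Adams" is a hypothetical case, not something taken from the real page), and the variable names are my own.

```r
library(rvest)

# Hypothetical sample markup standing in for the Wikipedia table cells.
# read_html() accepts a literal HTML string as well as a URL.
page <- read_html('
<html><body>
  <big><a href="/wiki/George_Washington" title="George Washington">George Washington</a></big>
  <big><a href="/wiki/John_Adams" title="John Adams">John Adams</a></big>
  <big><a href="/wiki/John_Quincy_Adams" title="John Quincy Adams">J. Q. Adams</a></big>
</body></html>')

# "big a" is a CSS selector: every <a> nested inside a <big> element.
links <- html_nodes(page, "big a")

# Source 1: the text between <a ...> and </a>
name_text <- html_text(links)

# Source 2: the value of the title attribute
name_title <- html_attr(links, "title")

# Positions where the two sources disagree
differs <- which(name_text != name_title)

print(name_text)
print(name_title)
print(differs)
```

In this made-up sample only the third link differs, which is the kind of case where you would have to decide which source you prefer. The same html_attr() call with "href" instead of "title" would return the link targets.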

2016-06-07 19:13:30 cory
