2017-02-13
0

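Converting a Facebook htm file to a data frame in R

I am trying to extract my Facebook chat messages from an .htm file into a proper data frame. rvest has served me well for pulling the HTML nodes (user, meta, p) into vectors and then into a data frame. However, I am stuck on this part: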

<div class="thread"> 
    John, My Name
    <div class="message"> 
     <div class="message_header"> 
      <span class="user">My Name</span> 
      <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span> 
     </div> 
    </div> 
    <p>Hello, how are you today</p> 


//Other <div class = "message"> 
//Other <div class = "thread"> 

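The "thread" class marks a conversation I had with someone, and the "message" class holds each message. Sometimes the "user" class only shows "My Name" rather than "John" or "Jack", so I need to extract the string "John, My Name" as another variable and ignore all the text inside the nested "message" classes that follow.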

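I suspect this is a job for regular expressions, which I am hopeless at. I also tried passing an XPath to html_nodes, but /html/body/div[**x**]/div[**y**]/div[**z**]/text() does not let me change the path dynamically to read every thread class (x, y and z vary, and it is a 160 MB htm file).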

Any help is appreciated!

Edit: my code:

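library(rvest) 
library(XML) 
url <- read_html("messages.htm") 

users <- html_nodes(x = url, css = ".user") %>% html_text() 
date <- html_nodes(x = url, css = ".meta") %>% html_text() 
#Repeat for the other fields (e.g. the <p> message text) 

#cbind() would give a character matrix, so build the data frame directly 
df <- data.frame(users, date, stringsAsFactors = FALSE) 

#Extracting the names of the thread with xpath 
threadget <- function(n){ 
    html_text(html_node(url, xpath = sub("n", n, "/html/body/div[2]/div/div[n]/text()"))) 
} 
#553 is the number of threads in my file 
thread <- character(553) 
for (n in seq_len(553)) { thread[n] <- threadget(n) } 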
+2

Welcome to StackOverflow! Please make sure to edit your question to show the code you have tried. Can't you target the div with class "thread" and use `html_nodes(example, xpath = '*//div[@class="thread"]/text()[1]')` directly? – Jota
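
A minimal sketch of that suggestion on a small inline sample, assuming the export follows the structure shown in the question. The sample HTML and the names `sample_html`, `doc` and `thread_names` are made up for illustration:

```r
library(rvest)

# Hypothetical sample mimicking the structure shown in the question
sample_html <- '
<html><body><div class="contents">
  <div class="thread">John, My Name
    <div class="message"><div class="message_header">
      <span class="user">My Name</span>
      <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
    </div></div>
    <p>Hello, how are you today</p>
  </div>
  <div class="thread">Jack, My Name
    <div class="message"><div class="message_header">
      <span class="user">Jack</span>
      <span class="meta">Friday, April 10, 2015 at 1:05am UTC+07</span>
    </div></div>
    <p>Hi</p>
  </div>
</div></body></html>'

doc <- read_html(sample_html)

# text()[1] selects the first text node that is a direct child of each
# thread div (the participant list); text inside the nested message divs
# is not a direct child, so it is skipped
thread_names <- html_nodes(doc, xpath = '//div[@class="thread"]/text()[1]') %>%
  html_text() %>%
  trimws()

print(thread_names)
```

Because the expression matches on the class attribute rather than on a position such as div[n], it should not need to be rebuilt for each thread.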

+0

Thanks @Jota! That is exactly what I needed. Do you have any suggestions on what to read about XPath? – huydinh282

+0

You can have a look at https://www.tutorialspoint.com/xpath/ and https://www.w3schools.com/xml/xpath_intro.asp – Jota

Answer

0

Here is my code implementing @Jota's suggestion:

#Finding the length of each thread for looping using html_children() and length() 
#lapply() always returns a list, so the counts stay correct even when every 
#thread has the same number of children; "threads" avoids masking base::list 
threads <- html_nodes(url, css = ".thread") 
children <- lapply(threads, html_children) 
threadlength <- sapply(children, length) 
#Extracting the names of the thread using xpath 
threadlist <- html_nodes(url, xpath = '*//div[@class = "thread"]/text()[1]') %>% html_text() 

#Creating the thread column 
#x is the number of rows over which a thread topic should be repeated. 
#y is the index used to fill the thread column. 
#z counts the rows filled for the current thread and ends the inner loop, 
#moving on to the next thread topic 
thread <- c() 
n <- 0 
y <- 0 
for (x in threadlength) { 
    z <- 0 
    n <- n + 1 
    #a thread with no children adds no rows (and would otherwise never break) 
    if (x == 0) next 
    repeat { 
        y <- y + 1 
        z <- z + 1 
        thread[y] <- threadlist[n] 
        if (z == x) { 
            break 
        } 
    } 
}
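
The loop above repeats each thread name once per child node of its thread. As a hedged alternative sketch, base R's `rep()` can do the same in a single call. The two vectors below are made-up stand-ins for `threadlist` and `threadlength`, not values from the real file:

```r
# Made-up stand-ins for the vectors built in the answer
threadlist   <- c("John, My Name", "Jack, My Name", "Jane, My Name")
threadlength <- c(2, 0, 3)   # number of rows each thread name should fill

# Repeat each name by its own count; a count of 0 simply yields no rows
thread <- rep(threadlist, times = threadlength)

print(thread)
```

One thing that may be worth checking: `html_children()` counts every child element, so if each message contributes both a `<div class="message">` and a `<p>`, the thread column could come out twice as long as the `users` vector. Assuming that layout, counting only the message nodes per thread, for example with `sapply(html_nodes(url, ".thread"), function(t) length(html_nodes(t, ".message")))`, might line up better with the other columns.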