Converting a Facebook htm file into a data frame in R

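I am trying to extract my Facebook chat messages from an .htm file into a proper data frame. rvest has served me well for pulling the HTML nodes (user, meta, p) into vectors and then into a data frame. However, I am stuck on this part: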

<div class="thread"> 
    John, My Name
    <div class="message"> 
     <div class="message_header"> 
      <span class="user">My Name</span> 
      <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span> 
     </div> 
    </div> 
    <p>Hello, how are you today</p> 


//Other <div class = "message"> 
//Other <div class = "thread">

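The "thread" class marks my conversation with one person, and each "message" holds one message. Sometimes the "user" class only shows "My Name" rather than "John" or "Jack", so I need to extract the string "John, My Name" as a separate variable and ignore all the text inside the nested "message" classes that follow it.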

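I suspect this needs a regular expression that is beyond my skills. I also tried passing an XPath to html_nodes, but /html/body/div[x]/div[y]/div[z]/text() does not let me change the path dynamically to read every thread class (x, y and z vary, and the htm file is 160 MB).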

Any help is appreciated!

Edit: my code:

library(rvest)

# Parse the exported Facebook messages file
url <- read_html("messages.htm")

users <- html_nodes(x = url, css = ".user") %>% html_text()
date <- html_nodes(x = url, css = ".meta") %>% html_text()
# Repeat for the other fields

df <- data.frame(users, date, stringsAsFactors = FALSE)  # add the other vectors here

# Extract the names of the threads with XPath
threadget <- function(n) {
  html_text(html_node(url, xpath = sub("n", n, "/html/body/div[2]/div/div[n]/text()")))
}
thread <- character(553)  # initialise the result before filling it
for (n in seq_len(553)) thread[n] <- threadget(n)

Source: 2017-02-13, huydinh282

Welcome to Stack Overflow! Please make sure to edit your question to show the code you have tried. Couldn't you target the div with class "thread" and use html_nodes(example, xpath = '*//div[@class="thread"]/text()[1]') directly? – Jota

Thanks @Jota! That is exactly what I needed. Do you have any suggestions for what to read about XPath? – huydinh282

You can take a look at https://www.tutorialspoint.com/xpath/ and https://www.w3schools.com/xml/xpath_intro.asp – Jota
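For readers who want to try the XPath from the comments, here is a minimal sketch on a small inline sample. The sample HTML, the second thread ("Jack, My Name"), and the names `sample_html`, `doc` and `thread_names` are assumptions made up for illustration; only the overall structure follows the snippet in the question. The expression is written as `//div[...]` here, a slight variation on the `*//div[...]` form suggested above.

```r
library(rvest)

# Hypothetical two-thread sample that mimics the structure shown in the question
sample_html <- '<html><body><div class="contents"><div>
<div class="thread">John, My Name
  <div class="message"><div class="message_header">
    <span class="user">My Name</span>
    <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
  </div></div>
  <p>Hello, how are you today</p>
</div>
<div class="thread">Jack, My Name
  <div class="message"><div class="message_header">
    <span class="user">Jack</span>
    <span class="meta">Friday, April 10, 2015 at 1:05am UTC+07</span>
  </div></div>
  <p>Hi</p>
</div>
</div></div></body></html>'

doc <- read_html(sample_html)

# text()[1] selects the first text node directly under each thread div,
# i.e. the participant list, and skips text inside the nested elements
thread_names <- html_nodes(doc, xpath = '//div[@class="thread"]/text()[1]') %>%
  html_text() %>%
  trimws()

print(thread_names)
```

Because the path is anchored on the class attribute rather than on positions such as div[2]/div/div[n], it should not need the x, y, z indices at all, provided the class attribute is exactly "thread".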

Here is my code implementing @Jota's suggestion:

# Find the length of each thread (its number of child nodes) for looping
threads <- html_nodes(url, css = ".thread")
children <- lapply(threads, html_children)
threadlength <- sapply(children, length)
# Extract the names of the threads using XPath
threadlist <- html_nodes(url, xpath = '*//div[@class = "thread"]/text()[1]') %>% html_text()

# Create the thread column
# x is how many rows a thread name should be repeated for.
# y indexes the current position in the thread column.
# z counts rows within the current thread and ends the inner loop.
thread <- c()
n <- 0
y <- 0
for (x in threadlength) {
  z <- 0
  n <- n + 1
  repeat {
    y <- y + 1
    z <- z + 1
    thread[y] <- threadlist[n]
    if (z == x) {
      break
    }
  }
}
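The loop above repeats each thread name once per row. An alternative sketch, not the approach used in this answer, is to build one small data frame per thread and bind them together, so the thread name is attached to its messages directly. The helper name `thread_to_df` and the inline sample are hypothetical, and the sketch assumes every message carries exactly one `.user` and one `.meta` span.

```r
library(rvest)

# Hypothetical helper: turn one <div class="thread"> node into a data frame
thread_to_df <- function(thread_node) {
  # First text node directly under the thread div (the participant list)
  name  <- trimws(html_text(html_node(thread_node, xpath = "./text()[1]")))
  users <- html_text(html_nodes(thread_node, css = ".user"))
  date  <- html_text(html_nodes(thread_node, css = ".meta"))
  data.frame(thread = rep(name, length(users)), users = users, date = date,
             stringsAsFactors = FALSE)
}

# Made-up sample: one thread with two messages, one thread with one message
doc <- read_html('<html><body>
<div class="thread">John, My Name
  <div class="message"><div class="message_header">
    <span class="user">My Name</span>
    <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
  </div></div><p>Hello, how are you today</p>
  <div class="message"><div class="message_header">
    <span class="user">John</span>
    <span class="meta">Thursday, April 9, 2015 at 12:57am UTC+07</span>
  </div></div><p>Fine, thanks</p>
</div>
<div class="thread">Jack, My Name
  <div class="message"><div class="message_header">
    <span class="user">Jack</span>
    <span class="meta">Friday, April 10, 2015 at 1:05am UTC+07</span>
  </div></div><p>Hi</p>
</div>
</body></html>')

threads <- html_nodes(doc, css = ".thread")
df <- do.call(rbind, lapply(threads, thread_to_df))
print(df)
```

If threadlist and threadlength have already been computed as above, rep(threadlist, times = threadlength) should produce the same vector as the loop, assuming every thread has at least one child node.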

Source: 2017-02-14 03:10:28, huydinh282
