2017-02-15 176 views
-1

I want to extract the content of a .docx file chapter by chapter. My .docx document has a register (table of contents), and every chapter has some content:

1. Intro 
    some text about Intro, these things, those things 
2. Special information 
    these information are really special 
    2.1 General information about the environment 
     environment should be also important 
    2.2 Further information 
     and so on and so on 

So in the end it would be great to get an N×3 matrix containing the index number, the index name, and the content.

i_number  i_name               content
1         Intro                some text about Intro, these things, those things
2         Special Information  these information are really special
...

Thanks for your help.

+0

Would an R or a Python solution suit you?

+0

Preferably in R, but Python is also possible.

Answer

0

You can export or copy-paste your .docx into a .txt file and apply this R script:

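library(stringr)
library(readr)

# Read the exported text file into one string
doc <- read_file("filename.txt")

# A heading is a number and a dot followed by 4 to 100 characters up to the line break;
# \r?\n accepts both Windows (CRLF) and Unix (LF) line endings
pattern_chapter <- regex("(\\d+\\.)(.{4,100}?)(?:\\r?\\n)", dotall = TRUE)

# Full heading matches, with the trailing line break trimmed off
i_name <- str_trim(str_match_all(doc, pattern_chapter)[[1]][, 1])

# Text between headings; drop empty pieces such as the one before the first heading
# (logical indexing also works when there is no empty piece, unlike -which(...))
paragraphs <- str_split(doc, pattern_chapter)[[1]]
content <- paragraphs[paragraphs != ""]

result <- data.frame(i_name, content, stringsAsFactors = FALSE)
result$i_number <- seq.int(nrow(result))  # running count, not the number from the heading

View(result)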
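
Since the asker said Python would also be acceptable, here is a rough Python counterpart of the R script, offered as a sketch rather than a definitive implementation. It assumes the document has already been exported to plain text and that every heading sits on its own line and starts with a dotted number such as `1.` or `2.1`. The names `HEADING` and `split_chapters` are made up for this illustration. Unlike the R script, it anchors the pattern at the start of a line and keeps the number found in the text as `i_number`.

```python
import re

# Assumption: a heading is a line that starts (after optional indentation)
# with a dotted number such as "1." or "2.1", followed by a title that
# begins with a letter.
HEADING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+([^\W\d_].*?)\s*$")


def split_chapters(text):
    """Return a list of (i_number, i_name, content) tuples."""
    rows = []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Start a new chapter: number, title, and an empty body.
            rows.append([match.group(1), match.group(2), []])
        elif rows and line.strip():
            # Any other non-empty line belongs to the current chapter.
            rows[-1][2].append(line.strip())
    return [(num, name, " ".join(body)) for num, name, body in rows]


sample = """1. Intro
    some text about Intro, these things, those things
2. Special information
    these information are really special
    2.1 General information about the environment
        environment should be also important
    2.2 Further information
        and so on and so on
"""

rows = split_chapters(sample)
for i_number, i_name, content in rows:
    print(i_number, i_name, content, sep=" | ")
```

Each tuple corresponds to one row of the N×3 result the question asks for. Text that appears before the first heading is ignored.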

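This script will not work if your document contains any kind of line that starts with a number but is not a heading (for example, footnotes or numbered lists).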
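
One possible way around the limitation above is to rely on Word's paragraph styles instead of leading numbers, and to read the .docx directly rather than exporting it to text. The sketch below uses only the Python standard library. It assumes the file is a regular (transitional) .docx whose headings use built-in heading styles with IDs starting with `Heading` (as in English-language Word; localized or custom templates may use other IDs). The function names and the `heading_prefix` parameter are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_paragraphs(docx_path):
    """Yield (style_id, text) for every paragraph in the document."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(W + "p"):
        style = para.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(W + "val") if style is not None else None
        # A paragraph's text is spread over one or more <w:t> elements.
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        yield style_id, text


def chapters_by_style(docx_path, heading_prefix="Heading"):
    """Return (i_number, i_name, content) tuples, one per heading paragraph."""
    rows = []
    for style_id, text in read_paragraphs(docx_path):
        if style_id and style_id.startswith(heading_prefix):
            rows.append([text.strip(), []])
        elif rows and text.strip():
            rows[-1][1].append(text.strip())
    # Assumption: i_number is a running count, because automatically
    # generated heading numbers are not stored in the paragraph text.
    return [(i, name, " ".join(body))
            for i, (name, body) in enumerate(rows, start=1)]
```

A call such as `chapters_by_style("report.docx")` (the file name is only a placeholder) would then return one tuple per heading, and numbered list items or footnote-like lines inside a chapter end up in its content instead of being mistaken for headings.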

(Please do not downvote blindly: this script works perfectly on the example given.)