2017-07-26 34 views

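Parsing XML of ancient Greek plays with speakers and dialogue

I am currently trying to read Greek plays, which are available online as XML files, into a data frame with a dialogue column and a speaker column. I run the following commands to download the XML and parse out the dialogue and the speakers.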

library(XML)
library(RCurl)

# Download the play from Perseus, following any redirects
url <- "http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A1999.01.0186"
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText = TRUE)

# Extract every <p> (dialogue) and every <speaker> independently
plain.text <- xpathSApply(doc, "//p", xmlValue)
speakersc <- xpathSApply(doc, "//speaker", xmlValue)

# One data frame per extraction; their row counts are what differ
dialogue <- data.frame(text = plain.text, stringsAsFactors = FALSE)
speakers <- data.frame(text = speakersc, stringsAsFactors = FALSE)

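However, I then ran into a problem. The dialogue query yields 300 rows (for 300 distinct lines in the play), but the speaker query yields only 297. The cause is the structure of the XML, reproduced below: the <speaker> tag is not repeated when a speaker's dialogue continues after being interrupted by a stage direction. Because I have to split the dialogue on the <p> tag, this produces two dialogue rows but only one speaker row, and the speaker is not duplicated to match.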

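<sp>
  <speaker>Creon</speaker>
  <stage>To the Guard.</stage>
  <p>
    You can take yourself wherever you please,
    <milestone n="445" unit="line" ed="p"/>
    free and clear of a heavy charge.
    <stage>Exit Guard.</stage>
  </p>
</sp>
<sp>
  <stage>To Antigone.</stage>
  <p>You, however, tell me - not at length, but briefly - did you know that an edict had forbidden this?</p>
</sp>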
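The mismatch can be reproduced on a small scale. The following is a minimal sketch, assuming a hand-written stand-in for the fragment above (the `frag` string, the `<body>` wrapper, and the variable names are illustrative and not part of the Perseus file); it runs the same two queries and also counts the `<sp>` blocks that have no `<speaker>` child.

```r
library(XML)

# Hypothetical stand-in for the fragment above, wrapped in a <body> root
frag <- "<body>
<sp><speaker>Creon</speaker><stage>To the Guard.</stage>
<p>You can take yourself wherever you please, free and clear of a heavy charge.</p></sp>
<sp><stage>To Antigone.</stage>
<p>You, however, tell me - not at length, but briefly - did you know that an edict had forbidden this?</p></sp>
</body>"

toy <- xmlParse(frag, asText = TRUE)

# The same two independent queries used on the full play
n_dialogue <- length(xpathSApply(toy, "//p", xmlValue))
n_speakers <- length(xpathSApply(toy, "//speaker", xmlValue))

# <sp> blocks that carry no <speaker> child explain the difference
n_orphans <- length(getNodeSet(toy, "//sp[not(speaker)]"))

print(c(dialogue = n_dialogue, speakers = n_speakers, orphans = n_orphans))
```

If the same pattern holds in the full file, the number of speaker-less `<sp>` blocks should account for the gap between the two row counts.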

How can I parse the XML so that the data correctly yields the same number of dialogue rows as corresponding speaker rows?

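For the example above, I would like the resulting data frame either to contain two rows for Creon, one for each piece of dialogue before and after the stage direction, or a single row that treats Creon's dialogue as one line and ignores the break caused by the stage direction.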

Thank you very much for your help.

Answers

1

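Consider using XPath's forward-looking following-sibling axis to find the next <p> tag when the speaker is empty, while iterating over each <sp>, which is the parent of both <speaker> and <p>: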

# ALL SP NODES
sp <- xpathSApply(doc, "//body/descendant::sp", xmlValue)

# ITERATE THROUGH EACH SP BY NODE INDEX TO CREATE LIST OF DFs
dfList <- lapply(seq_along(sp), function(i){
  data.frame(
    speakers = xpathSApply(doc, paste0("concat(//body/descendant::sp[",i,"]/speaker,'')"), xmlValue),
    dialogue = xpathSApply(doc, paste0("concat(//body/descendant::sp[",i,"]/speaker/following-sibling::p[1], ' ',
               //body/descendant::sp[position()=",i+1," and not(speaker)]/p[1])"), xmlValue)
  )
})

# ROW BIND LIST OF DFs AND SUBSET EMPTY SPEAKER/DIALOGUE
finaldf <- subset(do.call(rbind, dfList), speakers!="" & dialogue!="")

# SPECIFIC ROWS IN OP'S HIGHLIGHT 
finaldf[85,] 
#    speakers
# 85    Creon
#
#    dialogue
# 85 You can take yourself wherever you please,free and clear of a heavy
#    charge.Exit Guard. You, however, tell me - not at length, but
#    briefly - did you know that an edict had forbidden this?

finaldf[86,] 
# speakers          dialogue 
# 87 Antigone I knew it. How could I not? It was public. 
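The answer above merges the continued speech into one row. The question also allows the other outcome, one row per `<sp>` with the speaker repeated. The following is a minimal sketch of that variant, assuming a small hand-written document in place of the real play (`frag`, `toy`, `spk`, `dlg`, and `rows` are illustrative names); it fills a missing speaker from the previous block in plain R rather than in XPath.

```r
library(XML)

# Hypothetical two-block document mirroring the structure in the question
frag <- "<body>
<sp><speaker>Creon</speaker><stage>To the Guard.</stage>
<p>You can take yourself wherever you please, free and clear of a heavy charge.</p></sp>
<sp><stage>To Antigone.</stage>
<p>You, however, tell me - not at length, but briefly - did you know that an edict had forbidden this?</p></sp>
</body>"

toy <- xmlParse(frag, asText = TRUE)
sp_nodes <- getNodeSet(toy, "//sp")

# Speaker of each <sp>; NA when the block has no <speaker> child
spk <- sapply(sp_nodes, function(node) {
  s <- xpathSApply(node, "./speaker", xmlValue)
  if (length(s) == 0) NA_character_ else s[1]
})

# Carry the last seen speaker forward into speaker-less blocks
for (i in seq_along(spk)) {
  if (is.na(spk[i]) && i > 1) spk[i] <- spk[i - 1]
}

# Dialogue of each <sp>, joining multiple <p> children with a space
dlg <- sapply(sp_nodes, function(node) {
  paste(xpathSApply(node, "./p", xmlValue), collapse = " ")
})

rows <- data.frame(speaker = spk, dialogue = dlg, stringsAsFactors = FALSE)
print(rows$speaker)
```

Because every `<sp>` contributes exactly one row, the speaker and dialogue columns have the same length by construction. This assumes a speaker-less block always continues the immediately preceding speaker, which matches the example in the question but is not verified here for the whole file.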


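Thank you very much for your help. The code works perfectly and produces exactly what I needed, with one small modification: the }) had to be moved above the line that creates the finaldf object. Thanks a lot for your work! – jmlawler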

0

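Another option is to read the URL and make some changes before parsing the XML. In this case: replace the milestone tags with a space so that words do not run together, remove the stage tags, and then fix the sp nodes that have no speaker.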

x <- readLines(url) 
x <- gsub("<milestone[^>]*>", " ", x) # add space 
x <- gsub("<stage>[^>]*stage>", "", x) # no space 
x <- paste(x, collapse = "") 
x <- gsub("</p></sp><sp><p>", "", x) # fix sp without speaker 

Now the XML has the same number of sp and speaker tags.

doc <- xmlParse(x) 
summary(doc) 
        p        sp   speaker      div2 placeName
      299       297       297        51        25 ...
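The effect of the three substitutions can be checked on a few lines before running them on the whole file. The following is a minimal sketch, assuming three hand-written input lines that imitate the layout of the source (the vector `toy_x` and the counting helper `count_tag` are illustrative, not part of the answer); it applies the same regular expressions and then counts the opening tags.

```r
# Hypothetical lines imitating the structure shown in the question
toy_x <- c(
  "<sp><speaker>Creon</speaker><stage>To the Guard.</stage>",
  "<p>You can take yourself wherever you please,<milestone n=\"445\" unit=\"line\" ed=\"p\"/>free and clear of a heavy charge.<stage>Exit Guard.</stage></p></sp>",
  "<sp><stage>To Antigone.</stage><p>You, however, tell me.</p></sp>"
)

toy_x <- gsub("<milestone[^>]*>", " ", toy_x)      # milestone becomes a space
toy_x <- gsub("<stage>[^>]*stage>", "", toy_x)     # stage directions dropped
toy_x <- paste(toy_x, collapse = "")               # one long string
toy_x <- gsub("</p></sp><sp><p>", "", toy_x)       # merge speaker-less <sp> into the previous one

# Illustrative helper: number of literal matches of an opening tag
count_tag <- function(tag, s) lengths(regmatches(s, gregexpr(tag, s, fixed = TRUE)))

n_sp <- count_tag("<sp>", toy_x)
n_speaker <- count_tag("<speaker>", toy_x)
print(c(sp = n_sp, speaker = n_speaker))
```

One design point worth noting: the last substitution replaces the boundary with an empty string, so the two merged speeches are joined without a space. Using `" "` as the replacement instead would be a small variation if that spacing matters.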

Finally, get the sp nodes and parse the speakers and paragraphs.

sp <- getNodeSet(doc, "//sp") 
s1 <- sapply(sp, xpathSApply, ".//speaker", xmlValue) 
# collapse the 1 node with 2 <p> 
p1 <- lapply(sp, xpathSApply, ".//p", xmlValue) 
p1 <- trimws(sapply(p1, paste, collapse= " ")) 
speakers <- data.frame(speaker=s1, dialogue = p1) 

    speaker                 dialogue 
1 Antigone Ismene, my sister, true child of my own mother, do you know any evil o... 
2 Ismene To me no word of our friends, Antigone, either bringing joy or bringin... 
3 Antigone I knew it well, so I was trying to bring you outside the courtyard gat... 
4 Ismene Hear what? It is clear that you are brooding on some dark news.   
5 Antigone Why not? Has not Creon destined our brothers, the one to honored buri... 
6 Ismene Poor sister, if things have come to this, what would I profit by loose... 
7 Antigone Consider whether you will share the toil and the task.     
8 Ismene What are you hazarding? What do you intend?        
9 Antigone Will you join your hand to mine in order to lift his corpse?    
10 Ismene You plan to bury him—when it is forbidden to the city?  
... 
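The collapsing step is easiest to see on a document where one `<sp>` holds two `<p>` elements. The following is a minimal sketch, assuming a hand-written two-speaker document (the split of Ismene's line into two paragraphs is made up for illustration, and the `toy_` names are not from the answer); it applies the same calls and checks that speakers and dialogue stay aligned.

```r
library(XML)

# Hypothetical document: the second <sp> deliberately has two <p> children
toy <- xmlParse(
  "<body>
   <sp><speaker>Antigone</speaker><p>Consider whether you will share the toil and the task.</p></sp>
   <sp><speaker>Ismene</speaker><p>What are you hazarding?</p><p>What do you intend?</p></sp>
   </body>",
  asText = TRUE
)

toy_sp <- getNodeSet(toy, "//sp")

# One speaker per <sp>
toy_s1 <- sapply(toy_sp, xpathSApply, ".//speaker", xmlValue)

# All <p> values per <sp>, then joined into a single string per block
toy_p1 <- lapply(toy_sp, xpathSApply, ".//p", xmlValue)
toy_p1 <- trimws(sapply(toy_p1, paste, collapse = " "))

# Guard against misalignment before building the data frame
stopifnot(length(toy_s1) == length(toy_p1))
toy_df <- data.frame(speaker = toy_s1, dialogue = toy_p1, stringsAsFactors = FALSE)
print(toy_df)
```

The `stopifnot` line is an optional safeguard: if any `<sp>` were still missing a speaker, the two vectors would differ in length and the mismatch would surface immediately rather than being silently recycled.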

Your code worked really well for me - thanks for the solution! I appreciate your help. – jmlawler