2016-07-29 98 views

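R: combining vectors of different lengths

How can I combine vectors of different lengths into a data frame in R? An example is below. Each vector has either 7 or 9 elements; sourceVersion and device are the two extra ones. I want the data frame to include those columns, left blank or set to NA for the 7-element observations, as shown in the table below.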

I want the data in a data frame like this.

type         sourceName    sourceVersion device                           unit creationDate startDate  endDate   value 
HKQuantityTypeIdentifierFlightsClimbed Ryan Praskievicz iPhone 9.3.2   <<HKDevice: 0x15a4af3f0>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:9.3.2> count 6/2/2016 12:27 6/2/2016 12:09 6/2/2016 12:09 1 
HKQuantityTypeIdentifierStepCount  Ryan Praskievicz iPhone                                 count 10/2/2014 8:30 9/24/2014 15:07 9/24/2014 15:07 7 

This is what I tried.

library(XML) 

xmlstr <- '<?xml version="1.0" encoding="UTF-8"?> 
      <HealthData locale="en_US"> 
       <ExportDate value="2016-06-02 14:05:23 -0400"/> 
       <Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/> 
       <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/> <Record type="HKQuantityTypeIdentifierFlightsClimbed" sourceName="Ryan Praskievicz iPhone" sourceVersion="9.3.2" device="&lt;&lt;HKDevice: 0x15a4af3f0&gt;, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:9.3.2&gt;" unit="count" creationDate="2016-06-02 12:27:46 -0400" startDate="2016-06-02 12:09:29 -0400" endDate="2016-06-02 12:09:29 -0400" value="1"/> </HealthData>' 

xml <- xmlParse(xmlstr) 

recordAttribs <- xpathSApply(doc=xml, path="//HealthData/Record", xmlAttrs) 
df <- data.frame(t(recordAttribs)) 
df 

This is the output I get in the R console:

 X1 
      1 HKQuantityTypeIdentifierStepCount, Ryan Praskievicz iPhone, count, 2014-10-02 08:30:17 -0400, 2014-09-24 15:07:06 -0400, 2014-09-24 15:07:11 -0400, 7                                                                     
    X2 
1 HKQuantityTypeIdentifierFlightsClimbed, Ryan Praskievicz iPhone, 9.3.2, <<HKDevice: 0x15a4af3f0>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:9.3.2>, count, 2016-06-02 12:27:46 -0400, 2016-06-02 12:09:29 -0400, 2016-06-02 12:09:29 -0400, 1 
+0

Would [this](http://webcache.googleusercontent.com/search?q=cache:lPRvnOOSAgoJ:www.inside-r.org/packages/ cran/qpcR/docs/cbind.na +&cd = 4&hl = en&ct = clnk&gl = us) do what you are looking for? –

+0

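First, you are trying to bind rows with different numbers of columns, not columns with different numbers of rows. That said, won't you generally have a column-alignment problem? That is, if one row has fewer columns than another, how do you know which columns are missing unless you can somehow infer it from the data? – aichao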

+0

@aichao It appears that the same two are always missing: sourceVersion and device. – Warner

Answers

1

Here is one way to do this using sapply and lapply.

recordAttribs <- xpathSApply(doc=xml, path="//HealthData/Record", xmlAttrs) 

recordAttribs <- t(recordAttribs) 

Use sapply to get a TRUE/FALSE vector indicating whether each list element has length 7:

short.condition <- sapply(recordAttribs, function(x) length(x)==7) 

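Use lapply on the subset of the list that satisfies this condition, this time inserting two NAs into each short vector where sourceVersion and device would be:

recordAttribs[short.condition] <- lapply(recordAttribs[short.condition], 
             function(x) c(x[1:2],NA,NA,x[3:7])) 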

Convert it into a data.frame of the form you want:

df <- matrix(unlist(recordAttribs), 
      nrow=2,ncol=9, byrow=TRUE) 

df <- data.frame(df, stringsAsFactors=FALSE) 

names(df) <- c("type","sourceName","sourceVersion","device","unit","creationDate","startDate","endDate","value") 

which looks like this:

> str(df) 
'data.frame': 2 obs. of 9 variables: 
$ type   : chr "HKQuantityTypeIdentifierStepCount" "HKQuantityTypeIdentifierFlightsClimbed" 
$ sourceName : chr "Ryan Praskievicz iPhone" "Ryan Praskievicz iPhone" 
$ sourceVersion: chr NA "9.3.2" 
$ device  : chr NA "<<HKDevice: 0x15a4af3f0>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:9.3.2>" 
$ unit   : chr "count" "count" 
$ creationDate : chr "2014-10-02 08:30:17 -0400" "2016-06-02 12:27:46 -0400" 
$ startDate : chr "2014-09-24 15:07:06 -0400" "2016-06-02 12:09:29 -0400" 
$ endDate  : chr "2014-09-24 15:07:11 -0400" "2016-06-02 12:09:29 -0400" 
$ value  : chr "7" "1" 
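The positional approach above relies on the short records always missing the same two attributes. As a hedged alternative (not part of the original answer), the sketch below matches on attribute names instead, so it does not matter which attributes a record lacks. The `records` list is a trimmed-down stand-in for what `xpathSApply(..., xmlAttrs)` returns, and the `cols` vector is an assumed column order you would adapt to your own export.

```r
# Trimmed-down stand-in for the list that xpathSApply(..., xmlAttrs) returns
# when records carry different numbers of attributes (values from the question).
records <- list(
  c(type = "HKQuantityTypeIdentifierStepCount",
    sourceName = "Ryan Praskievicz iPhone", unit = "count", value = "7"),
  c(type = "HKQuantityTypeIdentifierFlightsClimbed",
    sourceName = "Ryan Praskievicz iPhone", sourceVersion = "9.3.2",
    unit = "count", value = "1")
)

# Desired column order; an assumption to adapt to the real data.
cols <- c("type", "sourceName", "sourceVersion", "unit", "value")

# Indexing a named vector by a name it lacks yields NA, so every record
# comes back with one slot per column, in the same order.
rows <- lapply(records, function(x) setNames(x[cols], cols))

# Stack the equal-length rows into a character matrix, then a data.frame.
df <- data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
str(df)
```

Because the lookup is by name, no record length (7 or 9) is hard-coded, and the number of rows follows from the length of the list.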
+0

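Thanks for the answer, but this is not exactly what I am looking for. I want the data in the data frame to look like the first table in my question, under "I want the data in a data frame like this." –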

+0

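@RyanPraskievicz I edited my answer to address this. It is not the prettiest solution. I am assuming that the 7-element observations are always missing the same two columns. – Warner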

+0

@RyanPraskievicz Made one more edit to put the output into a usable data.frame. – Warner

2

The dependency is a bit esoteric, but you can do this:

library(data.table) 
rbindlist(lapply(recordAttribs, function(x) data.table(t(x))), fill=TRUE) 

This returns a data.table, which inherits from data.frame:

         type    sourceName unit 
1:  HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count 
2: HKQuantityTypeIdentifierFlightsClimbed Ryan Praskievicz iPhone count 
       creationDate     startDate     endDate value 
1: 2014-10-02 08:30:17 -0400 2014-09-24 15:07:06 -0400 2014-09-24 15:07:11 -0400  7 
2: 2016-06-02 12:27:46 -0400 2016-06-02 12:09:29 -0400 2016-06-02 12:09:29 -0400  1 
    sourceVersion 
1:   NA 
2:   9.3.2 
                             device 
1:                           NA 
2: <<HKDevice: 0x15a4af3f0>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:9.3.2> 

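The reason I use data.table is that its rbind method matches columns by name rather than by position and, with fill=TRUE, allows rows of unequal length, filling the missing values with NA.

A simple example of how rbind.data.table works: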

d1 = data.table(a="foo", b = "bar", c = "baz") 
d2 = data.table(b="bar", a = "foo") 
rbind(d1, d2) # throws helpful error: "If instead you need to fill missing columns, use set argument 'fill' to TRUE." 
rbind(d1, d2, fill=TRUE) 
#  a b c 
# 1: foo bar baz 
# 2: foo bar NA 
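If the list is very large, building one small table per record may be the slow part. As a rough sketch of a different technique (plain base R, not the data.table approach above, and not benchmarked on real data), you could preallocate one character matrix and fill it in a single vectorized assignment, again matching on attribute names. The `records` list here is a trimmed-down stand-in for the real `recordAttribs`.

```r
# Trimmed-down stand-in for recordAttribs (values taken from the question).
records <- list(
  c(type = "HKQuantityTypeIdentifierStepCount", unit = "count", value = "7"),
  c(type = "HKQuantityTypeIdentifierFlightsClimbed", sourceVersion = "9.3.2",
    unit = "count", value = "1")
)

# All attribute names in record order, and the distinct set used as columns.
attr_names <- unlist(lapply(records, names))
cols <- unique(attr_names)

# Preallocate an all-NA character matrix: one row per record.
m <- matrix(NA_character_, nrow = length(records), ncol = length(cols),
            dimnames = list(NULL, cols))

# (row, column) coordinates for every attribute value, then one assignment.
row_idx <- rep(seq_along(records), lengths(records))
col_idx <- match(attr_names, cols)
m[cbind(row_idx, col_idx)] <- unlist(records, use.names = FALSE)

df <- as.data.frame(m, stringsAsFactors = FALSE)
df
```

Columns come out in order of first appearance; reorder with something like `df[, c("type", "sourceVersion", "unit", "value")]` if a specific layout is needed.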
+0

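This works, thank you very much! When I run 'df <-do.call(rbind,c(lapply(recordAttribs,function(x)data.table(t(x))),fill = TRUE))' where 'recordAttribs' is a large list (405677 elements, 464 MB), it takes a very long time. Any ideas on how to improve this for larger datasets? –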

+0

@RyanPraskievicz Please try the edited 'rbindlist' version above. If 'lapply' is really slowing you down, you may want to look at 'multicore::mclapply' – C8H10N4O2
