Combining vectors with different row lengths in R

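How can I combine vectors with different numbers of rows into a data frame in R? An example is below. Each vector has 7 or 9 rows; sourceVersion and device are the two extra rows. I want these included in the data frame, left blank or set to NA for the 7-row observations, as shown in the table below.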

I would like the data in a data frame like this.

type         sourceName    sourceVersion device                           unit creationDate startDate  endDate   value 
HKQuantityTypeIdentifierFlightsClimbed Ryan Praskievicz iPhone 9.3.2   <<HKDevice: 0x15a4af3f0>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:9.3.2> count 6/2/2016 12:27 6/2/2016 12:09 6/2/2016 12:09 1 
HKQuantityTypeIdentifierStepCount  Ryan Praskievicz iPhone                                 count 10/2/2014 8:30 9/24/2014 15:07 9/24/2014 15:07 7

Here is what I tried.

library(XML) 

xmlstr <- '<?xml version="1.0" encoding="UTF-8"?> 
      <HealthData locale="en_US"> 
       <ExportDate value="2016-06-02 14:05:23 -0400"/> 
       <Me HKCharacteristicTypeIdentifierDateOfBirth="" HKCharacteristicTypeIdentifierBiologicalSex="HKBiologicalSexNotSet" HKCharacteristicTypeIdentifierBloodType="HKBloodTypeNotSet" HKCharacteristicTypeIdentifierFitzpatrickSkinType="HKFitzpatrickSkinTypeNotSet"/> 
       <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Ryan Praskievicz iPhone" unit="count" creationDate="2014-10-02 08:30:17 -0400" startDate="2014-09-24 15:07:06 -0400" endDate="2014-09-24 15:07:11 -0400" value="7"/> <Record type="HKQuantityTypeIdentifierFlightsClimbed" sourceName="Ryan Praskievicz iPhone" sourceVersion="9.3.2" device="&lt;&lt;HKDevice: 0x15a4af3f0&gt;, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:9.3.2&gt;" unit="count" creationDate="2016-06-02 12:27:46 -0400" startDate="2016-06-02 12:09:29 -0400" endDate="2016-06-02 12:09:29 -0400" value="1"/> </HealthData>' 

xml <- xmlParse(xmlstr) 

recordAttribs <- xpathSApply(doc=xml, path="//HealthData/Record", xmlAttrs) 
df <- data.frame(t(recordAttribs)) 
df

This is the output I get in the R console:

 X1 
      1 HKQuantityTypeIdentifierStepCount, Ryan Praskievicz iPhone, count, 2014-10-02 08:30:17 -0400, 2014-09-24 15:07:06 -0400, 2014-09-24 15:07:11 -0400, 7                                                                     
    X2 
1 HKQuantityTypeIdentifierFlightsClimbed, Ryan Praskievicz iPhone, 9.3.2, <<HKDevice: 0x15a4af3f0>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:9.3.2>, count, 2016-06-02 12:27:46 -0400, 2016-06-02 12:09:29 -0400, 2016-06-02 12:09:29 -0400, 1

Source

2016-07-29 Ryan Praskievicz

Would [this](http://webcache.googleusercontent.com/search?q=cache:lPRvnOOSAgoJ:www.inside-r.org/packages/cran/qpcR/docs/cbind.na+&cd=4&hl=en&ct=clnk&gl=us) do what you are looking for? –

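First, you are trying to bind rows with different numbers of columns, not columns with different numbers of rows. That said, won't you generally have a column alignment problem? That is, if one row has fewer columns than another, how do you know which columns are missing unless you can somehow infer them from the data? – aichao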

@aichao It looks like the same two rows are missing each time: sourceVersion and device. – Warner
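On the alignment question: xmlAttrs returns a named character vector for each Record, so the attribute names themselves show which columns a shorter record lacks. A minimal sketch, using two shortened hypothetical vectors (short_rec and long_rec are illustrative names, not taken from the question's data):

```r
# Hypothetical, shortened stand-ins for two xmlAttrs() results.
short_rec <- c(type = "HKQuantityTypeIdentifierStepCount",
               unit = "count", value = "7")
long_rec  <- c(type = "HKQuantityTypeIdentifierFlightsClimbed",
               sourceVersion = "9.3.2", unit = "count", value = "1")

# Names present in the longer record but absent from the shorter one.
missing_cols <- setdiff(names(long_rec), names(short_rec))
print(missing_cols)
```

Because the lookup is by name, it does not depend on where in the record the missing attributes would have appeared.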

Here is one way to do it using sapply and lapply.

recordAttribs <- xpathSApply(doc=xml, path="//HealthData/Record", xmlAttrs) 

recordAttribs <- t(recordAttribs)

Use sapply to get a TRUE/FALSE vector indicating whether each list element has length 7:

short.condition <- sapply(recordAttribs, function(x) length(x)==7)

Then use lapply on the subset of the list that meets this condition, inserting two NAs into each vector that satisfies it:

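# Pad only the short records; applying lapply to the whole list would
# return more elements than there are positions to replace.
recordAttribs[short.condition] <- lapply(recordAttribs[short.condition], 
             function(x) c(x[1:2],NA,NA,x[3:7]))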

Convert it into the form you want, a data.frame:

df <- matrix(unlist(recordAttribs), 
      nrow=2,ncol=9, byrow=TRUE) 

df <- data.frame(df, stringsAsFactors=FALSE) 

names(df) <- c("type","sourceName","sourceVersion","device","unit","creationDate","startDate","endDate","value")

which looks like this:

> str(df) 
'data.frame': 2 obs. of 9 variables: 
$ type   : chr "HKQuantityTypeIdentifierStepCount" "HKQuantityTypeIdentifierFlightsClimbed" 
$ sourceName : chr "Ryan Praskievicz iPhone" "Ryan Praskievicz iPhone" 
$ sourceVersion: chr NA "9.3.2" 
$ device  : chr NA "<<HKDevice: 0x15a4af3f0>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:9.3.2>" 
$ unit   : chr "count" "count" 
$ creationDate : chr "2014-10-02 08:30:17 -0400" "2016-06-02 12:27:46 -0400" 
$ startDate : chr "2014-09-24 15:07:06 -0400" "2016-06-02 12:09:29 -0400" 
$ endDate  : chr "2014-09-24 15:07:11 -0400" "2016-06-02 12:09:29 -0400" 
$ value  : chr "7" "1"
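The positional padding above assumes that every short record is missing the same two attributes in the same place. Since each record is a named vector, one possible alternative is to align by name instead. The following is only a sketch in base R; records is a shortened, hypothetical stand-in for the list returned by xpathSApply, and all_names and rows are illustrative names:

```r
# Hypothetical stand-in for the list returned by xpathSApply(..., xmlAttrs):
# each element is a named character vector, and the sets of names differ.
records <- list(
  c(type = "StepCount", sourceName = "iPhone", unit = "count", value = "7"),
  c(type = "FlightsClimbed", sourceName = "iPhone", sourceVersion = "9.3.2",
    unit = "count", value = "1")
)

# Union of all attribute names, in order of first appearance.
all_names <- unique(unlist(lapply(records, names)))

# Indexing a named vector by a name it lacks yields NA, so every row
# comes back with the same length and the same column order.
rows <- lapply(records, function(x) unname(x[all_names]))

df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- all_names
str(df)
```

With this approach the columns appear in order of first appearance, so sourceVersion ends up last here; reorder with df[, c(...)] if a specific column order is needed.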

Source

2016-07-29 18:01:03 Warner

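Thanks for the answer, but this is not quite what I am looking for. I would like the data in a data frame like the first table in my question, under "I would like the data in a data frame like this." –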

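@RyanPraskievicz I edited my answer to address this. It is not the prettiest solution. I am assuming that your 7-row observations are always missing the same two columns. – Warner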

@RyanPraskievicz Made one more edit to put the output into a useful data.frame. – Warner

The dependency is a bit esoteric, but you can do this:

library(data.table) 
rbindlist(lapply(recordAttribs, function(x) data.table(t(x))), fill=TRUE)

This returns a data.table, which inherits from data.frame.

         type    sourceName unit 
1:  HKQuantityTypeIdentifierStepCount Ryan Praskievicz iPhone count 
2: HKQuantityTypeIdentifierFlightsClimbed Ryan Praskievicz iPhone count 
       creationDate     startDate     endDate value 
1: 2014-10-02 08:30:17 -0400 2014-09-24 15:07:06 -0400 2014-09-24 15:07:11 -0400  7 
2: 2016-06-02 12:27:46 -0400 2016-06-02 12:09:29 -0400 2016-06-02 12:09:29 -0400  1 
    sourceVersion 
1:   NA 
2:   9.3.2 
                             device 
1:                           NA 
2: <<HKDevice: 0x15a4af3f0>, name:iPhone, manufacturer:Apple, model:iPhone, hardware:iPhone8,1, software:9.3.2>

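The reason I use data.table is that it has a smart rbind method: with use.names=TRUE it matches columns by name rather than by position, and with fill=TRUE it allows rows of unequal length and fills the missing values with NA.

How rbind.data.table works, in a simple example: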

d1 = data.table(a="foo", b = "bar", c = "baz") 
d2 = data.table(b="bar", a = "foo") 
rbind(d1, d2) # throws helpful error: "If instead you need to fill missing columns, use set argument 'fill' to TRUE." 
rbind(d1, d2, fill=TRUE) 
#  a b c 
# 1: foo bar baz 
# 2: foo bar NA

Source

2016-07-29 18:27:36 C8H10N4O2

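This works, thanks very much! When I try to run 'df <- do.call(rbind, c(lapply(recordAttribs, function(x) data.table(t(x))), fill = TRUE))', where 'recordAttribs' is a large list (405677 elements, 464 MB), it takes a very long time to run. Any ideas on how to improve this for larger data sets? –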

@RyanPraskievicz Please try the edited 'rbindlist' above. If 'lapply' is really slowing you down, you might want to look at 'multicore::mclapply' – C8H10N4O2
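As a rough illustration of the mclapply suggestion: mclapply is now provided by the parallel package that ships with R. The sketch below runs the per-record step in parallel and combines the result once at the end. It is not a benchmark, and whether it helps depends on the data and machine. The records list, make_record, and the choice of 2 cores are all assumptions made for the example, and the per-record step is a base R name lookup rather than the data.table call from the answer.

```r
library(parallel)  # ships with R and provides mclapply

# Hypothetical generator standing in for a large list of xmlAttrs() results;
# every second record carries an extra sourceVersion attribute.
make_record <- function(i) {
  rec <- c(type = "StepCount", unit = "count", value = as.character(i))
  if (i %% 2 == 0) rec <- c(rec, sourceVersion = "9.3.2")
  rec
}
records <- lapply(1:1000, make_record)

# Union of all attribute names, in order of first appearance.
all_names <- unique(unlist(lapply(records, names)))

# Forking is unavailable on Windows, where mc.cores must be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Per-record work runs in parallel; names a record lacks become NA.
rows <- mclapply(records, function(x) unname(x[all_names]),
                 mc.cores = n_cores)

# Combine once at the end instead of growing a table piece by piece.
big_df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(big_df) <- all_names
```

For a list of the size mentioned in the comment, the single combine step at the end is likely to matter at least as much as the parallelism.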
