2016-07-26 80 views
1

R: How can I make an R loop faster?

I am trying to convert a nested JSON file into a data frame using the following function:

rf1 <- function(data) {
    master <- data.frame(
        id = character(0),
        awardAmount = character(0),
        awardStatus = character(0),
        tenderAmount = character(0)
    )
    for (i in 1:nrow(data)) {
        temp1 <- unlist(data$data$awards[[i]]$status)
        length <- length(temp1)
        temp2 <- rep(data$data$id[i], length)
        temp3 <- rep(data$data$value$amount[[i]], length)
        temp4 <- unlist(data$data$awards[[i]]$value[[1]])
        tempDF <- data.frame(
            id = temp2,
            awardAmount = temp4,
            awardStatus = temp1,
            tenderAmount = temp3
        )
        master <- rbind(master, tempDF)
    }
    return(master)
}

Here is an example of the JSON file I am working with:

{ 
    "data" : { 
     "id" : "3f066cdd81cf4944b42230ed56a35bce", 
     "awards" : [ 
      { 
       "status" : "unsuccessful", 
       "value" : { 
        "amount" : 76 
       } 
      }, 
      { 
       "status" : "active", 
       "value" : { 
        "amount" : 41220 
       } 
      } 
     ], 
     "value" : { 
      "amount" : 48000 
     } 
    } 
}, 
{ 
    "data" : { 
     "id" : "9507162e6ee24cef8e0ea75d46a81a30", 
     "awards" : [ 
      { 
       "status" : "active", 
       "value" : { 
        "amount" : 2650 
       } 
      } 
     ], 
     "value" : { 
      "amount" : 2650 
     } 
    } 
}, 
{ 
    "data" : { 
     "id" : "a516ac43240c4ec689f3392cf0c17575", 
     "awards" : [ 
      { 
       "status" : "active", 
       "value" : { 
        "amount" : 2620 
       } 
      } 
     ], 
     "value" : { 
      "amount" : 2650 
     } 
    } 
} 

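As you can see, the three observations have different numbers of awards (the first observation has two awards, while the other two have one each). Since I am looking for a table-view data frame, I fill the otherwise blank cells by repeating shared information such as data$id and data$value$amount.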

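The JSON file has about 100,000 observations, so it takes forever to return a data frame (I waited more than 30 minutes and still had no result). I think there may be a way to run all the temp lines at once, which should save a lot of time, but I am not sure how to implement that in my code.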

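To give you an idea of the output I am looking for, I restricted my function to for (i in 1:3), which produced the data frame below. My question is how to do the same thing for 100,000 observations. Note that the JSON example above corresponds to the sample output.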

Desired output:

Sample Output

+1

Use a JSON parsing package such as 'jsonlite', 'RJSONIO', or 'rjson'. – alistaire

+0

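@alistaire Thanks, but my JSON file is nested too deeply, so the packages do not do the job on their own. I actually used 'jsonlite', but it returns a data frame that is still in a semi-JSON format. I am looking for a classic table-view data frame. – Misha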

+0

In that case, you need to present your problem more clearly and provide sample data. Read: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – alistaire

Answers

1

This is by no means elegant, but it seems to work:

library(jsonlite) 
library(purrr) 
library(dplyr) 

json_data <- '[{"data":{"id":"3f066cdd81cf4944b42230ed56a35bce","awards":[{"status":"unsuccessful","value":{"amount":76}},{"status":"active","value":{"amount":41220}}],"value":{"amount":48000}}},{"data":{"id":"9507162e6ee24cef8e0ea75d46a81a30","awards":[{"status":"active","value":{"amount":2650}}],"value":{"amount":2650}}},{"data":{"id":"a516ac43240c4ec689f3392cf0c17575","awards":[{"status":"active","value":{"amount":2620}}],"value":{"amount":2650}}}] ' 

# parse original JSON records 
parsed_json_data <- fromJSON(json_data)$data 

# extract awards data, un-nest the nested parts, and re-assemble awards into a data frame for each id 
awards <- map2(
    .x = parsed_json_data$id,
    .y = parsed_json_data$awards,
    .f = function(x, y) {
        bind_cols(
            data.frame(id = rep(x, nrow(y)), stringsAsFactors = FALSE),
            as.data.frame(as.list(y))
        )
    }
)

# bind together the data frames over all ids 
awards <- 
    bind_rows(awards) %>% 
    rename(awards_status = status, awards_amount = amount) 

# remove awards data from original parsed data 
parsed_json_data$awards <- NULL 

# un-nest the remaining data structures 
parsed_json_data <- as.data.frame(as.list(parsed_json_data), stringsAsFactors = FALSE) 

# join higher-level data with awards data (in denormalisation process) 
final_data_frame <- inner_join(parsed_json_data, awards, by = 'id') 

final_data_frame 
#                                 id amount awards_status awards_amount
# 1 3f066cdd81cf4944b42230ed56a35bce  48000  unsuccessful            76
# 2 3f066cdd81cf4944b42230ed56a35bce  48000        active         41220
# 3 9507162e6ee24cef8e0ea75d46a81a30   2650        active          2650
# 4 a516ac43240c4ec689f3392cf0c17575   2650        active          2620
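
For comparison, here is a minimal base-R sketch of the same flattening without any loop-time rbind(). Calling rbind() inside a loop copies the accumulated data frame on every iteration, which is the likely reason the original function slows down so badly on 100,000 records. The sketch assumes the records are available as a plain nested list (the shape that fromJSON(txt, simplifyVector = FALSE) would return). The `records` list is typed in by hand from the question's sample, and `flatten_awards` is a hypothetical helper name, not part of any package.

```r
# Hand-built records mirroring the question's sample JSON (an assumption:
# this is the nested-list shape, not jsonlite's simplified data frame).
records <- list(
  list(data = list(
    id = "3f066cdd81cf4944b42230ed56a35bce",
    awards = list(
      list(status = "unsuccessful", value = list(amount = 76)),
      list(status = "active", value = list(amount = 41220))
    ),
    value = list(amount = 48000)
  )),
  list(data = list(
    id = "9507162e6ee24cef8e0ea75d46a81a30",
    awards = list(list(status = "active", value = list(amount = 2650))),
    value = list(amount = 2650)
  )),
  list(data = list(
    id = "a516ac43240c4ec689f3392cf0c17575",
    awards = list(list(status = "active", value = list(amount = 2620))),
    value = list(amount = 2650)
  ))
)

# Hypothetical helper: build every column once, then repeat the
# tender-level fields by the number of awards in each record.
flatten_awards <- function(records) {
  n_awards <- vapply(records, function(r) length(r$data$awards), integer(1))
  ids <- vapply(records, function(r) r$data$id, character(1))
  tender <- vapply(records, function(r) as.numeric(r$data$value$amount), numeric(1))
  status <- unlist(lapply(records, function(r) {
    vapply(r$data$awards, function(a) a$status, character(1))
  }))
  award <- unlist(lapply(records, function(r) {
    vapply(r$data$awards, function(a) as.numeric(a$value$amount), numeric(1))
  }))
  data.frame(
    id = rep(ids, n_awards),
    awardAmount = award,
    awardStatus = status,
    tenderAmount = rep(tender, n_awards),
    stringsAsFactors = FALSE
  )
}

result <- flatten_awards(records)
```

The data frame is constructed exactly once, so the cost grows roughly linearly with the number of records instead of being dominated by repeated copying.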
+0

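Thank you so much! It works on my dataset, and it is very readable and clean. I could not ask for more! By the way, running your code took only 57.034 seconds, which is very fast for R and such a large file. Thanks again! – Misha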

+0

Thanks @Misha, my pleasure, but most of the credit goes to Hadley Wickham and co. for writing dplyr and purrr. –

1

Another approach is to take the work out of R and restructure your MongoDB query.

If this is your data in MongoDB

enter image description here

In the mongo shell, you can write a query along the lines of

db.json.aggregate([ 
     { "$unwind" : "$data.awards"}, 
     { "$group" : { 
      "_id" : {"id" : "$data.id", "status" : "$data.awards.status"}, 
      "awardAmount" : { "$sum" : "$data.awards.value.amount" }, 
      "tenderAmount" : { "$sum" : "$data.value.amount" } 
      } 
     }, 
     { "$project" : { 
       "id" : "$_id.id", 
       "status" : "$_id.status", 
       "awardAmount" : "$awardAmount", 
       "tenderAmount" : "$tenderAmount", 
       "_id" : 0} } 
    ]) 

(A note on the query: I am not a MongoDB expert, so there may be a slightly more concise way of writing it.)

You can also run it from R with

library(mongolite) 
mongo <- mongo(collection = "json", db = "test") 

qry <- '[ 
        { "$unwind" : "$data.awards"}, 
        { "$group" : { 
           "_id" : {"id" : "$data.id", "status" : "$data.awards.status"}, 
           "awardAmount" : { "$sum" : "$data.awards.value.amount" }, 
           "tenderAmount" : { "$sum" : "$data.value.amount" } 
          } 
        }, 
        { "$project" : { 
           "id" : "$_id.id", 
           "status" : "$_id.status", 
           "awardAmount" : "$awardAmount", 
           "tenderAmount" : "$tenderAmount", 
           "_id" : 0} 
          } 
        ]' 

df <- mongo$aggregate(pipeline = qry) 
df 
#   awardAmount tenderAmount                               id       status
# 1        2620         2650 a516ac43240c4ec689f3392cf0c17575       active
# 2       41220        48000 3f066cdd81cf4944b42230ed56a35bce       active
# 3        2650         2650 9507162e6ee24cef8e0ea75d46a81a30       active
# 4          76        48000 3f066cdd81cf4944b42230ed56a35bce unsuccessful
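
One thing to keep in mind with this pipeline is that it groups by id and status and sums both amounts. If a tender has two awards with the same status, the tender amount will therefore be added twice. The following base-R sketch mimics the `$group` stage with aggregate() on a made-up, already unwound data frame (the ids "a" and "b" and all amounts are invented for illustration) to show the effect.

```r
# Invented, already "unwound" rows: tender "a" has two awards that share
# the status "active"; tender "b" has a single award.
unwound <- data.frame(
  id = c("a", "a", "b"),
  status = c("active", "active", "active"),
  awardAmount = c(10, 20, 5),
  tenderAmount = c(100, 100, 50),
  stringsAsFactors = FALSE
)

# Group by id and status and sum both amount columns, as the $group stage does.
grouped <- aggregate(
  cbind(awardAmount, tenderAmount) ~ id + status,
  data = unwound,
  FUN = sum
)
```

In this toy data, tender "a" ends up with a summed tenderAmount of 200 rather than 100. With the sample data from the question this does not occur, because no tender has two awards with the same status.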
+0

Thanks @SymbolixAU! It works, and I will look into how to make my query more efficient. – Misha

+0

Glad I could help. The best practice on StackOverflow is to upvote the answers you find most useful :) – SymbolixAU

+0

@SymbolixAU I wish I could, but I do not have enough reputation to upvote answers yet :( – Misha

1

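This may be the most straightforward approach. It does not use JSON parsing, but relies on a bunch of regular expressions.

However, I agree with SymbolixAU that doing this in the mongo query is the way to go.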

# load json file ("file.json") just as a single string/single-element character vector 
jsonAsString <- readChar("file.json", file.info("file.json")$size) 

# chunk the tenders 
dataChunks <- unlist(strsplit(jsonAsString, '"data" : \\{')) 
dataChunks <- dataChunks[grepl("id", dataChunks)]  # this removes the unnecessary header 

# get the award subchunks 
awardSubChunks <- gsub('.*("awards".*?}.*?}.*?]).*', "\\1", dataChunks) 

# scrape status values out of the award subchunks
statusIndex <- gregexpr('(?<="status" : ")([[:alnum:]]+)', awardSubChunks, perl = TRUE)
status <- unlist(regmatches(awardSubChunks, statusIndex))

# scrape bidAmount value out of the award subchunks
# (the lookbehind includes the space after the colon; otherwise the match is empty)
bidAmountIndex <- gregexpr('(?<="amount" : )([[:alnum:]]+)', awardSubChunks, perl = TRUE)
bidAmount <- unlist(regmatches(awardSubChunks, bidAmountIndex))

# get the id and tender by removing the award subchunks 
idTenderAmount <- gsub('"awards".*?}.*?}.*?]', "", dataChunks) 

    # scrape id and tenderAmount values 
id <- gsub('.*"id" : "([[:alnum:]]*)".*', "\\1", idTenderAmount) 
tenderAmount <- gsub('.*"amount" : ([[:alnum:]]*).*', "\\1", idTenderAmount) 

# find the number of bids per Id in order to find number of times id and tenderAmount needs to be repeated 
numBidsPerId <- gregexpr("value", awardSubChunks) 
numBidsTotal <- sapply(numBidsPerId, length) 

# putting things together 
df <- data.frame(id = rep(id, numBidsTotal), 
       tenderAmount = rep(tenderAmount, numBidsTotal), 
       status = status, 
       bidAmount = bidAmount)
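
The lookbehind extraction used in this answer can be tried on a single inline string before pointing it at a large file. The sketch below is only an illustration: `chunk` is a hand-written string that imitates the spacing of the sample JSON, and the patterns assume exactly one space on each side of the colon, so differently formatted JSON would need adjusted patterns.

```r
# Hand-written stand-in for one award subchunk (spacing copied from the sample).
chunk <- paste0(
  '"awards" : [ { "status" : "unsuccessful", "value" : { "amount" : 76 } }, ',
  '{ "status" : "active", "value" : { "amount" : 41220 } } ]'
)

# Fixed-width lookbehind: match the text that follows '"status" : "'.
status_idx <- gregexpr('(?<="status" : ")[[:alnum:]]+', chunk, perl = TRUE)
status <- unlist(regmatches(chunk, status_idx))

# Same idea for the numeric amounts, converted to numbers afterwards.
amount_idx <- gregexpr('(?<="amount" : )[0-9]+', chunk, perl = TRUE)
amount <- as.numeric(unlist(regmatches(chunk, amount_idx)))
```

Because these patterns depend on the exact whitespace, a JSON parser is usually the more robust choice when the file's formatting is not under your control.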