2016-04-24 194 views
0

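Create new data frame rows based on a column in another data frame

I have two data frames. In the first (df A), the first column is a list of items. In the second (df B), the first column contains items from that list, but in some cases there are several items per row. What I want to do is go through df B and create a new row for each item from df A that occurs in the first column of df B.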

df A

dfA
  Index X
1 1     alpha
2 2     beta
3 3     gamma
4 4     delta

df B

dfB
  list   X
1 1 4    alpha
2 3 2 1  beta
3 4 1 2  gamma
4 3      delta

Desired output

dfC
  Index X
1 1     alpha
2 4     alpha
3 3     beta
4 2     beta
5 1     beta
6 4     gamma
7 1     gamma
8 2     gamma
9 3     delta

The actual data I am working with: df A

dput(head(allwines)) 
structure(list(Wine = c("Albariño", "Aligoté", "Amarone", "Arneis", 
"Asti Spumante", "Auslese"), Description = c("Spanish white wine grape that makes crisp, refreshing, and light-bodied wines.", 
"White wine grape grown in Burgundy making medium-bodied, crisp, dry wines with spicy character.", 
"From Italy’s Veneto Region a strong, dry, long- lived red, made from a blend of partially dried red grapes.", 
"A light-bodied dry wine the Piedmont Region of Italy", "From the Piedmont Region of Italy, A semidry sparkling wine produced from the Moscato di Canelli grape in the village of Asti", 
"German white wine from grapes that are very ripe and thus high in sugar" 
)), .Names = c("Wine", "Description"), row.names = c(NA, 6L), class = "data.frame") 

df B

> dput(head(cheesePairing)) 
structure(list(Wine = c("Cabernet Sauvignon\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Pinot Noir\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Sauvignon Blanc\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Zinfandel", 
"Chianti\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Pinot Noir\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Sangiovese", 
"Chardonnay", "Bardolino\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Malbec\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Riesling\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Rioja\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Sauvignon Blanc", 
"Tempranillo", "Asti Spumante"), Cheese = c("Abbaye De Belloc Cheese", 
"Ardrahan cheese", "Asadero cheese", "Asiago cheese", "Azeitao", 
"Baby Swiss Cheese"), Suggestions = c("Pair with apples, sliced pears OR a sampling of olives and thin sliced salami. Pass around slices of baguette.", 
"Serve with a substantial wheat cracker and apples or grapes.", 
"Rajas (blistered chile strips) fresh corn tortillas", "Table water crackers, raw nuts (almond, walnuts)", 
"Nutty brown bread, grapes", "Server with dried fruits, whole grain, nutty breads, nuts" 
)), .Names = c("Wine", "Cheese", "Suggestions"), row.names = c(NA, 
6L), class = "data.frame") 
+0

It would be helpful if you could edit your question to include your example data in an R-parseable format, e.g. 'dput(dfA)' and 'dput(dfB)'. –

+0

@CurtF. I had added my sample data, but I was worried it might be too messy, so I removed it and boiled it down to the example. –

+1

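I'm not sure what 'dfA' is for. The wines in 'dfB' are separated by runs of extra whitespace, so you could replace those runs with commas using 'cheesePairing$Wine <- gsub('\\s{2,}', ',', cheesePairing$Wine)' and then use [this question](http://stackoverflow.com/questions/28285169/split-comma-separated-column-entry-into-rows) or one of the other similar answers – rawr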
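The approach in the comment above can be sketched in base R as follows. This is a minimal sketch rather than a tested solution for the full data set: the two-row `cheesePairing` below is a cut-down stand-in for the real data, and `wine_list`, `n_wines`, and `long` are names made up for illustration.

```r
# Cut-down stand-in for the scraped data (two rows, whitespace shortened)
cheesePairing <- data.frame(
  Wine = c("Chianti\r\n        \r\n         Pinot Noir\r\n        \r\n         Sangiovese",
           "Chardonnay"),
  Cheese = c("Ardrahan cheese", "Asadero cheese"),
  stringsAsFactors = FALSE
)

# Collapse every run of two or more whitespace characters into one comma;
# single spaces inside a wine name (e.g. "Pinot Noir") are left alone
cheesePairing$Wine <- gsub("\\s{2,}", ",", cheesePairing$Wine)

# Split on the comma and repeat each cheese once per wine in its row
wine_list <- strsplit(cheesePairing$Wine, ",", fixed = TRUE)
n_wines <- lengths(wine_list)
long <- data.frame(
  Wine = unlist(wine_list),
  Cheese = rep(cheesePairing$Cheese, n_wines),
  stringsAsFactors = FALSE
)
long
```

If only wines that also appear in the other data frame are wanted, a final filter such as `long[long$Wine %in% allwines$Wine, ]` would be one way to apply it.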

Answers

2

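Building on Curt's answer, I think I have found a more efficient solution, assuming I have interpreted your goal correctly.

My commented code is below. You should be able to run it as-is and get the desired dfC. One thing to note: I am assuming (based on your actual data) that the delimiter for splitting dfB$Index is "\r\n".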

# set up fake data (stringsAsFactors = FALSE keeps every column as character)
dfA <- data.frame(Index = c('1', '2', '3', '4'),
                  X = c('alpha', 'beta', 'gamma', 'delta'),
                  stringsAsFactors = FALSE)
dfB <- data.frame(Index = c('1 \r\n 4', '3 \r\n 2 \r\n 1', '4 \r\n 1 \r\n 2', '3'),
                  X = c('alpha', 'beta', 'gamma', 'delta'),
                  stringsAsFactors = FALSE)

dfB_index_parsed <- strsplit(x = dfB$Index, split = "\r\n") # split Index of dfB by delimiter "\r\n" and store in a list
names(dfB_index_parsed) <- dfB$X
dfB_split_num_vec <- lengths(dfB_index_parsed) # number of pieces per row of dfB, as a named vector

g <- unlist(dfB_index_parsed, use.names = FALSE) # store all split values in a single vector
g <- trimws(g) # remove leading/trailing whitespace left over from the split
names(g) <- rep(names(dfB_split_num_vec), dfB_split_num_vec) # associate each split Index from dfB with X from dfB
g <- g[g %in% dfA$Index] # keep only the dfB$Index values that are in dfA$Index

dfC <- data.frame(Index = g, X = names(g), row.names = NULL) # construct data.frame
+0

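When I run this I end up with an empty data frame, but the first few steps seem to be heading in the right direction. For some reason the scrape created a lot of extra \r\n, so by the third step the split counts are completely off. I will try removing any whitespace and see if that helps –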

+0

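Oh, that is odd. I just copied my code into another R session and it ran fine. Either way, glad to hear it helped. I think strsplit() + gsub() are essential to any strategy for solving this problem. regexpr() may also help. Also check whether the scraping package you are using has built-in functions for cleaning up these scraped results. – AOGSTA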
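The mismatch described in the earlier comment can be reproduced on a shortened copy of one scraped `Wine` string: splitting on "\r\n" alone leaves whitespace-only pieces, so the per-row counts come out too high. The following is a minimal sketch, assuming the right fix is to trim each piece and drop the blanks; `raw`, `pieces`, and `wines` are illustrative names.

```r
# Shortened copy of one scraped Wine entry (three wines, several blank lines)
raw <- "Chianti\r\n        \r\n       \r\n         Pinot Noir\r\n        \r\n         Sangiovese"

# Splitting on the bare delimiter keeps the whitespace-only lines as pieces
pieces <- strsplit(raw, "\r\n", fixed = TRUE)[[1]]
length(pieces)  # larger than the number of wines

# Trim leading/trailing whitespace, then discard the empty strings
wines <- trimws(pieces)
wines <- wines[nzchar(wines)]
wines
```

Note that `trimws()` only touches the ends of each piece, so a name such as "Pinot Noir" keeps its inner space and can still be matched against another table.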

+0

Is there any evidence that this is actually more efficient? Either way, nice answer and +1 from me. –
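One way to check would be to time both styles on a larger input. The sketch below is not a benchmark of the two answers as posted: `grow_rbind` and `build_once` are hypothetical stand-ins for a row-by-row `rbind` loop and a single vectorised construction, and the input is randomly generated. No timings are quoted because they depend on the machine and the R version.

```r
set.seed(1)
n <- 1000
labels <- paste0("item", seq_len(n))
# each element holds between 1 and 5 indices drawn from 1:4
elements <- lapply(seq_len(n), function(i) sample(1:4, sample(1:5, 1), replace = TRUE))

# Style 1: grow the result one chunk at a time with rbind()
grow_rbind <- function(labels, elements) {
  out <- data.frame()
  for (i in seq_along(elements)) {
    out <- rbind(out, data.frame(index = elements[[i]], X = labels[i],
                                 stringsAsFactors = FALSE))
  }
  out
}

# Style 2: flatten everything and build the data frame once
build_once <- function(labels, elements) {
  data.frame(index = unlist(elements), X = rep(labels, lengths(elements)),
             stringsAsFactors = FALSE)
}

t_loop <- system.time(res_loop <- grow_rbind(labels, elements))
t_vec <- system.time(res_vec <- build_once(labels, elements))

# Both styles should produce the same rows in the same order
stopifnot(identical(res_loop$index, res_vec$index),
          identical(res_loop$X, res_vec$X))

t_loop["elapsed"]
t_vec["elapsed"]
```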

0

First, let me offer a working answer to your question. I doubt my answer is very efficient, but it works.

# construct toy data 
dfA <- data.frame(index = 1:4, X = letters[1:4]) 

dfB <- data.frame(X = letters[1:4]) 
dfB$list_elements <- list(c(1, 4), c(3, 2, 1), c(4, 1, 2), c(3)) 

# define function that provides solution 

unlist_merge_df <- function(listed_df, reference_df){
    # reference_df assumed to have columns "X" and "index"
    # (it is accepted as an argument but not used below)
    # listed_df assumed to have columns "X" and "list_elements"
    df_out <- data.frame(index = c(), X = c())
    my_list <- listed_df$list_elements
    for(idx in seq_along(my_list)){
        # append one row per list element, repeating that row's X value
        df_out <- rbind(df_out,
                        data.frame(index = my_list[[idx]],
                                   X = listed_df[idx, 'X']))
    }
    return(df_out)
}

# call the function 
dfC <- unlist_merge_df(dfB, dfA) 

# show output in human and R-parseable formats 
dfC 

dput(dfC) 

The output is:

index X 
1 1 a 
2 4 a 
3 3 b 
4 2 b 
5 1 b 
6 4 c 
7 1 c 
8 2 c 
9 3 d 

structure(list(index = c(1, 4, 3, 2, 1, 4, 1, 2, 3), X = structure(c(1L, 
1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L), .Label = c("a", "b", "c", "d" 
), class = "factor")), .Names = c("index", "X"), row.names = c(NA, 
9L), class = "data.frame") 

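Second, let me say that the situation you are in is not very desirable. If you can avoid it, you probably should. Either don't use data frames at all and work only with lists, or avoid building a data frame with a list column in the first place (if you can) and construct the desired output directly.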
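As one possible reading of "work only with lists", the toy data could be held as a named list and flattened in a single step with base R's `stack()`. This is a sketch under that assumption, not the answerer's code; `by_label` and `long` are illustrative names. `stack()` returns its columns as `values` and `ind`, so they are renamed here to match the desired output.

```r
# Toy data held as a named list: each name is an X value, each element its indices
by_label <- list(a = c(1, 4), b = c(3, 2, 1), c = c(4, 1, 2), d = 3)

# stack() flattens the list into one row per (value, name) pair
long <- stack(by_label)
names(long) <- c("index", "X")
long
```

The `X` column comes back as a factor; `as.character(long$X)` converts it if plain strings are preferred.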

+1

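Thanks, I know this is not an ideal situation. I got the data through web scraping and am trying to make it usable in a database, but it looks like I may have to produce the right results in the database queries instead and be more explicit there –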
