R: Split a string and assign variables based on the split

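I have a field of semantic tags and semantic tag types. Each tag type/tag pair is separated by a comma, and each tag type and its tag are separated by a colon (see below).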

ID | Semantic Tags 

1 | Person:mitch mcconnell, Person:ashley judd, Position:senator 

2 | Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 

3 | Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 

4 | Person:ashley judd, topicname:politics 

5 | URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc

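I would like to split each tag type (the term before the colon) and tag (the term after the colon) into two separate fields: "Tag Type" and "Tag". The final file should look like this: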

ID | Tag Type | Tag 

1 | Person | mitch McConnell 

1 | Person | ashley judd 

1 | Position | senator 

2 | Person | mitch McConnell 

2 | Position | senator 

2 | State | kentucky

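Here is my code so far...

But after that, I am lost! I believe I need to use lapply or sapply for this, but I don't know where to start...

My apologies if this has already been answered in some other form on the site. I am new to R and this is still a bit complicated for me.

Thanks in advance for any help.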

Source

2013-04-09 NiuBiBang

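Could you please provide a reproducible example using `dput(emtable)` (or `dput(head(emtable))` if that is too much data)? – 2013-04-09 15:03:42

I have reformatted the data so that it looks like the table layout. – NiuBiBang 2013-04-09 15:18:27

Why don't you use `dput`? It makes things easier for answerers. – 2013-04-09 15:21:40

Here is another (slightly different) approach: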
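As a minimal sketch of what the commenters are asking for: the data frame below is an assumed reconstruction of the first two rows of the question's table (the name `emtable` comes from the comment, the column names are guesses). `dput()` prints R code that rebuilds the object, so the output can be pasted straight into a post.

```r
# Assumed reconstruction of the first two rows of the question's table;
# the column names ID and Semantic.Tags are hypothetical.
emtable <- data.frame(
  ID = 1:2,
  Semantic.Tags = c(
    "Person:mitch mcconnell, Person:ashley judd, Position:senator",
    "Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics"
  ),
  stringsAsFactors = FALSE
)

# Print R code that recreates (the head of) the object.
dput(head(emtable))

# Round trip: evaluating the dput() text should give back an equivalent object.
dput_text <- paste(capture.output(dput(emtable)), collapse = "\n")
rebuilt <- eval(parse(text = dput_text))
all.equal(rebuilt, emtable)
```

Anyone answering can then assign the pasted `structure(...)` text to a variable and work with the same data.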

## dat <- readLines(n=5) 
## Person:mitch mcconnell, Person:ashley judd, Position:senator 
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## Person:ashley judd, topicname:politics 
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info 

dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x)) 
# Optional: keep only the Person and Position tags
# dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)]) 
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) # split on ":" not followed by "/"
dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)), 
    do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 
) 

colnames(dat3)[-1] <- c("Tag Type", "Tag") 

## Output for the five rows shown in the question
## (row 5 there is URL, Company, Person, Company):
##    ID        Tag Type                    Tag
## 1   1          Person        mitch mcconnell
## 2   1          Person            ashley judd
## 3   1        Position                senator
## 4   2          Person        mitch mcconnell
## 5   2        Position                senator
## 6   2 ProvinceOrState               kentucky
## 7   2       topicname               politics
## 8   3          Person        mitch mcconnell
## 9   3          Person            ashley judd
## 10  3    Organization                 senate
## 11  3    Organization             republican
## 12  4          Person            ashley judd
## 13  4       topicname               politics
## 14  5             URL www.huffingtonpost.com
## 15  5         Company              usa today
## 16  5          Person             chuck todd
## 17  5         Company                  msnbc

A detailed explanation:

## dat <- readLines(n=5) 
## Person:mitch mcconnell, Person:ashley judd, Position:senator 
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## Person:ashley judd, topicname:politics 
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info 

dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x)) 
# Optional: keep only the Person and Position tags
# dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)]) 
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) # split on ":" not followed by "/"

# Let the explanation begin... 

# Here I have a short list of the variables from the rows 
# of the original dataframe; in this case the row numbers: 

seq_along(dat3)  #row variables 

# Then I use sapply and length to find out how many 
# tag pairs each row (now a list element) contains 

sapply(dat3, length) #n times 

# This tells me how many times to repeat each row number, 
# because the nested dat3 list is about to be flattened into 
# one row per tag pair. Here I rep the row numbers n times: 

rep(seq_along(dat3), sapply(dat3, length)) 

# better assign that for later: 

ID <- rep(seq_along(dat3), sapply(dat3, length)) 

#============================================ 
# Now to explain the next chunk... 
# I take dat3 

dat3 

# Each of the list elements 1-5 is itself a list of 
# character vectors of length 2 (Tag Type, Tag). 
# For instance, element 5 is a list of two 
# character vectors of length 2: 

## [[5]] 
## [[5]][[1]] 
## [1] "URL" "www.huffingtonpost.com" 
## 
## [[5]][[2]] 
## [1] "URL" "http://www.regular-expressions.info" 

# Use str to look at this structure: 

dat3[[5]] 
str(dat3[[5]]) 

## List of 2 
## $ : chr [1:2] "URL" "www.huffingtonpost.com" 
## $ : chr [1:2] "URL" "http://www.regular-expressions.info" 

# I use lapply (list apply) to apply an anonymous function: 
# function(x) do.call(rbind, x) 
# 
# to each of the 5 elements. This glues the list of vectors 
# together into a matrix. Observe it on element 5 alone: 

do.call(rbind, dat3[[5]]) 

##      [,1]  [,2]                                 
## [1,] "URL" "www.huffingtonpost.com"             
## [2,] "URL" "http://www.regular-expressions.info"

# We use lapply to do that to all elements: 

lapply(dat3, function(x) do.call(rbind, x)) 

# We then call do.call(rbind, ...) on that list of matrices 
# to get a single matrix 

do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 

# Let's assign that for later: 

the_mat <- do.call(rbind, lapply(dat3, function(x) do.call(rbind, x))) 

#============================================  
# Now we put it all together with data.frame: 

data.frame(ID, the_mat)
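
The steps explained above can be collected into one function. This is only a sketch: `split_tags` is a hypothetical helper name, `trimws()` and `lengths()` stand in for the `gsub()` and `sapply(..., length)` calls of the answer (they need R 3.2.0 or later), and it assumes every piece contains exactly one colon that is not followed by "/".

```r
# split_tags() is a hypothetical helper name, not part of the answer above.
split_tags <- function(tags) {
  # 1. Split each row on commas and trim surrounding whitespace.
  pieces <- lapply(strsplit(tags, ","), trimws)
  # 2. Split each piece on a colon that is not followed by "/".
  pairs <- lapply(pieces, strsplit, ":(?!/)", perl = TRUE)
  # 3. One row number per pair, repeated by the number of pairs in that row.
  id <- rep(seq_along(pairs), lengths(pairs))
  # 4. Bind the pairs within each row, then across rows, into a 2-column matrix.
  mat <- do.call(rbind, lapply(pairs, function(x) do.call(rbind, x)))
  data.frame(ID = id, Tag.Type = mat[, 1], Tag = mat[, 2],
             stringsAsFactors = FALSE)
}

dat <- c(
  "Person:mitch mcconnell, Person:ashley judd, Position:senator",
  "Person:ashley judd, topicname:politics",
  "URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info"
)
res <- split_tags(dat)
print(res)
```

The ID column here is simply the position of each string in the input vector; if the real data carries its own ID column, that column can be repeated with `rep()` in the same way.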

Source

2013-04-09 15:29:14

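This seems to do the trick. However, when I run the third command, `do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))`, I get the following message: Error in function (..., deparse.level = 1): number of columns of matrices must match (see arg 2). In addition: There were 50 or more warnings (use warnings() to see the first 50) – NiuBiBang 2013-04-10 18:23:54

This problem is specific to your data, which is not like the data you show here. You can use a debugging tool such as `debug` to track down the first problem; for the second, I would do what the message says and use `warnings()` to see more specifically why you are getting those warnings. – 2013-04-10 18:58:11

Yes, I see that one of my tag types is URL, which often contains "http:". That ends up giving the matrices a non-uniform number of columns when splitting on ":". So I just added a line of code to remove "http:" between the first and second strsplit calls. – NiuBiBang 2013-04-14 01:36:55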
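
The failure described in these comments can be reproduced on a small scale. The sketch below uses two made-up input strings built from tags that appear in this thread; it first splits on every colon, which breaks the URL into three parts, and then uses the lookahead pattern from the answer, which keeps the URL in one piece.

```r
# Two illustrative rows; the second tag contains "http:".
dat <- c("Person:chuck todd, Company:msnbc",
         "URL:http://www.regular-expressions.info")
pieces <- lapply(strsplit(dat, ","), trimws)

# Naive split on every colon: row 2 yields 3 parts instead of 2.
naive <- lapply(pieces, strsplit, ":")
naive_mats <- lapply(naive, function(x) do.call(rbind, x))
naive_cols <- sapply(naive_mats, ncol)
naive_cols

# Binding matrices with different column counts fails; capture the message.
naive_result <- tryCatch(do.call(rbind, naive_mats),
                         error = function(e) conditionMessage(e))
naive_result

# Split only on a colon that is not followed by "/": every piece has 2 parts.
fixed <- lapply(pieces, strsplit, ":(?!/)", perl = TRUE)
fixed_mat <- do.call(rbind, lapply(fixed, function(x) do.call(rbind, x)))
fixed_mat
```

Removing "http:" before the second split, as the asker did, also gives uniform pieces, at the cost of altering the stored URL.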

DF 
## ID                     Semantic.Tags 
## 1 1         Person:mitch mcconnell, Person:ashley judd, Position:senator 
## 2 2  Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics 
## 3 3  Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican 
## 4 4               Person:ashley judd, topicname:politics 
## 5 5    URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc 


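# Split each row on commas, then split each piece on ":" (a nested list)
ll <- lapply(strsplit(DF$Semantic.Tags, ","), strsplit, split = ":") 

# Helper: row-bind all elements of a list into one matrix
f <- function(x) do.call(rbind, x) 

# Bind within each row, then across rows; whitespace around the commas is not trimmed
f(lapply(ll, f)) 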
##  [,1]    [,2]      
## [1,] "  Person"  "mitch mcconnell"  
## [2,] " Person"   "ashley judd"   
## [3,] " Position"  "senator"    
## [4,] "  Person"  "mitch mcconnell"  
## [5,] " Position"  "senator"    
## [6,] " ProvinceOrState" "kentucky"    
## [7,] " topicname"  "politics "    
## [8,] "  Person"  "mitch mcconnell"  
## [9,] " Person"   "ashley judd"   
## [10,] " Organization" "senate"     
## [11,] " Organization" "republican "   
## [12,] "  Person"  "ashley judd"   
## [13,] " topicname"  "politics"    
## [14,] "  URL"   "www.huffingtonpost.com" 
## [15,] " Company"   "usa today"    
## [16,] " Person"   "chuck todd"    
## [17,] " Company"   "msnbc"
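
The matrix above has no ID column and keeps the spaces that followed each comma. A possible extension, sketched here on an assumed two-row version of `DF`, repeats the IDs with `rep()` and trims the whitespace with `trimws()` (R 3.2.0 or later); the column names of `out` are illustrative.

```r
# Assumed two-row stand-in for the DF shown above.
DF <- data.frame(
  ID = 1:2,
  Semantic.Tags = c("Person:ashley judd, topicname:politics",
                    "URL:www.huffingtonpost.com, Company:usa today"),
  stringsAsFactors = FALSE
)

# Same nested split as in the answer; as.character() guards against factors.
ll <- lapply(strsplit(as.character(DF$Semantic.Tags), ","), strsplit, split = ":")
f <- function(x) do.call(rbind, x)
m <- f(lapply(ll, f))

# Repeat each ID once per tag pair and trim the leftover whitespace.
out <- data.frame(ID = rep(DF$ID, lengths(ll)),
                  Tag.Type = trimws(m[, 1]),
                  Tag = trimws(m[, 2]),
                  stringsAsFactors = FALSE)
print(out)
```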

Source

2013-04-09 15:18:51

(+1) Or `matrix(rapply(ll, rbind), ncol = 2, byrow = TRUE)` for the last two steps. – Henrik 2013-04-09 15:25:44

Or, more transparently: `matrix(rapply(ll, identity), ncol = 2, byrow = TRUE)` – Henrik 2013-04-09 15:31:24

Thanks guys, I actually used a combination of code from the three approaches above. It ended up working. – NiuBiBang 2013-04-14 01:33:46
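
To see why the `rapply` shortcut should agree with the nested `do.call(rbind, ...)` calls, here is a small comparison on made-up input. The `unlist()` line is an extra variant that is not from the comments; it relies on the same flattening order (type, tag, type, tag, ...) and, like the others, assumes every piece splits into exactly two parts.

```r
# Small illustrative input, already free of spaces after the commas.
tags <- c("Person:ashley judd,topicname:politics", "Company:msnbc")
ll <- lapply(strsplit(tags, ","), strsplit, split = ":")

# Nested rbind, as in the answer.
f <- function(x) do.call(rbind, x)
a <- f(lapply(ll, f))

# Henrik's variant: flatten all leaves, then refill a 2-column matrix by row.
b <- matrix(rapply(ll, identity), ncol = 2, byrow = TRUE)

# Extra variant (not from the thread): unlist() flattens in the same order.
d <- matrix(unlist(ll), ncol = 2, byrow = TRUE)

# Compare the three results element by element.
all(a == b)
all(b == d)
```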
