2016-11-26 86 views

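Text mining in R: searching and extracting information

Is there a way to search for patterns in rows of data and then store the matches in separate columns of a new table? For example, I need to pull the amount, bills, and coins out of the body column below. Do you think this can be done in R?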

user_id |  ts |     body     | address |  
3633|  2016-09-29| A wallet with amount = $ 100 has been found with 4 bills and 5 coins| TEST |  
4266|  2016-07-20| A purse having amount = $ 150 has been found with 40 bills and 15 coins| NAME | 
7566|  2016-07-20| A pocket having amount = $ 200 has been found with 4 bills and 5 coins| HELLO | 

Desired result:

user_id | Amount | Bills| Coins| 
3633  | $100 | 4 |  5| 
4266  | $150 | 40 | 15| 
7566  | $200 | 4 | 5| 

Yes, this is possible. You will want to use regular expressions. See `?regex`. Something to this effect: http://stackoverflow.com/questions/14159690/regex-grep-strings-containing-us-currency

Answer


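Here is one solution using stringr and lapply, though there must be others. First, a subset containing only the user.id and body columns would look something like this: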

df <- data.frame(user.id = c(3633, 4266, 7566), 
     body = c("A wallet with amount = $ 100 has been found with 4 bills and 5 coins", 
       "A purse having amount = $ 150 has been found with 40 bills and 15 coins", 
       "A pocket having amount = $ 200 has been found with 4 bills and 5 coins")) 

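Now we apply a regular expression to every row of df to extract the numbers into a list, unlist it, convert the result to a matrix, transpose it, assign column names, and cbind the user.id column from the original data frame.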

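library(stringr) 
# Pull every run of digits out of each column, keep the matches for body ([2]),
# and flatten them. Each row yields 3 numbers, so fill a 3-row matrix
# column by column and transpose it to get one row per record.
mat <- t(matrix(unlist(lapply(df, str_match_all, "[0-9]+")[2]), nrow = 3)) 
colnames(mat) <- c("Amount", "Bills", "Coins") 
outputdf <- cbind(df[1], mat) 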

This gives:

> outputdf 
# user.id Amount Bills Coins 
#1 3633 100  4  5 
#2 4266 150 40 15 
#3 7566 200  4  5 
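
One thing to watch for: the values come out of a character matrix, so the Amount, Bills, and Coins columns hold strings (or factors in older R versions), not numbers. Below is a minimal base-R sketch of converting them and building a "$100"-style label like the one in the desired result. The data frame is a hand-written stand-in for outputdf, and the names num_cols and AmountLabel are made up for illustration.

```r
# Stand-in for outputdf: the extracted values are character strings
outputdf <- data.frame(user.id = c(3633, 4266, 7566),
                       Amount = c("100", "150", "200"),
                       Bills = c("4", "40", "4"),
                       Coins = c("5", "15", "5"),
                       stringsAsFactors = FALSE)

# Hypothetical helper vector naming the columns to convert
num_cols <- c("Amount", "Bills", "Coins")

# as.character() first guards against factor columns in older R versions
outputdf[num_cols] <- lapply(outputdf[num_cols],
                             function(x) as.numeric(as.character(x)))

# Optional display column in the "$100" style of the desired result
outputdf$AmountLabel <- paste0("$", outputdf$Amount)
```

After this, arithmetic such as sum(outputdf$Coins) works as expected.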

I'm sure there are probably more suitable ways of doing this too.
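
One possible alternative, sketched here in base R, is to use capture groups tied to the surrounding words instead of grabbing every run of digits. That way each number is identified by its context ("amount = $", "bills", "coins") rather than by its position. The pattern assumes every body string follows the wording in the sample data. The names pattern, fields, and result are illustrative and not part of the original answer.

```r
# Sample data mirroring the question's body column
df <- data.frame(
  user.id = c(3633, 4266, 7566),
  body = c("A wallet with amount = $ 100 has been found with 4 bills and 5 coins",
           "A purse having amount = $ 150 has been found with 40 bills and 15 coins",
           "A pocket having amount = $ 200 has been found with 4 bills and 5 coins"),
  stringsAsFactors = FALSE
)

# One capture group per field, anchored on the words around each number
pattern <- "amount = \\$ *([0-9]+)[^0-9]*([0-9]+) bills and ([0-9]+) coins"

# regexec() + regmatches() return, per row, c(full match, group 1, group 2, group 3)
m <- regmatches(df$body, regexec(pattern, df$body))

# Drop the full match; rows that do not match the pattern become NA
fields <- t(vapply(m, function(x) {
  if (length(x) == 4) as.numeric(x[-1]) else rep(NA_real_, 3)
}, numeric(3)))

result <- data.frame(user.id = df$user.id,
                     Amount = paste0("$", fields[, 1]),
                     Bills = fields[, 2],
                     Coins = fields[, 3],
                     stringsAsFactors = FALSE)
```

Because the pattern names its context, an extra number elsewhere in the text (a date, say) would not shift the columns, which the extract-all-digits approach cannot guarantee.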