2017-03-02
1

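Insert special characters based on existing words in R

I have been looking for an intuitive solution to my problem. I have a huge list of words into which I have to insert a special character based on some conditions. If a two- or three-letter word appears in a cell, I want to add "+" to the left and right of it: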

global b2b banking would become global +b2b+ banking

how to finance commercial ale estate would become how +to+ finance commercial +ale+ estate

Here is a sample dataset:

sample <- c("commercial funding", 
"global b2b banking", 
"how to finance commercial ale estate", 
"opening a commercial account", 
"international currency account", 
"miami imports banking", 
"hsbc supply chain financing", 
"international business expansion", 
"grow business in Us banking", 
"commercial trade Asia Pacific", 
"business line of credits hsbc", 
"Britain commercial banking", 
"fx settlement hsbc", 
"W Hotels") 
data <- data.frame(sample) 

Also, is it possible to remove the rows that contain a word of length 1? Example:

W Hotels 

I tried removing all single-letter words with gsub:

gsub(" *\\b[[:alpha:]]{1,1}\\b *", " ", sample) 

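Such a row should be removed from the dataset altogether.

Any help is highly appreciated.

Edit 1

Thanks for the help. I added a few more lines to it: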

sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels") 
sample <- sample[!grepl("\\b[[:alpha:]]\\b",sample)] 
sample <- gsub("\\b([[:alpha:][:digit:]]{2,3})\\b", "+\\1+", sample) 
sample <- gsub(" ",",",sample) 
sample <- gsub("+,","+",sample) 
sample <- gsub(",+","+",sample) 
sample <- tolower(sample) 
sample <- ifelse(substr(sample, 1, 1) == "+", sub("^.", "", sample), sample) 
data <- data.frame(sample) 
data 




              sample 
1        commercial++funding 
2       global+++b2b+++banking 
3 how++++to+++finance++commercial+++ale+++estate 
4    international++currency++account 
5       miami++imports++banking 
6     hsbc++supply++chain++financing 
7    international++business++expansion 
8    grow++business+++in++++us+++banking 
9    commercial++trade++asia++pacific 
10   business++line+++of+++credits++hsbc 
11     britain++commercial++banking 
12       fx+++settlement++hsbc 

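Somehow I am unable to replace "+," with "+" using gsub. What am I doing wrong? For example, "fx+,settlement,hsbc" should become "fx+settlement,hsbc", but instead the commas are replaced with additional ++.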

+0

So, you mean you want to remove any item that contains a whole word consisting of just one letter?

+0

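Yes. If any row, even one with several words, contains a word of length one, I want to remove that row. Then, in the remaining rows, I want to add the special character "+" before and after every two-letter and three-letter word. – PSraj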

+1

OK, so what have you tried?

Answers

2

You need to do this in 2 steps: first remove the items that contain a 1-letter whole word, then wrap the 2-3 letter words in +.

Use

sample <- c("commercial funding", "global b2b banking", "how to finance commercial ale estate", "opening a commercial account","international currency account","miami imports banking","hsbc supply chain financing","international business expansion","grow business in Us banking", "commercial trade Asia Pacific","business line of credits hsbc","Britain commercial banking","fx settlement hsbc", "W Hotels") 
sample <- sample[!grepl("\\b[[:alnum:]]\\b",sample)] 
sample <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample) 
data <- data.frame(sample) 
data 

R demo

sample[!grepl("\\b[[:alnum:]]\\b",sample)] removes the items that match the pattern: a word boundary (\b), a single letter or digit ([[:alnum:]]), and another word boundary.

The gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", sample) line replaces every whole word of 2-3 letters or digits with the same word enclosed in +.
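
As a small illustration of the two steps, here is a sketch on a three-element vector. The names x, kept, and wrapped are made up for this example and are not part of the code above.

```r
# Hypothetical toy input (a subset of the kind of strings in the question)
x <- c("W Hotels", "global b2b banking", "fx settlement hsbc")

# Step 1: keep only the items that have no standalone one-character word
kept <- x[!grepl("\\b[[:alnum:]]\\b", x)]

# Step 2: wrap every whole word of 2-3 letters/digits in "+"
wrapped <- gsub("\\b([[:alnum:]]{2,3})\\b", "+\\1+", kept)

print(wrapped)
# [1] "global +b2b+ banking" "+fx+ settlement hsbc"
```

The filter has to run first: once the short words are wrapped, the row with the one-letter word would still be present in the data.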

Result:

         sample 
1       commercial funding 
2      global +b2b+ banking 
3 +how+ +to+ finance commercial +ale+ estate 
4    international currency account 
5      miami imports banking 
6     hsbc supply chain financing 
7   international business expansion 
8    grow business +in+ +Us+ banking 
9    commercial trade Asia Pacific 
10   business line +of+ credits hsbc 
11     Britain commercial banking 
12      +fx+ settlement hsbc 

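Note that W Hotels and opening a commercial account get filtered out.

Answer to the edit

You added some replacement code, but those replacements are meant to match literal strings, so you just need to pass the fixed=TRUE argument: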

sample <- gsub(" ",",",sample, fixed=TRUE) 
sample <- gsub("+,","+",sample, fixed=TRUE) 
sample <- gsub(",+","+",sample, fixed=TRUE) 

Otherwise, + is treated as a regex quantifier; in a regex it must be escaped to match a literal plus sign.
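
To make the two options concrete, here is a minimal sketch comparing fixed=TRUE with an escaped regex. The variable names s, a, and b are invented for this illustration, and the input string is a hypothetical intermediate value of the kind the edited code produces.

```r
# Hypothetical intermediate value, after spaces were turned into commas
s <- "+fx+,settlement,hsbc"

# Option 1: treat the pattern as a literal string
a <- gsub("+,", "+", s, fixed = TRUE)

# Option 2: keep regex matching, but escape the plus sign
b <- gsub("\\+,", "+", s)

print(a)
# [1] "+fx+settlement,hsbc"
print(identical(a, b))
# [1] TRUE
```

Either form should work here; fixed=TRUE is probably the simpler choice when no regex features are needed.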

Also, if you need to remove all + characters from the start of the string, use

sample <- sub("^\\++", "", sample) 
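
A quick sketch of what that pattern does, using made-up inputs (y and stripped are hypothetical names): the escaped \\+ matches a literal plus, the trailing + repeats it one or more times, and ^ anchors the match at the start of the string, so only leading plus signs are dropped.

```r
# Hypothetical inputs: one, two, and zero leading plus signs
y <- c("+fx+,settlement", "++how+,+to+", "commercial,funding")

# Remove the whole run of plus signs at the start of each string
stripped <- sub("^\\++", "", y)

print(stripped)
# [1] "fx+,settlement"     "how+,+to+"          "commercial,funding"
```

Unlike the substr/ifelse approach in the edited question, which drops a single leading character, this removes every leading plus sign in one call.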
+1

If 'b2b' is to become '+b2b+', you need to include '[:digit:]' in the pattern. – coletl

+0

I replaced every '[[:alpha:]]' (letters only) with '[[:alnum:]]' (letters and digits). I will let the OP decide what to filter on and what to wrap.

+0

Your solution works great. There is just one last thing I am stuck on: I cannot gsub "+," into just "+". Can you help? – PSraj