Tokenizing words that contain Unicode characters

I am working on an NLP project that involves emojis.

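Here is an example tweet:
"sometimes i wish i wa an octopus so i could slap 8 people at once🐙"

My problem is that once🐙 is treated as a single word, so I would like to split that one word into two, so that my tweet looks like this:
"sometimes i wish i wa an octopus so i could slap 8 people at once 🐙"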

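Note that I already have a compiled regular expression that contains every emoji!

I am looking for an efficient way to do this because I have hundreds of thousands of tweets, but I cannot figure out where to start.

Thanks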

Source

2016-02-29 Thomas Reynaud

Couldn't you just do this:

>>> import re
>>> s = "sometimes i wish i wa an octopus so i could slap 8 people at once🐙"
>>> re.findall(r"(\w+|[^\w ]+)", s)
['sometimes', 'i', 'wish', 'i', 'wa', 'an', 'octopus', 'so', 'i', 'could', 'slap', '8', 'people', 'at', 'once', '🐙']

If you need them as a single space-separated string again, just join them:

>>> " ".join(re.findall(r"(\w+|[^\w ]+)", s))
'sometimes i wish i wa an octopus so i could slap 8 people at once 🐙'

Edit: Fixed.
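Since the question mentions hundreds of thousands of tweets, one way to apply this in bulk is to compile the pattern once and reuse it. This is only a minimal sketch, assuming Python 3 `str` input; the names `TOKEN_RE` and `tokenize` are hypothetical and not part of the answer, and the emoji is written as an escape so the example does not depend on the file encoding.

```python
import re

# Same pattern as in the answer above, compiled once so it can be reused.
# TOKEN_RE and tokenize are illustrative names, not from the original answer.
TOKEN_RE = re.compile(r"(\w+|[^\w ]+)")


def tokenize(tweet):
    """Return the tweet with words and symbol runs separated by single spaces."""
    return " ".join(TOKEN_RE.findall(tweet))


tweets = [
    # "\U0001F419" is the octopus emoji.
    "sometimes i wish i wa an octopus so i could slap 8 people at once\U0001F419",
    "no emoji here",
]

spaced = [tokenize(t) for t in tweets]
```

A tweet without any emoji comes back unchanged apart from whitespace normalization, because every run of spaces is collapsed by the join.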

Source

2016-02-29 03:07:18 L3viathan

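Thanks for your quick answer. Could you explain how the regex finds the boundary between the actual words and the Unicode characters?

@ThomasReynaud It first tries to match a sequence of word characters (`\w`), but no emoji is part of that class. Once it has matched "once", the match stops because it cannot match any more word characters. From that position it then searches for the next match, trying to find anything that is *not a space character*. Actually, I think this approach is flawed if the emoji is not at the end; let me test some more. – L3viathan

@ThomasReynaud I changed the regex slightly. It now matches either a sequence of word characters, or any sequence of characters that are neither spaces nor word characters. – L3viathan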
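To make the two branches of the revised regex concrete, here is a small sketch, assuming Python 3, where `\w` is Unicode-aware and emojis are not word characters. The sample strings and the variable names are made up for illustration.

```python
import re

pattern = re.compile(r"(\w+|[^\w ]+)")
octopus = "\U0001F419"  # the octopus emoji, written as an escape

# The emoji is glued to a word on both sides.
text = "at once" + octopus + "foobar ok"

# \w+ consumes "at", "once", "foobar" and "ok"; [^\w ]+ consumes the emoji.
# Spaces match neither branch, so findall simply skips over them.
tokens = pattern.findall(text)

# Consecutive non-word, non-space characters stay together as one token.
doubled = pattern.findall("hi" + octopus + octopus)
```

Here `tokens` separates the emoji from both neighbouring words, while `doubled` keeps the two adjacent emojis in a single token, since `[^\w ]+` is greedy.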

You can use re.sub to insert a space:

re.sub(r'(\W+)(?= |$)', r' \1', string)

Example:

>>> string
'sometimes i wish i wa an octopus so i could slap 8 people at once\xf0\x9f\x90\x99'
>>> re.sub(r'(\W+)(?= |$)', r' \1', string)
'sometimes i wish i wa an octopus so i could slap 8 people at once \xf0\x9f\x90\x99'

>>> string = 'sometimes i wish i wa an octopus so i could slap 8 people at once\xf0\x9f\x90\x99 foobar'
>>> re.sub(r'(\W+)(?= |$)', r' \1', string)
'sometimes i wish i wa an octopus so i could slap 8 people at once \xf0\x9f\x90\x99 foobar'
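The transcript above shows UTF-8 byte strings, where the emoji appears as `\xf0\x9f\x90\x99`. As a rough sketch of the same substitution on Python 3 `str` values (an assumption, since the answer does not say which version it targets), with a hypothetical helper name `space_before_symbols`:

```python
import re

octopus = "\U0001F419"  # the octopus emoji as a single code point


def space_before_symbols(text):
    """Insert a space before a run of non-word characters that is
    followed by a space or by the end of the string."""
    return re.sub(r"(\W+)(?= |$)", r" \1", text)


at_end = space_before_symbols("slap 8 people at once" + octopus)
in_middle = space_before_symbols("slap 8 people at once" + octopus + " foobar")
```

The single spaces between ordinary words are left alone: each one is followed directly by a word character, so the lookahead `(?= |$)` fails there.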

Source

2016-02-29 03:22:32 heemayl
