2016-04-25 48 views
0

I am looking for a way to go through a sentence and work out whether an apostrophe marks a quote or a contraction, so that I can remove the punctuation from the string and then normalize all the words.

My test sentence is: don't frazzel the horses. 'she said wow'.

In my attempt so far I have split the sentence into word and non-word parts, tokenizing on words and non-words like so:

contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"] 

sentence = "don't frazzel the horses. 'she said wow'.".split(/(\w+)|(\W+)/).reject { |word| word.empty? } 

This returns ["don", "'", "t", " ", "frazzel", " ", "the", " ", "horses", ". '", "she", " ", "said", " ", "wow", "'."]

Next I want to iterate over the tokens looking for an apostrophe '. When one is found, compare the next element to see whether it is contained in the contractionEndings array. If it is, join the prefix, the apostrophe ', and the suffix into a single index; otherwise delete the apostrophe.

In this example don, ', and t would be joined into don't as a single index, while . ' and '. would be removed.

After that I can run a regex to remove the remaining punctuation from the sentence, so I can pass it to my stemmer to normalize the input.

The final output I am after is don't frazzel the horses she said wow, where all punctuation has been removed except the apostrophes in contractions.
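The join-then-strip pipeline described above can be sketched in plain Ruby (a rough sketch of the idea; the helper name and structure are illustrative, not a definitive implementation):

```ruby
CONTRACTION_ENDINGS = %w[d l ll m re s t ve]

def normalize(sentence)
  # Tokenize into word and non-word runs, as in the question.
  tokens = sentence.split(/(\w+)|(\W+)/).reject(&:empty?)
  joined = []
  i = 0
  while i < tokens.length
    # A contraction shows up as: word, a bare apostrophe token, a known ending.
    # Because the split groups consecutive non-word characters together, a bare
    # "'" token only occurs between two word characters; quote marks adjoin
    # spaces or dots (". '", "'.") and therefore fail this check.
    if tokens[i + 1] == "'" && CONTRACTION_ENDINGS.include?(tokens[i + 2])
      joined << tokens[i] + "'" + tokens[i + 2]
      i += 3
    else
      joined << tokens[i]
      i += 1
    end
  end
  # Drop the leftover punctuation/space tokens and rejoin.
  joined.select { |t| t =~ /\w/ }.join(' ')
end

normalize("don't frazzel the horses. 'she said wow'.")
    #=> "don't frazzel the horses she said wow"
```

On the test sentence this produces the desired output, but it inherits the ambiguity the answers discuss: a word that legitimately begins or ends with an apostrophe (possessives, 'Twas) would still be mishandled.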

If anyone has suggestions for making this work, or a better idea of how to approach the problem, I would like to hear it.

Overall, I want to remove all punctuation from a sentence except for contractions.

Thanks

+0

What result are you expecting at the end? – Ilya

+0

@Ilya don't frazzel the horses she said wow –

+2

Why the rush to select an answer? Why not wait at least until those working on answers have had a chance to post? –

Answers

1

As I mentioned in the comments, I think trying to list all possible contraction endings is futile. In fact, some contractions, such as "couldn't've", contain more than one apostrophe.

Another option is to match single quotes. My first thought was to remove the character "'" whenever it is at the beginning of the sentence or follows a whitespace character, or whenever it precedes a whitespace character or is at the end of the sentence. Unfortunately, that approach is confounded by possessives of words ending in "s": "Chris' cat has fleas". Worse, how would we interpret "Where's 'Chris' car'?" or "'Twas the 'night before Christmas'."?

Here is a way to remove single quotes when no apostrophes appear at the beginning or end of a word (those being the problematic cases).

r = /
    (?<=\A|\s) # match the beginning of the string or a whitespace char in a
               # positive lookbehind
    \'         # match a single quote
    |          # or
    \'         # match a single quote
    (?=\s|\z)  # match a whitespace char or the end of the string in a
               # positive lookahead
    /x         # free-spacing regex definition mode

"don't frazzel the horses. 'she said wow'".gsub(r,'') 
    #=> "don't frazzel the horses. she said wow" 
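To illustrate the possessive edge case mentioned above (this check is mine, not part of the original answer), the same pattern, written on one line, also strips a possessive apostrophe that precedes a space:

```ruby
# Same regex as above, in one-line form.
r = /(?<=\A|\s)'|'(?=\s|\z)/

# The possessive apostrophe in "Chris'" is followed by a space,
# so it is removed exactly as a closing quote would be.
"Chris' cat has fleas".gsub(r, '')
    #=> "Chris cat has fleas"
```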

I think the best solution would be for English to adopt different symbols for apostrophes and single quotes.

0

Usually the apostrophe in a contraction remains attached after tokenization.

Try a standard NLP tokenizer, e.g. nltk in Python:

>>> from nltk import word_tokenize 
>>> word_tokenize("don't frazzel the horses") 
['do', "n't", 'frazzel', 'the', 'horses'] 

For multiple sentences:

>>> from string import punctuation 
>>> from nltk import sent_tokenize, word_tokenize 
>>> text = "don't frazzel the horses. 'she said wow'." 
>>> sents = sent_tokenize(text) 
>>> sents 
["don't frazzel the horses.", "'she said wow'."] 
>>> [word for word in word_tokenize(sents[0]) if word not in punctuation] 
['do', "n't", 'frazzel', 'the', 'horses'] 
>>> [word for word in word_tokenize(sents[1]) if word not in punctuation] 
["'she", 'said', 'wow'] 

To flatten the per-sentence token lists from word_tokenize into a single list:

>>> from itertools import chain 
>>> sents 
["don't frazzel the horses.", "'she said wow'."] 
>>> [word_tokenize(sent) for sent in sents] 
[['do', "n't", 'frazzel', 'the', 'horses', '.'], ["'she", 'said', 'wow', "'", '.']] 
>>> list(chain(*[word_tokenize(sent) for sent in sents])) 
['do', "n't", 'frazzel', 'the', 'horses', '.', "'she", 'said', 'wow', "'", '.'] 
>>> [word for word in list(chain(*[word_tokenize(sent) for sent in sents])) if word not in punctuation] 
['do', "n't", 'frazzel', 'the', 'horses', "'she", 'said', 'wow'] 

Note that the single quote remains attached to 'she. Sadly, the simple task of tokenization still has its weaknesses, despite all of today's hype around fancy (deep) machine learning methods =(

It makes mistakes even with formally grammatical text:

>>> text = "Don't frazzel the horses. 'She said wow'." 
>>> sents = sent_tokenize(text) 
>>> sents 
["Don't frazzel the horses.", "'She said wow'."] 
>>> [word_tokenize(sent) for sent in sents] 
[['Do', "n't", 'frazzel', 'the', 'horses', '.'], ["'She", 'said', 'wow', "'", '.']] 
1

You can use the Pragmatic Tokenizer gem, which can detect English contractions:

s = "don't frazzel the horses. 'she said wow'." 
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s) 
=> ["don't", "frazzel", "the", "horses", "she", "said", "wow"] 

s = "'Twas the 'night before Christmas'." 
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s) 
=> ["'twas", "the", "night", "before", "christmas"] 

s = "He couldn’t’ve been right." 
PragmaticTokenizer::Tokenizer.new(punctuation: :none).tokenize(s) 
=> ["he", "couldn’t’ve", "been", "right"] 
+0

PS - Pragmatic Tokenizer also has an [expand contractions](https://github.com/diasks2/pragmatic_tokenizer#expand_contractions) option. – diasks2