2016-09-26 143 views
2

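JavaScript regex with Unicode and punctuation

I have the following test cases for splitting a sentence into words and punctuation tokens, but I don't know how to implement this in JavaScript.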

describe("garden: utils",() => { 
    it("should split correctly",() => { 
    assert.deepEqual(segmentation('Hockey is a popular sport in Canada.'), [ 
     'Hockey', 'is', 'a', 'popular', 'sport', 'in', 'Canada', '.' 
    ]); 

    assert.deepEqual(segmentation('How many provinces are there in Canada?'), [ 
     'How', 'many', 'provinces', 'are', 'there', 'in', 'Canada', '?' 
    ]); 

    assert.deepEqual(segmentation('The forest is on fire!'), [ 
     'The', 'forest', 'is', 'on', 'fire', '!' 
    ]); 

    assert.deepEqual(segmentation('Emily Carr, who was born in 1871, was a great painter.'), [ 
     'Emily', 'Carr', ',', 'who', 'was', 'born', 'in', '1871', ',', 'was', 'a', 'great', 'painter', '.' 
    ]); 

    assert.deepEqual(segmentation('This is David\'s computer.'), [ 
     'This', 'is', 'David', '\'', 's', 'computer', '.' 
    ]); 

    assert.deepEqual(segmentation('The prime minister said, "We will win the election."'), [ 
     'The', 'prime', 'minister', 'said', ',', '"', 'We', 'will', 'win', 'the', 'election', '.', '"' 
    ]); 

    assert.deepEqual(segmentation('There are three positions in hockey: goalie, defence, and forward.'), [ 
     'There', 'are', 'three', 'positions', 'in', 'hockey', ':', 'goalie', ',', 'defence', ',', 'and', 'forward', '.' 
    ]); 

    assert.deepEqual(segmentation('The festival is very popular; people from all over the world visit each year.'), [ 
     'The', 'festival', 'is', 'very', 'popular', ';', 'people', 'from', 'all', 'over', 'the', 'world', 
     'visit', 'each', 'year', '.' 
    ]); 

    assert.deepEqual(segmentation('Mild, wet, and cloudy - these are the characteristics of weather in Vancouver.'), [ 
     'Mild', ',', 'wet', ',', 'and', 'cloudy', '-', 'these', 'are', 'the', 'characteristics', 'of', 'weather', 
     'in', 'Vancouver', '.' 
    ]); 

    assert.deepEqual(segmentation('sweet-smelling'), [ 
     'sweet', '-', 'smelling' 
    ]); 
    }); 

    it("should not split unicoded words",() => { 
    assert.deepEqual(segmentation('hacer a propósito'), [ 
     'hacer', 'a', 'propósito' 
    ]); 

    assert.deepEqual(segmentation('nhà em có con mèo'), [ 
     'nhà', 'em', 'có', 'con', 'mèo' 
    ]); 
    }); 

    it("should group periods",() => { 
    assert.deepEqual(segmentation('So are ... the fishes.'), [ 
     'So', 'are', '...', 'the', 'fishes', '.' 
    ]); 

    assert.deepEqual(segmentation('So are ...... the fishes.'), [ 
     'So', 'are', '......', 'the', 'fishes', '.' 
    ]); 

    assert.deepEqual(segmentation('arriba arriba ja....'), [ 
     'arriba', 'arriba', 'ja', '....' 
    ]); 
    }); 
}); 

Here is the equivalent Python code:

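import re
import string

class Segmentation(BaseNLPProcessor): 
    # re.escape keeps characters such as ], \ and ^ literal inside the character class;
    # re.UNICODE makes \w match accented letters.
    pattern = re.compile(r'(\w+|\.{2,}|[%s])' % re.escape(string.punctuation), re.UNICODE) 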

    @classmethod 
    def ignore_value(cls, value): 
     # type: (str) -> bool 
     return negate(compose(is_empty, string.strip))(value) 

    def split(self): 
     # type:() -> List[str] 
     return filter(self.ignore_value, self.pattern.split(self.value())) 

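I want to write a JavaScript function equivalent to this Python code: it should split the text into Unicode words and punctuation marks, and keep runs of two or more periods together as a single token. The Python version is called like this: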

Segmentation("Hockey is a popular sport in Canada.").split() 

Answers

3

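This is fairly complicated, because RegExp has no negative lookbehind assertions and Unicode support has not been officially released yet (at the moment it is only available in Firefox behind a flag). The code below uses a library (XRegExp) to handle the Unicode classes. If you need the equivalent plain regular expressions, they are huge. Just leave a comment and let me know, and I will update the answer to use exploded plain RegExp statements that contain the Unicode ranges.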

const rxLetterToOther = XRegExp('(\\p{L})((?!\\s)\\P{L})','g'); 
const rxOtherToLetter = XRegExp('((?!\\s)\\P{L})(\\p{L})','g'); 
const rxNumberToOther = XRegExp('(\\p{N})((?!\\s)\\P{N})','g'); 
const rxOtherToNumber = XRegExp('((?!\\s)\\P{N})(\\p{N})','g'); 
const rxPuctToPunct = XRegExp('(\\p{P})(\\p{P})','g'); 
const rxSep = XRegExp('\\s+','g'); 

function segmentation(s) { 
    return s 
    .replace(rxLetterToOther, '$1 $2') 
    .replace(rxOtherToLetter, '$1 $2') 
    .replace(rxNumberToOther, '$1 $2') 
    .replace(rxOtherToNumber, '$1 $2') 
    .replace(rxPuctToPunct, '$1 $2') 
    .split(rxSep); 
} 

Here it is passing all the test cases!
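
For readers on a current engine: Unicode property escapes (`\p{…}` together with the `u` flag) were standardized in ES2018, so the same kind of tokenization can be sketched without a library. The following is a minimal sketch, not the code from this answer; the names `TOKEN` and `tokenize` are made up for illustration. It matches tokens directly instead of inserting spaces and splitting.

```javascript
// Sketch only: `TOKEN` and `tokenize` are illustrative names, not part of the answer above.
// Requires an engine with ES2018 Unicode property escapes (the `u` flag).
// Alternatives, in order:
//   1. a run of letters, combining marks and digits (a "word")
//   2. a run of two or more periods (kept together)
//   3. any single character that is neither whitespace nor part of a word
const TOKEN = /[\p{L}\p{M}\p{N}]+|\.{2,}|[^\s\p{L}\p{M}\p{N}]/gu;

function tokenize(text) {
  // match() with a global regex returns all matches, or null when there are none.
  return text.match(TOKEN) || [];
}

console.log(tokenize('The prime minister said, "We will win the election."'));
```

Unlike Python's `\w`, this sketch treats `_` as punctuation; add `_` to the first character class if underscores should stay inside words.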

window.onbeforeunload = "";
* { margin: 0; padding: 0; border: 0; overflow: hidden; } 
 
object { width: 100%; height: 100%; width: 100vw; height: 100vh; }
<object data="https://fiddle.jshell.net/a3tf68ae/14/show/" />

Edit: I updated the test cases to print the huge RegExp source below the test results. Run the snippet to see the embedded test cases.
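
To give an idea of what the "exploded" form looks like, here is a deliberately small sketch that spells out ranges by hand instead of using `\p{L}` and `\p{N}`. It is an assumption-laden illustration, not the generated source from the fiddle: the ranges cover only ASCII letters and digits, the Latin-1 letters, Latin Extended-A/B and Latin Extended Additional, and the names `WORD`, `TOKEN_RANGES` and `tokenizeWithRanges` are hypothetical. A full replacement for `\p{L}` needs far more ranges.

```javascript
// Sketch only: a partial, hand-written stand-in for \p{L} and \p{N}.
// Covered: A-Z, a-z, 0-9, Latin-1 letters (skipping U+00D7 and U+00F7, which are
// math signs), Latin Extended-A/B, and Latin Extended Additional (U+1E00-U+1EFF).
// Every other script is NOT covered.
var WORD = 'A-Za-z0-9\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u024F\\u1E00-\\u1EFF';

// Same three alternatives as a \p{...} version: word run, period run, single other character.
var TOKEN_RANGES = new RegExp('[' + WORD + ']+|\\.{2,}|[^\\s' + WORD + ']', 'g');

function tokenizeWithRanges(text) {
  // No `u` flag is needed, so this also runs on engines without ES2018 support.
  return text.match(TOKEN_RANGES) || [];
}

console.log(tokenizeWithRanges('Emily Carr, who was born in 1871, was a great painter.'));
```

The sketch assumes precomposed (NFC) input. Text that uses separate combining marks would be split at each mark, so such input would need `text.normalize('NFC')` first, or the combining-mark ranges added to `WORD`.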

+0

https://jsfiddle.net/hungphan/9u0javhg/ –

+0

@HungPhan There you go, man. That was a tough one. – TylerY86

+0

Thanks, @TylerY86. –

1

I found an answer, but it is complicated. Does anyone have a simpler answer to this?

module.exports = (string) => { 
    const segs = string.split(/(\.{2,}|!|"|#|$|%|&|'|\(|\)|\*|\+|,|-|\.|\/|:|;|<|=|>|\?|¿|@|[|]|\\|^|_|`|{|\||}|~|)/); 

    return segs.filter((seg) => seg.trim() !== ""); 
}; 
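
For comparison, a `split`-based version that mirrors the Python pattern more closely might look like the sketch below. It is a hypothetical variant, not the code posted above: it puts the punctuation into a single escaped character class, adds a word alternative so that words are separated from each other, and assumes an engine with ES2018 Unicode property escapes. The names `SPLIT_TOKEN` and `segmentBySplit` are illustrative.

```javascript
// Sketch only: `SPLIT_TOKEN` and `segmentBySplit` are illustrative names.
// Because the whole pattern is one capturing group, split() keeps each matched
// token in its result, much like re.split with a capturing pattern in Python.
// `_` is part of the word alternative, mirroring Python's \w.
const SPLIT_TOKEN = /([\p{L}\p{M}\p{N}_]+|\.{2,}|[!"#$%&'()*+,\-.\/:;<=>?¿@[\\\]^`{|}~])/u;

const segmentBySplit = (text) =>
  text
    .split(SPLIT_TOKEN)          // tokens interleaved with the leftover text between them
    .map((seg) => seg.trim())    // leftovers are mostly whitespace
    .filter((seg) => seg !== ''); // drop the empty pieces

console.log(segmentBySplit('Mild, wet, and cloudy - these are the characteristics of weather in Vancouver.'));
```

Characters that are neither word characters nor in the listed punctuation set (currency signs, for example) are not matched by the pattern; they survive as trimmed leftover pieces rather than being split one by one.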
+0

Your syntax has some errors... Are you sure you pasted it correctly? – TylerY86

+0

I plugged the test cases in here: https://jsfiddle.net/9u0javhg/17/ – TylerY86

+0

FYI, it fails a few of them... – TylerY86