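Combining nltk.RegexpParser grammars
As a next step in learning more about NLP, I am trying to implement a simple heuristic that improves on plain n-gram results.
According to the Stanford collocations PDF linked below, passing "candidate phrases through a part-of-speech filter which only lets through those patterns that are likely to be 'phrases'" yields better results than simply taking the most frequently occurring bigrams. Source: Collocations, pages 143-144: https://nlp.stanford.edu/fsnlp/promo/colloc.pdf
The table on page 144 lists 7 tag patterns. In order, the NLTK POS tag equivalents are: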
JJ NN
NN NN
JJ JJ NN
JJ NN NN
NN JJ NN
NN NN NN
NN IN NN
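To make the filter idea concrete, here is a minimal sketch that keeps only the bigrams and trigrams whose tag sequence is one of the seven patterns listed above. It is plain Python and does not use `nltk.RegexpParser`. The names `PATTERNS`, `coarse` and `candidate_phrases` are hypothetical. Collapsing fine-grained tags such as NNS onto NN is my own assumption. The sample tags are written by hand, so `nltk.pos_tag` may label these words differently.

```python
# The seven tag patterns from the list above, as tuples of coarse tags.
PATTERNS = {
    ("JJ", "NN"), ("NN", "NN"),
    ("JJ", "JJ", "NN"), ("JJ", "NN", "NN"), ("NN", "JJ", "NN"),
    ("NN", "NN", "NN"), ("NN", "IN", "NN"),
}

def coarse(tag):
    """Collapse fine-grained tags (NNS, NNP, JJR, ...) onto NN / JJ (an assumption)."""
    for prefix in ("NN", "JJ"):
        if tag.startswith(prefix):
            return prefix
    return tag

def candidate_phrases(tagged_tokens):
    """Return every bigram, then every trigram, whose tag sequence is in PATTERNS."""
    phrases = []
    for n in (2, 3):
        for i in range(len(tagged_tokens) - n + 1):
            window = tagged_tokens[i:i + n]
            if tuple(coarse(tag) for _, tag in window) in PATTERNS:
                phrases.append(" ".join(word for word, _ in window))
    return phrases

# Hand-tagged sample (an assumption, not real tagger output).
tagged = [("the", "DT"), ("degrees", "NNS"), ("of", "IN"), ("freedom", "NN"),
          ("is", "VBZ"), ("the", "DT"), ("number", "NN"), ("of", "IN"),
          ("values", "NNS")]
print(candidate_phrases(tagged))  # ['degrees of freedom', 'number of values']
```

Because the patterns are held in a set, a window is accepted if it matches any one of them, which is the "OR" behaviour I am after.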
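In the code below, I get the desired results when I apply each of the following grammars independently. However, when I try to combine the same grammars, I do not get the expected results.
In my code, you can see that I uncomment one sentence, uncomment one grammar, run it, and check the result.
I should be able to combine all of the sentences, run them through the combined grammar (only 3 of them in the code below), and get the desired results.
My question is: how do I correctly combine grammars?
I am assuming that combining grammars works like an "OR": find this pattern, or this pattern...
Thanks in advance.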
import nltk
# The following sentences are correctly grouped with <JJ>*<NN>+.
# Should see: 'linear function', 'regression coefficient', 'Gaussian random variable' and
# 'cumulative distribution function'
SampleSentence = "In mathematics, the term linear function refers to two distinct, although related, notions"
#SampleSentence = "The regression coefficient is the slope of the line of the regression equation."
#SampleSentence = "In probability theory, Gaussian random variable is a very common continuous probability distribution."
#SampleSentence = "In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable X, or just distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x."
# The following sentences are correctly grouped with <NN.?>*<V.*>*<NN>
# Should see 'mean squared error' and 'class probability function'.
#SampleSentence = "In statistics, the mean squared error (MSE) of an estimator measures the average of the squares of the errors, that is, the difference between the estimator and what is estimated."
#SampleSentence = "The class probability function is interesting"
# The sentence below is correctly grouped with <NN.?>*<IN>*<NN.?>*.
# should see 'degrees of freedom'.
#SampleSentence = "In statistics, the degrees of freedom is the number of values in the final calculation of a statistic that are free to vary."
SampleSentence = SampleSentence.lower()
print("\nFull sentence: ", SampleSentence, "\n")
tokens = nltk.word_tokenize(SampleSentence)
textTokens = nltk.Text(tokens)
# Determine the POS tags.
POStagList = nltk.pos_tag(textTokens)
# The following grammars work well *independently*
grammar = "NP: {<JJ>*<NN>+}"
#grammar = "NP: {<NN.?>*<V.*>*<NN>}"
#grammar = "NP: {<NN.?>*<IN>*<NN.?>*}"
# Merge several grammars above into a single one below.
# Note that all 3 correct grammars above are included below.
'''
grammar = """
NP:
{<JJ>*<NN>+}
{<NN.?>*<V.*>*<NN>}
{<NN.?>*<IN>*<NN.?>*}
"""
'''
cp = nltk.RegexpParser(grammar)
result = cp.parse(POStagList)
# Print every chunk that the grammar labelled as NP.
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    print("NP Subtree:", subtree)
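One possible reason the merged grammar behaves differently, offered as a sketch rather than a confirmed diagnosis: within a single `NP:` stage, `nltk.RegexpParser` applies the chunk rules one after another, and a chunk rule only matches tokens that are not already inside a chunk. So the rules behave less like an "OR" and more like an ordered pipeline, where an earlier broad rule can consume the nouns a later rule needs. The snippet below compares two rule orders on hand-written tags. The tags are an assumption (`nltk.pos_tag` may tag these words differently), and hand-tagging also means no tagger models are needed. `np_chunks` is a hypothetical helper.

```python
import nltk

# Hand-written tags (an assumption): nltk.pos_tag may label these words differently.
tagged = [("the", "DT"), ("mean", "NN"), ("squared", "VBN"),
          ("error", "NN"), ("is", "VBZ"), ("small", "JJ")]

def np_chunks(grammar, tagged_tokens):
    """Hypothetical helper: return the NP chunks found by `grammar` as strings."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]

# Broad rule first: it chunks each noun on its own before the second rule runs.
broad_first = """
NP:
    {<JJ>*<NN>+}
    {<NN.?>*<V.*>*<NN>}
"""

# Specific rule first: it sees the unchunked noun-verb-noun sequence.
specific_first = """
NP:
    {<NN.?>*<V.*>*<NN>}
    {<JJ>*<NN>+}
"""

print(np_chunks(broad_first, tagged))     # ['mean', 'error']
print(np_chunks(specific_first, tagged))  # ['mean squared error']
```

If this reading is right, listing the more specific patterns before `<JJ>*<NN>+` is one thing to try with the merged grammar above.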
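Help me understand: you do not want to write 3 separate lines like this: grammar = """ NP: {<JJ>*<NN>+} {<NN.?>*<V.*>*<NN>} {<NN.?>*<IN>*<NN.?>*} """. Instead, you need a single-line regex pattern that can accommodate all 3 patterns? –
Hi Rahul. I want to combine the 3 regex patterns in some way so that together they produce the same results they produce separately. I am fine with however it is written, whether in 1, 2, 3 or more lines. I will try the code below over the next few days. Thanks. – RandomTask
Sure, go ahead! I have tried multiple scenarios and it worked. Try it and come back with any other questions –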