「重新」方法很快就失去了動力。命名實體識別是一個非常複雜的話題,超越了答案的範圍。如果你認爲你對這個問題有很好的解決方法,請將它指向Flann O'Brien又名Myles na cGopaleen,Sukarno,Harry S. Truman,J. Edgar Hoover,JK Rowling,數學家L'Hopital,Joe di Maggio, Algernon Douglas-Montagu-Scott和Hugo Max Graf von und zu Lerchenfeld aufKöferingundSchönberg。
更新以下是基於「重新」的方法,可以找到更多有效的案例。但我仍然認爲這不是一個好方法。注:我在我的文本樣本中對巴伐利亞計數的名稱進行了驗證。如果任何人真的想使用這樣的東西,他們應該使用Unicode,並在某個階段(輸入或輸出)規範化空白。
import re
text1 = """Conan Doyle said that the character of Holmes was inspired by Dr. Joseph Bell, for whom Doyle had worked as a clerk at the Edinburgh Royal Infirmary. Like Holmes, Bell was noted for drawing large conclusions from the smallest observations.[1] Michael Harrison argued in a 1971 article in Ellery Queen's Mystery Magazine that the character was inspired by Wendell Scherer, a "consulting detective" in a murder case that allegedly received a great deal of newspaper attention in England in 1882."""
text2 = """Flann O'Brien a.k.a. Myles na cGopaleen, I Zingari, Sukarno and Suharto, Harry S. Truman, J. Edgar Hoover, J. K. Rowling, the mathematician L'Hopital, Joe di Maggio, Algernon Douglas-Montagu-Scott, and Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg."""
pattern1 = r"(?:[A-Z][a-z]+[. ]+)+(?:[A-Z][a-z]+)?"
joiners = r"' - de la du von und zu auf van der na di il el bin binte abu etcetera".split()
pattern2 = r"""(?x)
(?:
(?:[ .]|\b%s\b)*
(?:\b[a-z]*[A-Z][a-z]*\b)?
)+
""" % r'\b|\b'.join(joiners)
def get_names(pattern, text):
for m in re.finditer(pattern, text):
s = m.group(0).strip(" .'-")
if s:
yield s
for t in (text1, text2):
print "*** text: ", t[:20], "..."
print "=== Ned B"
for s in re.finditer(pattern1):
print repr(s.group(0))
print "=== John M =="
for name in get_names(pattern2, t):
print repr(name)
輸出:
C:\junk\so>\python26\python extract_names.py
*** text: Conan Doyle said tha ...
=== Ned B
'Conan Doyle '
'Holmes '
'Dr. Joseph Bell'
'Doyle '
'Edinburgh Royal Infirmary. Like Holmes'
'Bell '
'Michael Harrison '
'Ellery Queen'
'Mystery Magazine '
'Wendell Scherer'
'England '
=== John M ==
'Conan Doyle'
'Holmes'
'Dr. Joseph Bell'
'Doyle'
'Edinburgh Royal Infirmary. Like Holmes'
'Bell'
'Michael Harrison'
'Ellery Queen'
'Mystery Magazine'
'Wendell Scherer'
'England'
*** text: Flann O'Brien a.k.a. ...
=== Ned B
'Flann '
'Brien '
'Myles '
'Sukarno '
'Harry '
'Edgar Hoover'
'Joe '
'Algernon Douglas'
'Hugo Max Graf '
'Lerchenfeld '
'Koefering '
'Schoenberg.'
=== John M ==
"Flann O'Brien"
'Myles na cGopaleen'
'I Zingari'
'Sukarno'
'Suharto'
'Harry S. Truman'
'J. Edgar Hoover'
'J. K. Rowling'
"L'Hopital"
'Joe di Maggio'
'Algernon Douglas-Montagu-Scott'
'Hugo Max Graf von und zu Lerchenfeld auf Koefering und Schoenberg'
能否請您具體地指定一個句子短語的情況下是什麼? (而不僅僅是舉例 - 儘管給出的例子當然也是有用的) – 2009-08-27 20:09:57
你說的「判例」究竟是什麼意思? – Triptych 2009-08-27 20:12:35
我認爲OP想知道如何確定/解析哪些單詞應該有一個大寫字母。 – ChrisW 2009-08-27 20:18:36