2011-11-20 111 views
9

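Natural language parser for parsing sports game data

I'm trying to write a parser for football games. I'm using the term "natural language" very loosely here, so please bear with me, as I know next to nothing about this field.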

Here are some examples of what I'm working with (format: TIME|DOWN&DIST|OFF_TEAM|DESCRIPTION):

04:39|4th and [email protected]|Dal|Mat McBriar punts for 32 yards to NYJ14. Jeremy Kerley - no return. FUMBLE, recovered by NYJ.| 
04:31|1st and [email protected]|NYJ|Shonn Greene rush up the middle for 5 yards to the NYJ21. Tackled by Keith Brooking.| 
03:53|2nd and [email protected]|NYJ|Mark Sanchez rush to the right for 3 yards to the NYJ24. Tackled by Anthony Spencer. FUMBLE, recovered by NYJ (Matthew Mulligan).| 
03:20|1st and [email protected]|NYJ|Shonn Greene rush to the left for 4 yards to the NYJ37. Tackled by Jason Hatcher.| 
02:43|2nd and [email protected]|NYJ|Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins.| 
02:02|1st and [email protected]|NYJ|Shonn Greene rush to the right for 1 yard to the NYJ45. Tackled by Anthony Spencer.| 
01:23|2nd and [email protected]|NYJ|Mark Sanchez pass to the left to LaDainian Tomlinson for 5 yards to the 50. Tackled by Sean Lee.| 

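So far I've written a dumb parser that handles all the easy stuff (play ID, quarter, time, down & distance, offensive team), along with some scripts that fetch this data and clean it up into the format above. Each line becomes a "Play" object that is stored in a database.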
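For reference, the easy part can be sketched roughly as below. This is a minimal sketch, not the actual parser: the `Play` dataclass and the `parse_line` helper are hypothetical names, and the down-and-distance field is kept as a raw string rather than parsed further.

```python
from dataclasses import dataclass


@dataclass
class Play:
    time: str         # game clock, e.g. "04:31"
    down_dist: str    # raw down-and-distance field, left unparsed here
    off_team: str     # offensive team code, e.g. "NYJ"
    description: str  # free-text play description


def parse_line(line: str) -> Play:
    """Split one 'TIME|DOWN&DIST|OFF_TEAM|DESCRIPTION|' record into a Play."""
    # Drop surrounding whitespace and the trailing pipe, then split into at
    # most four fields so a stray '|' inside the description is preserved.
    fields = line.strip().rstrip("|").split("|", 3)
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    time, down_dist, off_team, description = (f.strip() for f in fields)
    return Play(time, down_dist, off_team, description)
```

Calling `parse_line` on one of the lines above yields a `Play` whose `description` attribute holds the text that still needs real parsing.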

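The hardest part here (at least for me) is parsing the description of the play. Here is some of the information I'd like to extract from that string:

Example string: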

"Mark Sanchez pass to the left to Shonn Greene for 7 yards to the NYJ44. Tackled by Mike Jenkins." 

Result:

turnover = False 
interception = False 
fumble = False 
to_on_downs = False 
passing = True 
rushing = False 
direction = 'left' 
loss = False 
penalty = False 
scored = False 
TD = False 
PA = False 
FG = False 
TPC = False 
SFTY = False 
punt = False 
kickoff = False 
ret_yardage = 0 
yardage_diff = 7 
playmakers = ['Mark Sanchez', 'Shonn Greene', 'Mike Jenkins'] 

My initial parser logic went something like this:

# pass, rush or kick 
# gain or loss of yards 
# scoring play 
    # Who scored? off or def? 
    # TD, PA, FG, TPC, SFTY? 
# first down gained 
# punt? 
# kick? 
    # return yards? 
# penalty? 
    # def or off? 
# turnover? 
    # INT, fumble, to on downs? 
# off play makers 
# def play makers 
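To make the outline concrete, here is a minimal regex-based sketch covering only a few of the fields (play type, direction, yardage, fumble, playmakers). The function name `parse_description` and every pattern in it are assumptions based on the sample lines above; in particular, treating any two consecutive capitalized words as a player name is a heuristic, and losses, penalties, and scoring are not handled.

```python
import re

# Heuristic (assumption): a player name is two consecutive capitalized words.
NAME_RE = re.compile(r"\b[A-Z][A-Za-z']+ [A-Z][A-Za-z']+\b")
# "to the left", "to the right", or "up the middle"
DIRECTION_RE = re.compile(r"\b(?:to|up) the (left|right|middle)\b")
# "for 7 yards" / "for 1 yard"
YARDS_RE = re.compile(r"\bfor (\d+) yards?\b")


def parse_description(text: str) -> dict:
    """Extract a handful of flags and values from one play description."""
    lowered = text.lower()
    direction = DIRECTION_RE.search(lowered)
    yards = YARDS_RE.search(lowered)
    return {
        "passing": bool(re.search(r"\bpass\b", lowered)),
        "rushing": bool(re.search(r"\brush\b", lowered)),
        "punt": bool(re.search(r"\bpunts?\b", lowered)),
        "fumble": bool(re.search(r"\bfumble\b", lowered)),
        "direction": direction.group(1) if direction else None,
        # For punts this number is the punt distance, not yards gained.
        "yardage_diff": int(yards.group(1)) if yards else 0,
        # Names in order of appearance in the original (cased) text.
        "playmakers": NAME_RE.findall(text),
    }
```

On the example string above, this should give `passing = True`, `direction = 'left'`, `yardage_diff = 7`, and the three names as playmakers. Anything more involved than the sample lines would need additional rules.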

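The descriptions can get pretty hairy (multiple fumbles and recoveries with penalties, etc.), and I was wondering whether I could take advantage of some NLP modules. Chances are I'll end up spending a few days on a dumb/static, state-machine-like parser, but if anyone has suggestions on how to approach this with NLP techniques, I'd like to hear them.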
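One way a static parser might cope with the hairier descriptions is to split each description into clauses and classify every clause in order, so that a fumble or recovery after the main play becomes its own event. The sketch below is only an illustration of that idea: `RULES` and `split_events` are hypothetical names, and the rule list is derived from the sample lines above and is far from complete.

```python
import re

# Ordered (event_name, pattern) rules; the first pattern that matches wins.
RULES = [
    ("fumble", re.compile(r"^FUMBLE\b")),
    ("tackle", re.compile(r"^Tackled by\b")),
    ("no_return", re.compile(r"\bno return\b")),
    ("punt", re.compile(r"\bpunts?\b")),
    ("pass", re.compile(r"\bpass\b")),
    ("rush", re.compile(r"\brush\b")),
]


def split_events(description: str) -> list:
    """Return (event_kind, clause) pairs in the order they occur."""
    # A clause ends at a period followed by whitespace or end of string.
    clauses = [c.strip() for c in re.split(r"\.(?:\s+|$)", description)]
    events = []
    for clause in clauses:
        if not clause:
            continue
        kind = next((name for name, rx in RULES if rx.search(clause)), "unknown")
        events.append((kind, clause))
    return events
```

Each `(kind, clause)` pair could then be handed to a clause-specific extractor, which keeps every individual rule small even when a play contains several events.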

+9

Given the subject of the question, I find it amusing that SO's syntax highlighter highlighted all the people's names... – Jon

Answers

4

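I think pyparsing would be very useful here.

Your input text looks very regular (unlike real natural language), and pyparsing is great at this sort of thing. You should take a look at it.

For example, to parse the following sentences: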

Mat McBriar punts for 32 yards to NYJ14. 
Mark Sanchez rush to the right for 3 yards to the NYJ24. 

you would define a grammar for the sentence, something like this (look up the exact syntax in the docs):

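from pyparsing import Group, Literal, Optional, Word, alphanums, alphas, nums, oneOf

# Player name: two alphabetic words, e.g. "Mat McBriar"
name = Group(Word(alphas) + Word(alphas)).setResultsName('name')

# "punts" or "rush", optionally followed by "to the left" / "to the right"
action = oneOf("punts rush").setResultsName('action') + Optional(Literal("to") + Literal("the") + oneOf("left right"))

distance = Word(nums).setResultsName("distance") + Literal("yards")

# Ends with a field position such as "NYJ14" or "the NYJ24"
pattern = name + action + Literal("for") + distance + Literal("to") + Optional(Literal("the")) + Word(alphanums)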

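pyparsing will then break the string apart using this pattern. It also returns a dictionary-like result containing the named items (name, action, and distance) extracted from the sentence.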
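To show what reading those named items looks like, here is a self-contained sketch that restates a grammar for the two example sentences and converts the result into a plain dict. It assumes pyparsing is installed; `parse_play` and the extra result names `direction` and `spot` are additions made for illustration, and the grammar covers only these two sentence shapes.

```python
from pyparsing import Group, Literal, Optional, Word, alphanums, alphas, nums, oneOf

# expr("label") is shorthand for expr.setResultsName("label").
name = Group(Word(alphas) + Word(alphas))("name")
direction = Literal("to") + Literal("the") + oneOf("left right")("direction")
action = oneOf("punts rush")("action") + Optional(direction)
distance = Word(nums)("distance") + oneOf("yard yards")
spot = Literal("to") + Optional(Literal("the")) + Word(alphanums)("spot")
pattern = name + action + Literal("for") + distance + spot


def parse_play(sentence: str) -> dict:
    """Parse one sentence and return its named fields as a plain dict."""
    result = pattern.parseString(sentence)
    return {
        "name": " ".join(result["name"]),
        "action": result["action"],
        "direction": result.get("direction"),  # None when no direction is given
        "distance": int(result["distance"]),
        "spot": result["spot"],
    }


print(parse_play("Mat McBriar punts for 32 yards to NYJ14."))
print(parse_play("Mark Sanchez rush to the right for 3 yards to the NYJ24."))
```

A sentence that does not fit the grammar raises `ParseException`, which is a convenient signal for collecting the description shapes that still need rules.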

+0

I'll check it out and report back, thanks. – Jon

0

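I imagine pyparsing would work pretty well, but rule-based systems are quite brittle. So if you go beyond football, you might run into some trouble.

I think a better solution for this case would be a part-of-speech tagger plus a lexicon (read: dictionary) of player names, positions, and other sports terminology. Dump it into your favorite machine learning tool, figure out good features, and I think it would do pretty well.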
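As a rough idea of the lexicon half of that suggestion, the sketch below turns a description into a feature dict by looking tokens up in small hand-made lexicons. It uses only the standard library and does no part-of-speech tagging. The lexicon contents and the names `PLAYERS`, `TEAMS`, `TERMS`, and `extract_features` are all assumptions for illustration; real lexicons would come from rosters and a glossary.

```python
import re

# Hypothetical lexicons, seeded from the sample lines in the question.
PLAYERS = {"Mark Sanchez", "Shonn Greene", "Mike Jenkins"}
TEAMS = {"NYJ", "Dal"}
TERMS = {"pass", "rush", "punts", "fumble", "tackled", "recovered"}


def extract_features(description: str) -> dict:
    """Count lexicon hits in a description and return them as features."""
    # Alphabetic runs only, so "NYJ44" contributes the token "NYJ".
    tokens = re.findall(r"[A-Za-z]+", description)
    features = {}
    for tok in tokens:
        if tok.lower() in TERMS:
            key = f"term={tok.lower()}"
            features[key] = features.get(key, 0) + 1
    features["n_players"] = sum(1 for p in PLAYERS if p in description)
    features["n_team_refs"] = sum(1 for tok in tokens if tok in TEAMS)
    return features
```

Feature dicts of this shape can be fed to most machine learning toolkits once they are vectorized, together with labels taken from a set of hand-annotated plays.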

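NLTK is a good place to start for NLP. Unfortunately, the field isn't very developed, and there isn't a tool out there that simply solves the problem for you.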