2016-11-06 115 views
4

NLP - Information extraction in Python (spaCy)

I am trying to extract this kind of tabular information from the paragraphs below:

women_ran  men_ran  kids_ran  walked
        1        2         1       3
        2        4         3       1
        3        6         5       2

text = [
    "On Tuesday, one women ran on the street while 2 men ran and 1 child ran on the sidewalk. Also, there were 3 people walking.",
    "One person was walking yesterday, but there were 2 women running as well as 4 men and 3 kids running.",
    "The other day, there were three women running and also 6 men and 5 kids running on the sidewalk. Also, there were 2 people walking in the park.",
]

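I am using Python's spaCy as my NLP library. I am new to NLP work and would appreciate some guidance on the best way to extract this tabular information from sentences like these.

If it were simply a matter of identifying whether individuals were running or walking, I would just use sklearn to fit a classification model, but the information I need to extract is clearly more granular than that (I am trying to retrieve each subcategory and its value). Any guidance would be greatly appreciated.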

Answers

7

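You'll want to use the dependency parse for this. You can see a visualisation of your example sentences using the displaCy visualiser.

You could implement the rules you need in a few different ways, much as there are always multiple ways to write an XPath query, a DOM selector, and so on.

Something like this should work: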

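import spacy

# The original answer used spacy.load('en'); spaCy 3+ dropped that shortcut
# and needs an installed model name such as 'en_core_web_sm'.
nlp = spacy.load('en_core_web_sm')
docs = [nlp(t) for t in text]
for i, doc in enumerate(docs):
    for j, sent in enumerate(doc.sents):
        # Tokens the parser labels as nominal subjects
        subjects = [w for w in sent if w.dep_ == 'nsubj']
        for subject in subjects:
            # Numeric modifiers attached to the left of the subject
            numbers = [w for w in subject.lefts if w.dep_ == 'nummod']
            if len(numbers) == 1:
                print('document.sentence: {}.{}, subject: {}, action: {}, numbers: {}'.format(
                    i, j, subject.text, subject.head.text, numbers[0].text))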

For the text in your example you should get:

document.sentence: 0.0, subject: men, action: ran, numbers: 2 
document.sentence: 0.0, subject: child, action: ran, numbers: 1 
document.sentence: 0.1, subject: people, action: walking, numbers: 3 
document.sentence: 1.0, subject: person, action: walking, numbers: One 
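
Printed matches like these still have to be normalised and pivoted to reach the table in the question. Here is a minimal sketch of that step, assuming the matches are collected as tuples rather than printed. `NUMBER_WORDS`, `SUBJECT_COLUMNS`, `to_int` and `build_rows` are hypothetical names, not part of spaCy, and the mappings only cover words that appear in the example text:

```python
# Hypothetical post-processing; none of these names come from spaCy.
NUMBER_WORDS = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6}

# Assumed mapping from subject nouns to the "ran" columns of the desired table.
SUBJECT_COLUMNS = {
    'woman': 'women_ran', 'women': 'women_ran',
    'man': 'men_ran', 'men': 'men_ran',
    'child': 'kids_ran', 'kid': 'kids_ran', 'kids': 'kids_ran',
}
COLUMNS = ['women_ran', 'men_ran', 'kids_ran', 'walked']


def to_int(token_text):
    """Convert a numeral such as '2' or a number word such as 'One' to an int."""
    lowered = token_text.lower()
    if lowered.isdigit():
        return int(lowered)
    return NUMBER_WORDS[lowered]  # raises KeyError for words outside the small table


def build_rows(matches, n_docs):
    """Pivot (doc_index, subject, action, number_text) tuples into one row per document."""
    rows = [dict.fromkeys(COLUMNS, 0) for _ in range(n_docs)]
    for doc_index, subject, action, number_text in matches:
        if action.lower().startswith('walk'):
            column = 'walked'
        else:
            column = SUBJECT_COLUMNS.get(subject.lower())
        if column is not None:
            rows[doc_index][column] += to_int(number_text)
    return rows


# The four matches printed above, rewritten as tuples.
matches = [
    (0, 'men', 'ran', '2'),
    (0, 'child', 'ran', '1'),
    (0, 'people', 'walking', '3'),
    (1, 'person', 'walking', 'One'),
]
rows = build_rows(matches, n_docs=3)
for row in rows:
    print(row)
```

With only these four matches the rows come out partly filled, which suggests the nsubj rule on its own misses constructions such as "there were 2 women running"; additional rules over other dependency labels would be needed to complete the table.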
+0

I have never written an XPath query or a DOM selector. Could you explain? – kathystehl

+1

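@kathystehl XPath specifies a location in an XML (or HTML) document, so an XPath query is a way of finding particular elements in XML or HTML. See [wikipedia](https://en.wikipedia.org/wiki/XPath). A DOM selector picks out elements of an HTML document by their CSS 'id' or 'class' (the DOM is the tree data structure for an HTML/XML document that you work with in javascript and elsewhere), so you can filter elements by id and class. In NLP, a dependency parser turns unstructured text into a tree data structure similar to HTML, whose tokens can be queried much like DOM selector filters and XPath queries. – hobs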
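
The analogy can be made concrete with a toy example. The sketch below hand-writes a small dependency tree as XML and queries it with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The XML layout, the `token` element and its `text`/`dep` attributes are assumptions made purely for illustration: spaCy does not produce this format, and the parse is written by hand rather than taken from a real parser.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative dependency trees: each token is nested under its head.
TREE = """
<doc>
  <token text="ran" dep="ROOT">
    <token text="men" dep="nsubj">
      <token text="2" dep="nummod"/>
    </token>
  </token>
  <token text="walking" dep="ROOT">
    <token text="people" dep="nsubj">
      <token text="3" dep="nummod"/>
    </token>
  </token>
</doc>
""".strip()

root = ET.fromstring(TREE)

results = []
# Visit every token as a potential head, then use XPath-style predicates to
# select its nsubj children and each subject's nummod children.
for head in root.iter('token'):
    for subject in head.findall("token[@dep='nsubj']"):
        for number in subject.findall("token[@dep='nummod']"):
            results.append((subject.get('text'), head.get('text'), number.get('text')))

print(results)  # [('men', 'ran', '2'), ('people', 'walking', '3')]
```

The nested loops play the same role as filtering on `w.dep_ == 'nsubj'` and then on `nummod` among `subject.lefts` in the answer's code; only the tree representation differs.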