How to find dates in a sentence using NLP or RegEx in Python

Can anyone suggest a way to find and parse dates in a sentence, in any format ("Aug06", "Aug2006", "August 2 2008", "19th August 2006", "08-06", "01-08-06")?

I came across this question, but it is in Perl: Extract inconsistently formatted date from string (date parsing, NLP)

Any suggestions would be helpful.


2010-09-28 Software Enthusiastic

from dateutil import parser

texts = ["Aug06", "Aug2006", "August 2 2008", "19th August 2006", "08-06", "01-08-06"]
for text in texts:
    # parse() guesses the format of each string and returns a datetime
    print(text, parser.parse(text))


Output when run on 2010-09-28 (fields missing from the input are filled in from the current date):

Aug06 2010-08-06 00:00:00
Aug2006 2006-08-28 00:00:00
August 2 2008 2008-08-02 00:00:00
19th August 2006 2006-08-19 00:00:00
08-06 2010-08-06 00:00:00
01-08-06 2006-01-08 00:00:00

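If you want to find these dates in a longer text, try searching for groups of numbers and month names and feeding them to this parser. If the text does not look like a date, the parser throws an exception.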

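import re

months = ['January', 'February', ...]  # full month names (list elided here)
months.extend([mon[:3] for mon in months])  # add the three-letter abbreviations

# search for numeric dates (runs of digits, spaces and hyphens):
re.findall(r'[\d \-]+', sentence)

# search for month names:
for word in sentence.split():
    if word in months:
        ...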
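For anyone who wants to try the idea above, here is a minimal runnable sketch using only the standard library. The complete month list, the helper name `find_candidates`, the punctuation stripping and the sample sentence are my own assumptions for illustration, not part of the answer. Note that a plain word such as "may" would also be reported as a month.

```python
import re

# Assumption: English month names written out in full.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
# Full names plus three-letter abbreviations, lower-cased for comparison.
MONTH_WORDS = {m.lower() for m in MONTHS} | {m[:3].lower() for m in MONTHS}


def find_candidates(sentence):
    """Return (numeric_groups, month_words) found in sentence."""
    # A digit followed by any run of digits, spaces and hyphens.
    numeric = [g.strip() for g in re.findall(r"\d[\d \-]*", sentence)]
    # Split on whitespace and drop trailing punctuation before the lookup.
    words = [w.strip(".,;:!?") for w in sentence.split()]
    month_hits = [w for w in words if w.lower() in MONTH_WORDS]
    return numeric, month_hits


sentence = "The invoice from 01-08-06 was paid in August, not in Sep."
numeric, month_hits = find_candidates(sentence)
print(numeric)     # candidate numeric dates
print(month_hits)  # candidate month words
```

Each candidate would then be handed to a date parser, and anything that raises an exception discarded.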


2010-09-28 05:47:04 eumiro

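This is not a general solution. – 2010-09-28 06:01:07

One would hope there was an easy way of turning off the "fill in the blanks with current values" caper ... "Aug2006" -> "2006-08-28" merely because today is the 28th of the month is a bit of a boggler. – 2010-09-28 06:01:45

@anand: But he has answered part of the question: how to parse the dates. – 2010-09-28 06:17:35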
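On the gap-filling complaint: one way to sidestep it is to try a list of explicit formats with the standard library's `datetime.strptime` instead of dateutil. This is a different technique from the answer above, and only a sketch. The format list, the function name `parse_explicit`, and the readings of "08-06" as month-year and "01-08-06" as month-day-year are assumptions. With `strptime` a missing day becomes 1 rather than today's day, and the month names are matched according to the current locale.

```python
import re
from datetime import datetime

# Assumed interpretations of the formats from the question; adjust to your data.
FORMATS = [
    "%b%y",      # Aug06            -> month + 2-digit year
    "%b%Y",      # Aug2006          -> month + 4-digit year
    "%B %d %Y",  # August 2 2008
    "%d %B %Y",  # 19 August 2006   (ordinal suffix removed first)
    "%m-%y",     # 08-06            -> month-year (assumption)
    "%m-%d-%y",  # 01-08-06         -> month-day-year (assumption)
]


def parse_explicit(text):
    """Return a datetime for the first format that fits, or None."""
    # Drop ordinal suffixes that directly follow a digit: 19th -> 19.
    cleaned = re.sub(r"(?<=\d)(?:st|nd|rd|th)\b", "", text.strip())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
    return None


for text in ["Aug06", "Aug2006", "August 2 2008",
             "19th August 2006", "08-06", "01-08-06"]:
    print(text, parse_explicit(text))
```

The trade-off is that every accepted format has to be listed by hand, so this is no more general than the dateutil approach, only more predictable.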

This finds all the dates in your example sentence:

import re

# subject is the text to search
for match in re.finditer(
    r"""(?ix)                 # case-insensitive, verbose regex
    \b                        # match a word boundary
    (?:                       # match the following three times:
      (?:                     # either
        \d+                   # a number,
        (?:\.|st|nd|rd|th)*   # followed by a dot, st, nd, rd, or th (optional)
        |                     # or a month name
        (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)
      )
      [\s./-]*                # followed by a date separator or whitespace (optional)
    ){3}                      # do this three times
    \b                        # and end at a word boundary.""",
    subject):
    # match.start() is the start of the match,
    # match.end() is its end (exclusive),
    # match.group() is the matched text
    print(match.start(), match.end(), match.group())

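This is definitely not perfect. It is liable to miss some dates (especially if they are not in English: "21. Mai 2006" would fail, as would "4ème décembre 1999"), and it matches nonsense like "August Augst Aug". But since nearly everything is optional in your examples, there is not much you can do at the regex level.

The next step would be to feed all the matches into a parser and see whether it can parse them into a sensible date.

The regex cannot interpret context properly. Imagine a (silly) text like "You'll find it in box 21. August 3rd will be the shipping date." It will match "21. August 3rd", which of course cannot be parsed.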
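To make the "match, then try to parse" step concrete, here is a rough sketch. The names `date_candidates`, `keep_parseable` and `parse_dmy` are hypothetical, and `parse_dmy` is only a tiny stand-in parser (day, month name, year) built on the standard library so the example runs without extra packages. Any parser that raises ValueError on failure could be passed in instead.

```python
import re
from datetime import datetime

# Same pattern as in the answer: three number-or-month tokens in a row.
DATE_LIKE = re.compile(r"""(?ix)
    \b
    (?:
      (?:
        \d+ (?:\.|st|nd|rd|th)*      # a number with an optional suffix
        |                            # or a month name
        (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*
      )
      [\s./-]*                       # optional separator
    ){3}
    \b""")


def date_candidates(text):
    """Return every date-like substring, with surrounding whitespace removed."""
    return [m.group().strip() for m in DATE_LIKE.finditer(text)]


def keep_parseable(candidates, parse):
    """Keep (candidate, parsed) pairs for which parse() does not raise ValueError."""
    kept = []
    for cand in candidates:
        try:
            kept.append((cand, parse(cand)))
        except ValueError:
            pass  # looked like a date to the regex, but the parser disagrees
    return kept


def parse_dmy(text):
    """Hypothetical stand-in parser: '19th August 2006' style only."""
    cleaned = re.sub(r"(?<=\d)(?:st|nd|rd|th)\b", "", text)
    return datetime.strptime(cleaned, "%d %B %Y")


text = ("You'll find it in box 21. August 3rd will be the shipping date, "
        "not 19th August 2006.")
candidates = date_candidates(text)
print(candidates)
print(keep_parseable(candidates, parse_dmy))
```

With this sample text the regex picks up both "21. August 3rd" and "19th August 2006", and only the second survives the parsing step. That is the filtering described above, although it still cannot recover the date "August 3rd" that the first match swallowed.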


2010-09-28 06:50:14
