2014-09-02 78 views
1

Scrapy date capture regex

I have a working regular expression that parses the following kind of date:

(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000)))) 

It parses the following string:

The owners of this address received a permit on Wednesday, July 31, 2014 

The output in the scrapy item is:

[u'', u'', u'', u'July', u'31', u'2014', u'', u'', u'', u''] 

I would like the scrapy item to be:

[u'July 31, 2014'] 

Here is my scrapy code:

date_scrape = response.css('#ctl00_MasterDiv > div.Divwidth100 td.content_panel_middle > div > p:contains("The owners of this address") > b ::text') 

permit_date = date_scrape.re(r'(?:(September|April|June|November) +(0?[1-9]|[12]\d|30), *((?:19|20)\d\d))|(?:(January|March|May|July|August|October|December) +(0?[1-9]|[12]\d|3[01]), *((?:19|20)\d\d))|(?:February +(?:(?:(0?[1-9]|1\d|2[0-8]), *((?:19|20)\d\d))|(?:(29), *((?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|72|76|80|84|88|92|96))|2000))))') 

Any ideas on how to solve this?


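Note - I have already tried adding ^ and $ to the expression, but I can't seem to figure it out. I tested several possible uses of ^ and $ in regex101 and they all failed. – dfriestedt 2014-09-02 13:29:00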
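For context on why anchors do not help here: `^` and `$` only restrict where a match may start and end, they do not change what is returned. Python's `re.findall` returns the capture groups instead of the whole match whenever the pattern contains groups, and the list shown in the question suggests that Scrapy's `.re()` behaves the same way (an assumption drawn from that output). A minimal sketch, using a deliberately simplified stand-in pattern rather than the one from the question:

```python
import re

s = 'The owners of this address received a permit on Wednesday, July 31, 2014'

# Simplified, hypothetical stand-in for the question's pattern: three capture groups.
pattern = r'(July|August) +(\d{1,2}), *(\d{4})'

# With capture groups, findall returns one tuple of groups per match.
groups = re.findall(pattern, s)
print(groups)

# Anchoring with ^...$ demands that the entire string is a date, so nothing matches.
anchored = re.findall(r'^' + pattern + r'$', s)
print(anchored)

# The full matched text is always available as group 0 of a match object.
whole = re.search(pattern, s).group(0)
print(whole)
```

So the pattern itself was matching correctly; only the way the result is collected needs to change.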

Answers

1
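import re 

# Sample text containing the date to extract
s = 'The owners of this address received a permit on Wednesday, July 31, 2014' 

# A single capture group, so findall returns each match as one string
words = re.findall(r'(\w+ \d+, \d+)', s) 
print(words) 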

Result:

['July 31, 2014'] 
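Note that the short pattern accepts any `word number, number` sequence, so it would also accept something like `June 31, 2014`. If the validation from the original expression matters, one possible approach is to make every inner group non-capturing and wrap the whole alternation in a single capturing group. The sketch below rebuilds the question's pattern that way; the helper names (`YEAR`, `LEAP_YEAR`, `DAY_28`, `DAY_30`, `DAY_31`, `DATE_RE`) are made up for illustration.

```python
import re

# Building blocks copied from the question's pattern, all non-capturing.
YEAR = r'(?:19|20)\d\d'
LEAP_YEAR = (r'(?:(?:19|20)(?:04|08|12|16|20|24|28|32|36|40|44|48|52|56|60|64|68|'
             r'72|76|80|84|88|92|96)|2000)')
DAY_28 = r'(?:0?[1-9]|1\d|2[0-8])'
DAY_30 = r'(?:0?[1-9]|[12]\d|30)'
DAY_31 = r'(?:0?[1-9]|[12]\d|3[01])'

# One outer capturing group around the three alternatives.
DATE_RE = re.compile(
    r'((?:September|April|June|November) +{d30}, *{year}'
    r'|(?:January|March|May|July|August|October|December) +{d31}, *{year}'
    r'|February +(?:{d28}, *{year}|29, *{leap}))'.format(
        d28=DAY_28, d30=DAY_30, d31=DAY_31, year=YEAR, leap=LEAP_YEAR)
)

s = 'The owners of this address received a permit on Wednesday, July 31, 2014'

# With exactly one capture group, findall returns plain strings.
print(DATE_RE.findall(s))                 # ['July 31, 2014']
print(DATE_RE.findall('June 31, 2014'))   # [] (June has only 30 days)
```

Assuming Scrapy's `.re()` handles groups the way the output in the question suggests, passing `DATE_RE` to `date_scrape.re(...)` should likewise give one string per date.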

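I definitely wasted a lot of time trying to figure this out. I saw this solution in other posts and just didn't try it. I thought it was too "simple". Thanks! – dfriestedt 2014-09-02 13:45:27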


My pleasure! You're welcome. – Kasramvd 2014-09-02 13:46:12

1

If you don't want to dive into the wonderful world of regular expressions, here is an alternative solution.

Use dateutil.parser.parse() with fuzzy=True. A demo from the scrapy shell:

$ scrapy shell index.html 
>>> text = response.xpath('//body/b/text()').extract()[0] 
>>> text 
u'The owners of this address received a permit on Wednesday, July 31, 2014' 

>>> from dateutil.parser import parse 
>>> parse(text, fuzzy=True) 
datetime.datetime(2014, 7, 31, 0, 0) 

where index.html contains the HTML test data:

<body> 
    <b>The owners of this address received a permit on Wednesday, July 31, 2014</b> 
</body>
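If installing dateutil is not an option, a similar result can be sketched with the standard library alone: pull out the candidate text with a loose regex and let `datetime.strptime` do the calendar validation. The function name `extract_date` is hypothetical, and the sketch assumes English month names and exactly the `Month D, YYYY` layout shown above.

```python
import re
from datetime import datetime

s = 'The owners of this address received a permit on Wednesday, July 31, 2014'

def extract_date(text):
    """Return the first 'Month D, YYYY' date in text as a datetime, or None."""
    m = re.search(r'[A-Za-z]+ \d{1,2}, \d{4}', text)
    if m is None:
        return None
    try:
        # %B expects a full month name; this is English under the default C locale.
        return datetime.strptime(m.group(0), '%B %d, %Y')
    except ValueError:
        # Not a month name, or not a real calendar date (e.g. 'June 31, 2014').
        return None

d = extract_date(s)
print(d)                         # 2014-07-31 00:00:00
print(d.strftime('%B %d, %Y'))   # July 31, 2014
```

Because `strptime` rejects impossible dates, this covers the day-of-month and leap-year checks that the long regular expression encodes by hand.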