2014-10-30 127 views

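Python regex extraction with lookahead

I have been trying to scrape a web page (I have permission to scrape it) to extract transit node names and location coordinate strings. The names and locations sit in a JavaScript CDATA block. See here: http://pastebin.com/6Vtup2dE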

Using this regular expression in Python

re.findall("(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?(?=new\ssimpleInfo\(\\\'))(.+?(?=\\)))", test_str) 

I get

[(u'55.86527,-4.2517133', 
    u"new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information for megabus services.'"), 
(u'55.86068,-4.257852', u"new simpleInfo('Central Train Station',''"), 
(u'51.492653,-0.14765126', 
    u"new simpleInfo('Victoria, Buckingham Palace Rd, Stop 10','London Victoria, Buckingham Palace Road - at the corner of Elizabeth Bridge and diagonally across from the main entrance to Victoria Coach Station. megabus Oxford Tube services leave from Stop 10.'"), 
(u'51.492596,-0.14985295', 
    u"new simpleInfo('Victoria Coach Station','London Victoria Coach Station is situated on Buckingham Palace Rd at the junction with Elizabeth St. megabus services depart from Stands 15-20, located in the departures area of North West terminal '"), 
(u'51.503437,-0.112076715', 
    u"new simpleInfo('Waterloo Train Station','London Waterloo - London Waterloo Station is located on Station Approach, SE1 London - just behind the London Eye. The station is a terminus for trains serving the south-west of England and Eurostar services. Waterloo is the largest station in the UK by area. Its spacious, curved concourse is lined with shops and all the modern amenities.\\n'"), 
(u'51.53062,-0.12585254', 
    u"new simpleInfo('St Pancras International Train Station','For East Midlands Trains services only. London St Pancras International, London - St Pancras Station is located on Pancras Rd NW1 between the national Library and Kings Cross station. The station is the terminus for trains serving East Midlands and South Yorkshire. It is also the new London terminal for the Eurostar services to the continent. Kings Cross St Pancras tube station provides links via the London underground to other London destinations.'"), 
(u'51.52678,-0.13297649', 
    u"new simpleInfo('Euston Train Station','For Virgin Trains Services Only. London Euston - The station is the main terminal for trains to London from the West Midlands and North West England. It is connected to Euston Tube Station for easy access to the London Underground network'"), 
(u'51.52953,-0.12506014', 
    u"new simpleInfo('St Pancras, Coach Road','In some instances megabusplus services which operate as coach only will pick up from Coach Road, outside London St Pancras.'"), 
(u'55.86527,-4.2517133', 
    u"new simpleInfo('Buchanan Bus Station','Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information for megabus services.'"), 
(u'55.86068,-4.257852', u"new simpleInfo('Central Train Station',''")] 

But what I want to get is just:

[(u'55.86527,-4.2517133','Buchanan Bus Station'), 
    (u'55.86068,-4.257852', 'Central Train Station'), 
    (u'51.492653,-0.14765126','Victoria, Buckingham Palace Rd, Stop 10'), 
    (u'51.492596,-0.14985295','Victoria Coach Station')....etc] 

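I have written a lot of regular expressions in my time, but I have never had a problem like this. I am trying (believe it or not) to skip past everything up to and including "new simpleInfo('" and then capture the string up to the next "'", but I cannot work it out. Help!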


Does it need to be a single regular expression? It looks like you could easily parse it as JSON after a few search/replace passes :) – Wolph 2014-10-30 11:19:43
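The multi-step idea from the comment above can be sketched roughly as follows. This is only an illustration, not the commenter's code: it uses two small regexes per marker rather than a JSON conversion, the `sample` string is a hypothetical, shortened stand-in for the pastebin content, and `extract_markers` is a made-up helper name. It also assumes every marker begins with `addMarker(`, as in the snippet quoted in this thread.

```python
import re

# Hypothetical, shortened stand-in for the scraped CDATA block.
sample = (
    "var map = new Microsoft.Maps.Map(document.getElementById('confirm1_Map1'), {zoom: 13});\n"
    "EmperorBing.addMarker(map, new Microsoft.Maps.Pushpin("
    "new Microsoft.Maps.Location(55.86527,-4.2517133), {width:42}),"
    "new simpleInfo(\\'Buchanan Bus Station\\',\\'Glasgow, Buchanan Bus Station\\'));\n"
    "EmperorBing.addMarker(map, new Microsoft.Maps.Pushpin("
    "new Microsoft.Maps.Location(51.492653,-0.14765126), {width:42}),"
    "new simpleInfo(\\'Victoria, Buckingham Palace Rd, Stop 10\\',\\'London Victoria\\'));"
)

# Coordinates: everything inside Location(...).
LOCATION = re.compile(r"Microsoft\.Maps\.Location\(([^)]+)\)")
# Name: first argument of simpleInfo, with or without a backslash before the quote.
NAME = re.compile(r"new\ssimpleInfo\(\\?'([^'\\]*)")


def extract_markers(text):
    """Return (coordinates, name) pairs, one per addMarker(...) call."""
    results = []
    # Assumption: each marker starts with "addMarker("; text before the first is skipped.
    for chunk in text.split("addMarker(")[1:]:
        loc = LOCATION.search(chunk)
        name = NAME.search(chunk)
        if loc and name:
            results.append((loc.group(1), name.group(1)))
    return results


markers = extract_markers(sample)
print(markers)
```

Splitting first keeps each pattern short and makes it harder for a lazy `.+?` to run across marker boundaries, at the cost of relying on the `addMarker(` delimiter.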

Answers

1

Try this:

re.findall(r"(?:\(new\sMicrosoft\.Maps\.Location\(([^)]+)\).+?new\ssimpleInfo\(\\?'(.+?)\\?')", test_str) 

This regex finds all occurrences, whether the name appears as \'Buchanan Bus Station\' or as 'Buchanan Bus Station'.

Here is a demo.
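To make the optional-backslash behaviour concrete, here is a small sketch that runs essentially the same pattern (without the redundant outer group) against a hypothetical two-marker sample. The `sample` string is invented for illustration and only loosely modeled on the pastebin content: the first marker uses escaped quotes (`\'`), the second plain quotes.

```python
import re

# Hypothetical sample: first marker has \' quotes, second has plain ' quotes.
sample = (
    "EmperorBing.addMarker(map, new Microsoft.Maps.Pushpin("
    "new Microsoft.Maps.Location(55.86527,-4.2517133), { icon:\\'/images/mapmarker.gif\\' }),"
    "new simpleInfo(\\'Buchanan Bus Station\\',\\'Glasgow, Buchanan Bus Station\\'));\n"
    "EmperorBing.addMarker(map, new Microsoft.Maps.Pushpin("
    "new Microsoft.Maps.Location(55.86068,-4.257852), { icon:'/images/mapmarker.gif' }),"
    "new simpleInfo('Central Train Station',''));"
)

pattern = re.compile(
    r"\(new\sMicrosoft\.Maps\.Location\(([^)]+)\)"  # group 1: "lat,lon"
    r".+?new\ssimpleInfo\(\\?'"                      # skip ahead to the opening quote
    r"(.+?)\\?'"                                     # group 2: name, up to the next quote
)

pairs = pattern.findall(sample)
print(pairs)
```

With this sample, `pairs` holds one `(coordinates, name)` tuple per marker, and neither name carries a stray backslash, because `\\?` consumes it when present.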

0
(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?).*?new\ssimpleInfo\(\\'([^'\\]+) 

Try this. It should give you what you want.

import re 
p = re.compile(ur'(?:\(new\sMicrosoft\.Maps\.Location\()(.+?(?=\)\,))(?:.+?).*?new\ssimpleInfo\(\\\'([^\'\\]+)') 
test_str = u"jQuery(function(){ jQuery(\'#JourneyPlanner_txtOutboundDate\').datepicker({dateFormat: \'dd/mm/yy\', firstDay: 1, beforeShowDay: function(dte){ return [((dte >= new Date(2014,9,29) && dte <= new Date(2015,0,4)) || false)]; }, minDate: new Date(2014,9,29), maxDate: new Date(2015,0,4),buttonImage: \"/images/icon_calendar.gif\", showOn: \"both\", buttonImageOnly: true}); });\njQuery(function(){ jQuery(\'#JourneyPlanner_txtReturnDate\').datepicker({dateFormat: \'dd/mm/yy\', firstDay: 1,buttonImage: \"/images/icon_calendar.gif\", showOn: \"both\", buttonImageOnly: true}); });\nEmperorBing.addCallback(function(){ var map = new Microsoft.Maps.Map(document.getElementById(\'confirm1_Map1\'), {credentials:\'Aodb7Wd7D9Kq5gKNryfW6V29yf8aw2Sbu-tXAlkH7OLJtm8zG2bQzzhDKK5zM9FE\',height: 320,width: 299, zoom: 13, mapTypeId: Microsoft.Maps.MapTypeId.auto, enableClickableLogo: false , enableSearchLogo: false , showDashboard: true, showCopyright: true, showScalebar: true, showMapTypeSelector: true});\r\nEmperorBing.addMarker(map, new Microsoft.Maps.Pushpin(new Microsoft.Maps.Location(55.86527,-4.2517133), { undefined: undefined, icon:\'/images/mapmarker.gif\', width:42, height:42, anchor: new Microsoft.Maps.Point(21,21)}),new simpleInfo(\'Buchanan Bus Station\',\'Glasgow, Buchanan Bus Station - entrance to station is situated on Killermont Street. It is a short walk from George Square and within easy reach of Glasgow?s main shopping and leisure areas. Please check the bus station passenger displays for stance information " 

re.findall(p, test_str) 

See the demo:

http://regex101.com/r/dP9rO4/9


This works on regex101, but the code returns an empty list in Python 2.74 – Handloomweaver 2014-10-30 12:36:00
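One plausible explanation for the empty list, offered as an assumption rather than something confirmed in this thread: in the answer's code, `test_str` is a normal (non-raw) string literal, so each `\'` collapses to a plain `'` and the string contains no backslashes, while the pattern insists on a literal backslash before the quote. Text pasted into regex101 keeps its backslashes, which would explain why the pattern matches there. The sketch below uses a shortened, hypothetical `test_str` and compares a pattern with a mandatory backslash to one where it is optional.

```python
import re

# In a non-raw literal, \' is just an escaped quote: no backslash survives.
test_str = (
    "new Microsoft.Maps.Pushpin(new Microsoft.Maps.Location(55.86527,-4.2517133), {}),"
    "new simpleInfo(\'Buchanan Bus Station\',\'Glasgow\')"
)
assert "\\" not in test_str

# Mandatory backslash before the quote (as in the answer above).
strict = re.compile(
    r"\(new\sMicrosoft\.Maps\.Location\((.+?)(?=\),).*?new\ssimpleInfo\(\\'([^'\\]+)"
)
# Same pattern, but the backslash is optional.
lenient = re.compile(
    r"\(new\sMicrosoft\.Maps\.Location\((.+?)(?=\),).*?new\ssimpleInfo\(\\?'([^'\\]+)"
)

strict_matches = strict.findall(test_str)
lenient_matches = lenient.findall(test_str)
print(strict_matches)
print(lenient_matches)
```

If the scraped page really does contain `\'` sequences, either pattern should match it; making the backslash optional simply covers both cases.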