2017-10-11 75 views

Scrapy: No JSON object could be decoded

I am trying to use Scrapy to extract data from a web page. All of the data sits inside a JavaScript block:

<script type="text/javascript"> 
// Globals 
var ANUNTURI = [ { "ID": "2750801", "Data": "Azi 11:16", "Zile_piata": "146", "Zona": "Andronache", "Nr_Camere": "2", "suprafu": "65", "Pret": "62.000 EUR", 
    "Citit": "0", "Tip_teren": "-", "Etaj": "3/3", "supraft": "-", 
     "frontStradal": "-", "Etichete": "", "ArePoze": "7", "Tip_spatiu": "-" },   
and so on... ] 

;\r\n var ID_CAUTARE = 0;\r\n var CATEG = 3;\r\n  
var TRANZ = 2;\r\n  
var SORTARE = "";\r\n  
var ID_AGENT = "3012";\r\n  
var ID_LOCALITATE = \'13822\';\r\n  
var ID_JUDET = \'10\';\r\n  
var CRITERIU_FILTRU = \'\';\r\n  // judet_schimbat = "";\r\n\r\n $(\'form[name="anunturi"] input[name="sort"]\').val(SORTARE);\r\n\r\n', u"\r\n\r\n $(function(){\r\n\r\n   
var setTagValue = ' 0 ';\r\n   
var comboTitle = [];\r\n\r\n  $('#combo_etichete').mpCombo({\r\n   cls: 'mpCombo etichete',\r\n   header_default_text: 'Indiferent',\r\n   interval_from_text: ' Peste ', \r\n    
interval_to_text: ' Pana la ', \r\n   interval_between_text: ' si ', \r\n   combo_width: '162px', \r\n   menu_width: '160px',\r\n   onSelect: function() { // trigger click daca e inchisa cautarea avansata\r\n    if($('#cautare_avansata').is(':hidden')) {\r\n     $('a#filtreaza').trigger('click');\r\n    }\r\n   }\r\n\r\n  });\r\n  \r\n  $('#combo_etichete').mpCombo({'setval': setTagValue});\r\n  comboTitle.push($('#combo_etichete').mpCombo('gettitle')); \r\n\r\n  if (comboTitle.length > 0) {\r\n   $('#combo_etichete dt a').text(comboTitle.join(', '));  \r\n  }\r\n\r\n });\r\n\r\n\r\n", u'\r\nvar gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");\r\ndocument.write(unescape("%3Cscript src=\'" + gaJsHost + "google-analytics.com/ga.js\' type=\'text/javascript\'%3E%3C/script%3E"));\r\n'] 
</script> 

When I use

json.loads(response.xpath("//script[2]/text").extract()) 

it gives me this error:

No Json object could be decoded

I only need the first variable, var ANUNTURI, and everything inside it, so that I can store it in MySQL.

UPDATE

I also tried this:

var = re.compile(r"var ANUNTURI= ({.*?});", re.MULTILINE | re.DOTALL) 
json.loads(response.xpath("//script[2][contains(., 'var ANUNTURI')]/text()").re(var)) 

and the error I get is:

TypeError: expected string or buffer

Then I tried this:

json.loads("".join(response.xpath("//script[2][contains(., 'var ANUNTURI')]/text()").re(var))) 

and I also get:

NO JSON object could be decoded

So there is no way to extract this data? – Omega

Regular expressions? –

Can you even extract that JSON from the JS code? – Umair

Answers

1

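This is one possible way to extract the data, but from the code as currently posted it is hard to tell whether the variable holds JSON or JavaScript. JavaScript is a superset of JSON in subtle ways (single-quoted strings and trailing commas, for example), so a JavaScript literal will not always parse as JSON.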

data = """/ Globals 
var ANUNTURI = [ { "ID": "2750801", "Data": "Azi 11:16", "Zile_piata": "146", "Zona": "Andronache", "Nr_Camere": "2", "suprafu": "65", "Pret": "62.000 EUR", 
    "Citit": "0", "Tip_teren": "-", "Etaj": "3/3", "supraft": "-", 
     "frontStradal": "-", "Etichete": "", "ArePoze": "7", "Tip_spatiu": "-" },] 

;\r\n var ID_CAUTARE = 0;\r\n var CATEG = 3;\r\n  
var TRANZ = 2;\r\n  
var SORTARE = "";\r\n  
var ID_AGENT = "3012";\r\n  
var ID_LOCALITATE = \'13822\';\r\n  
var ID_JUDET = \'10\';\r\n  
var CRITERIU_FILTRU = \'\';\r\n  // judet_schimbat = "";\r\n\r\n $(\'form[name="anunturi"] input[name="sort"]\').val(SORTARE);\r\n\r\n', u"\r\n\r\n $(function(){\r\n\r\n   
var setTagValue = ' 0 ';\r\n   
var comboTitle = [];\r\n\r\n  $('#combo_etichete').mpCombo({\r\n   cls: 'mpCombo etichete',\r\n   header_default_text: 'Indiferent',\r\n   interval_from_text: ' Peste ', \r\n    
interval_to_text: ' Pana la ', \r\n   interval_between_text: ' si ', \r\n   combo_width: '162px', \r\n   menu_width: '160px',\r\n   onSelect: function() { // trigger click daca e inchisa cautarea avansata\r\n    if($('#cautare_avansata').is(':hidden')) {\r\n     $('a#filtreaza').trigger('click');\r\n    }\r\n   }\r\n\r\n  });\r\n  \r\n  $('#combo_etichete').mpCombo({'setval': setTagValue});\r\n  comboTitle.push($('#combo_etichete').mpCombo('gettitle')); \r\n\r\n  if (comboTitle.length > 0) {\r\n   $('#combo_etichete dt a').text(comboTitle.join(', '));  \r\n  }\r\n\r\n });\r\n\r\n\r\n", u'\r\nvar gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");\r\ndocument.write(unescape("%3Cscript src=\'" + gaJsHost + "google-analytics.com/ga.js\' type=\'text/javascript\'%3E%3C/script%3E"));\r\n' 
""" 
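import re
from json import loads
from pprint import PrettyPrinter
# The array literal is everything after the first "=" and before the first "\r\n".
lines = data.split("\r\n")
anunturi_json = lines[0].split("=", 1)[1]
# Drop the trailing ";" and the dangling "," before "]"; neither is valid JSON.
anunturi_json = re.sub(r",\s*\]\s*$", "]", anunturi_json.strip().rstrip(";").strip())
print(anunturi_json)
val = loads(anunturi_json)
pp = PrettyPrinter(indent=4)
pp.pprint(val)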
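
The regex attempts in the question appear to fail before any JSON is parsed: the pattern expects `ANUNTURI= ({` while the page has `ANUNTURI = [`, so `.re()` returns an empty list. Passing that list to `json.loads` raises the `TypeError`, and joining it yields an empty string, which cannot be decoded. Below is a minimal sketch of a regex-based alternative. It assumes the array literal on the real page is valid JSON and that no `];` sequence occurs inside a string value. `script_text`, `ANUNTURI_RE` and `extract_anunturi` are hypothetical names, with `script_text` standing in for the text of the `<script>` element.

```python
import json
import re

# Hypothetical stand-in for the <script> text (shortened copy of the posted data).
script_text = '''
// Globals
var ANUNTURI = [ { "ID": "2750801", "Zona": "Andronache", "Pret": "62.000 EUR" } ]

;
var ID_CAUTARE = 0;
'''

# Allow optional whitespace around "=", capture the [...] array (not {...}),
# and stop at the first "]" that is followed by ";".
ANUNTURI_RE = re.compile(r"var\s+ANUNTURI\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def extract_anunturi(text):
    """Return the ANUNTURI array as a list of dicts, or [] if it is absent."""
    match = ANUNTURI_RE.search(text)
    if match is None:
        return []
    return json.loads(match.group(1))


anunturi = extract_anunturi(script_text)
print(anunturi[0]["ID"])
```

In a spider, the same compiled pattern could probably be passed to the selector's `.re_first()` method, which returns a single string rather than a list. Note also that the first attempt in the question uses `/text` instead of `/text()` and passes the list returned by `.extract()` straight to `json.loads`.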
Thank you, it works... the only extra thing I had to do was strip the ";" from the end of the JSON – Omega

Can you tell me how to extract only the "ID" from that "var ANUNTURI"? – Omega

@Omega `val.get('ID')` –
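
Since `val` in the answer is a list of dicts (one per listing), `.get('ID')` would apply to each element rather than to the list itself. A minimal sketch, using a shortened stand-in for the parsed data; `first_id` and `all_ids` are hypothetical names:

```python
# Shortened stand-in for the list returned by loads(anunturi_json).
val = [{"ID": "2750801", "Zona": "Andronache", "Pret": "62.000 EUR"}]

# ID of the first listing.
first_id = val[0]["ID"]

# IDs of all listings; .get() returns None for a record with no "ID" key.
all_ids = [item.get("ID") for item in val]

print(first_id)
print(all_ids)
```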
