2016-02-12 78 views
3

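Scrapy: selecting tags that contain a non-breaking space with XPath

In my Scrapy spider, I want to select only the <p> elements that have text content: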

item['Description'] = response.xpath('//*[@id="textepresentation"]//p[string(.)]').extract() 

This works fine, but unfortunately it also returns empty <p> elements that contain only a non-breaking space:

u'<p>\xa0</p>', 

How can I avoid selecting <p> elements that contain only a non-breaking space with XPath?

Answers

2

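You can use XPath's normalize-space() string function in a couple of predicates:

  • [normalize-space()] selects elements whose string value is non-empty once leading and trailing whitespace has been stripped
  • [not(contains(normalize-space(), "\u00a0"))] is needed because NO-BREAK SPACE is not stripped by normalize-space() (see this other answer, where I checked which characters are stripped; you may need to add other characters to the test). Note that this predicate also rejects a <p> that contains a NO-BREAK SPACE next to real text.

Sample: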
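To see why the second predicate is needed, here is a minimal plain-Python sketch of what normalize-space() does. It assumes XPath 1.0 semantics, where only space, tab, carriage return and line feed count as whitespace. The helper name xpath_normalize_space is made up for this illustration and is not part of Scrapy or lxml.

```python
import re

# Characters XPath 1.0 treats as whitespace (assumption: XPath 1.0 rules).
XML_WHITESPACE = " \t\r\n"


def xpath_normalize_space(text):
    """Approximate XPath 1.0 normalize-space() (hypothetical helper)."""
    # Strip leading/trailing XML whitespace, then collapse inner runs.
    stripped = text.strip(XML_WHITESPACE)
    return re.sub("[ \t\r\n]+", " ", stripped)


# Ordinary whitespace is trimmed and collapsed.
print(repr(xpath_normalize_space("  some \n text  ")))

# A whitespace-only string becomes empty, so [normalize-space()] is false.
print(bool(xpath_normalize_space(" \n ")))

# NO-BREAK SPACE survives, so the string is non-empty and the <p> still matches.
print(bool(xpath_normalize_space(" \xa0 ")))
```

The last line prints True, which is why a <p> holding only &nbsp; gets through [normalize-space()] on its own.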

>>> import scrapy 
>>> selector = scrapy.Selector(text=u''' 
... <html> 
...  <p>&nbsp;</p> 
...  <p>something</p> 
...  <p> </p> 
...  <p><a href="http://example.com">some link</a></p> 
... </html> 
... ''') 
>>> selector.xpath(u''' 
...  //p[normalize-space()] 
...  [not(contains(normalize-space(), "\u00a0"))] 
... ''').extract() 
[u'<p>something</p>', u'<p><a href="http://example.com">some link</a></p>'] 
>>> 

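Edit:

Following @Kimmy's answer, here is a single-predicate alternative that also handles other whitespace characters:

  • take the whitespace characters that normalize-space() does not handle
  • replace each of them with a regular space (' ') using an XPath translate() call
  • normalize the spaces, which trims the leading and trailing ones

Here it goes: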

>>> chars = ''' 
... #CHARACTER TABULATION 
... #LINE FEED 
... #LINE TABULATION 
... #FORM FEED 
... #CARRIAGE RETURN 
... #SPACE 
... #NEXT LINE 
... NO-BREAK SPACE 
... OGHAM SPACE MARK 
... MONGOLIAN VOWEL SEPARATOR 
... EN QUAD 
... EM QUAD 
... EN SPACE 
... EM SPACE 
... THREE-PER-EM SPACE 
... FOUR-PER-EM SPACE 
... SIX-PER-EM SPACE 
... FIGURE SPACE 
... PUNCTUATION SPACE 
... THIN SPACE 
... HAIR SPACE 
... ZERO WIDTH SPACE 
... ZERO WIDTH NON-JOINER 
... ZERO WIDTH JOINER 
... LINE SEPARATOR 
... PARAGRAPH SEPARATOR 
... NARROW NO-BREAK SPACE 
... MEDIUM MATHEMATICAL SPACE 
... WORD JOINER 
... IDEOGRAPHIC SPACE 
... ZERO WIDTH NO-BREAK SPACE 
... ''' 
>>> import unicodedata 
>>> wsp = [unicodedata.lookup(c) 
...  for c in chars.splitlines() 
...  if c.strip() and not c.startswith('#')] 
>>> 
>>> # somehow NEXT LINE (U+0085) does not work with unicodedata 
... wsp.append(u'\u0085') 
>>> 
>>> selector.xpath(u''' 
...  //p[normalize-space(translate(., "%(in)s", "%(out)s"))] 
...  ''' % {'in': ''.join(wsp), 
...   'out': ' '*len(wsp) 
...  }).extract() 
[u'<p>something</p>', u'<p><a href="http://example.com">some link</a></p>'] 
>>> 
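For Python 3, the same expression can be built from code points rather than character names, which sidesteps the NEXT LINE lookup issue. This is only a sketch: the names EXTRA_WHITESPACE and build_non_blank_xpath are invented here, and the code point list simply mirrors the uncommented characters in the session above plus U+0085.

```python
# Code points of the characters listed above that normalize-space() leaves
# alone: U+0085, U+00A0, U+1680, U+180E, U+2000..U+200D, then the rest.
EXTRA_WHITESPACE = (
    [0x0085, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200E))
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x2060, 0x3000, 0xFEFF]
)


def build_non_blank_xpath(tag="p"):
    """Return an XPath selecting <tag> elements with visible text (hypothetical helper)."""
    chars = "".join(chr(cp) for cp in EXTRA_WHITESPACE)
    # translate() maps every listed character to a plain space, which
    # normalize-space() can then trim and collapse.
    return '//%s[normalize-space(translate(., "%s", "%s"))]' % (
        tag,
        chars,
        " " * len(chars),
    )


expression = build_non_blank_xpath()
print(len(EXTRA_WHITESPACE))
print("\xa0" in expression)
```

The resulting string can be passed to selector.xpath() in the same way as the expression in the session above.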
+0

Thank you for this valuable, detailed explanation! It works as expected. Thanks! – jacquesseite

0
//p[translate(string(.),"\xa0","")] 
+0

Nice try, but item['Description'] = response.xpath('//*[@id="textepresentation"]//p[translate(string(.),'\xa0','')]').extract() fails with SyntaxError: unexpected character after line continuation character – jacquesseite

+0

@jacquesseite That is a string delimiter conflict. Always use double quotes inside the XPath expression, i.e. translate(string(.),"\xa0","") – har07
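The delimiter conflict can be reproduced without Scrapy. In the sketch below, the working expression keeps single quotes for the Python string and double quotes for the XPath literals. The broken form is held as raw source text and compiled, so the failure can be observed safely. The variable names are illustrative only.

```python
# Working form: single quotes delimit the Python string, double quotes
# delimit the XPath string literals inside it.
xpath = '//*[@id="textepresentation"]//p[translate(string(.), "\xa0", "")]'

# Python resolves the \xa0 escape before XPath ever sees the expression,
# so the expression contains the actual NO-BREAK SPACE character.
contains_nbsp = "\xa0" in xpath

# Broken form: the inner single quote ends the Python string early, leaving
# a bare backslash that Python reads as a line continuation.
broken_source = r"response.xpath('//p[translate(string(.),'\xa0','')]')"

try:
    compile(broken_source, "<example>", "eval")
    raised_syntax_error = False
except SyntaxError:
    raised_syntax_error = True

print(contains_nbsp)
print(raised_syntax_error)
```

Both lines print True: the first expression is valid Python that carries the real character, while the second never gets past the parser.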

+0

Edited to use double quotes. –