Scrapy: selecting tags with a non-breaking space using XPath
In my Scrapy spider, I want to select only <p> tags that have text content:
item['Description'] = response.xpath('//*[@id="textepresentation"]//p[string(.)]').extract()
It works fine, but unfortunately I also get empty <p> tags that contain only a non-breaking space:
u'<p>\xa0</p>',
How can I avoid selecting <p> tags that contain only a non-breaking space with XPath?
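You can use XPath's normalize-space() string function with a couple of predicates:
[normalize-space()] keeps only the elements whose string value is non-empty once leading and trailing whitespace is stripped.
[not(contains(normalize-space(), "\u00a0"))] is needed because NO-BREAK SPACE is not stripped by normalize-space() (see this other answer where I checked which ones work; you may need to add other characters to the test). Note that this predicate rejects any <p> whose text contains a NO-BREAK SPACE anywhere, not only those made up of nothing else.
Sample: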
>>> import scrapy
>>> selector = scrapy.Selector(text=u'''
... <html>
... <p>&nbsp;</p>
... <p>something</p>
... <p> </p>
... <p><a href="http://example.com">some link</a></p>
... </html>
... ''')
>>> selector.xpath(u'''
... //p[normalize-space()]
... [not(contains(normalize-space(), "\u00a0"))]
... ''').extract()
[u'<p>something</p>', u'<p><a href="http://example.com">some link</a></p>']
>>>
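To see why the second predicate is needed, here is a rough pure-Python emulation of XPath 1.0 normalize-space(). It is only a sketch for illustration (the helper name xpath_normalize_space is hypothetical, and this is not how lxml or Scrapy implement it), written for Python 3. XPath 1.0 treats only space, tab, carriage return and line feed as whitespace, so a NO-BREAK SPACE survives normalization and the string stays non-empty.

```python
import re

# XPath 1.0 whitespace is limited to these four characters.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def xpath_normalize_space(text):
    """Approximate XPath 1.0 normalize-space(): collapse runs of
    XPath whitespace to one space, then trim the ends."""
    return _XPATH_WS.sub(" ", text).strip(" ")


for sample in ["\u00a0", "   \n ", "  some   text "]:
    result = xpath_normalize_space(sample)
    # A non-empty result is what makes the [normalize-space()] predicate true.
    print(repr(sample), "->", repr(result), "| predicate true:", bool(result))
```

By contrast, Python's own str.strip() does treat U+00A0 as whitespace, which is why the two can disagree on what counts as an "empty" paragraph.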
Edit:
Following @Kimmy's answer, here is a single-predicate alternative that also covers other whitespace characters, using normalize-space() and translate() calls:
>>> chars = '''
... #CHARACTER TABULATION
... #LINE FEED
... #LINE TABULATION
... #FORM FEED
... #CARRIAGE RETURN
... #SPACE
... #NEXT LINE
... NO-BREAK SPACE
... OGHAM SPACE MARK
... MONGOLIAN VOWEL SEPARATOR
... EN QUAD
... EM QUAD
... EN SPACE
... EM SPACE
... THREE-PER-EM SPACE
... FOUR-PER-EM SPACE
... SIX-PER-EM SPACE
... FIGURE SPACE
... PUNCTUATION SPACE
... THIN SPACE
... HAIR SPACE
... ZERO WIDTH SPACE
... ZERO WIDTH NON-JOINER
... ZERO WIDTH JOINER
... LINE SEPARATOR
... PARAGRAPH SEPARATOR
... NARROW NO-BREAK SPACE
... MEDIUM MATHEMATICAL SPACE
... WORD JOINER
... IDEOGRAPHIC SPACE
... ZERO WIDTH NO-BREAK SPACE
... '''
>>> import unicodedata
>>> wsp = [unicodedata.lookup(c)
... for c in chars.splitlines()
... if c.strip() and not c.startswith('#')]
>>>
>>> # unicodedata.lookup() fails for NEXT LINE (U+0085) here, so append it by code point
... wsp.append(u'\u0085')
>>>
>>> selector.xpath(u'''
... //p[normalize-space(translate(., "%(in)s", "%(out)s"))]
... ''' % {'in': ''.join(wsp),
... 'out': ' '*len(wsp)
... }).extract()
[u'<p>something</p>', u'<p><a href="http://example.com">some link</a></p>']
>>>
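The same idea can be wrapped in a small helper that builds the XPath string, so it can be reused across spiders. This is a minimal sketch under assumptions: the function name non_blank_xpath and the EXTRA_WHITESPACE constant are hypothetical, and the character list is a shortened subset of the table above, not a complete one. It is written for Python 3 and only builds the expression; passing it to response.xpath() is left to the caller.

```python
# Shortened, illustrative subset of the whitespace table above.
EXTRA_WHITESPACE = (
    "\u0085"  # NEXT LINE
    "\u00a0"  # NO-BREAK SPACE
    "\u2007"  # FIGURE SPACE
    "\u202f"  # NARROW NO-BREAK SPACE
    "\u3000"  # IDEOGRAPHIC SPACE
)


def non_blank_xpath(tag="p", chars=EXTRA_WHITESPACE):
    """Build an XPath selecting <tag> elements whose text is not blank,
    mapping each character in `chars` to a plain space before
    normalize-space() runs."""
    # translate() maps the i-th source character to the i-th target
    # character, so the replacement string must have the same length.
    return '//{tag}[normalize-space(translate(., "{src}", "{dst}"))]'.format(
        tag=tag, src=chars, dst=" " * len(chars)
    )


expression = non_blank_xpath("p")
print(repr(expression))
# Hypothetical use inside a spider callback:
#     response.xpath(non_blank_xpath("p")).extract()
```

Double quotes are used inside the expression so the whole thing can sit in a single-quoted Python string; the helper assumes `chars` itself contains no double quote.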
//p[translate(string(.),"\xa0","")]
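Nice try, but item['Description'] = response.xpath('//*[@id="textepresentation"]//p[translate(string(.),'\xa0','')]').extract() gives SyntaxError: unexpected character after line continuation character – jacquesseite
@jacquesseite String delimiter conflict. Always use double quotes inside the XPath expression, i.e. translate(string(.),"\xa0","") – har07
Edited to use double quotes. –
Thank you for this valuable, detailed explanation! It works as expected. Thanks! – jacquesseite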
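The quoting problem discussed in the comments can be reproduced without Scrapy. The sketch below (Python 3, shortened XPath, variable names are illustrative) keeps the failing line as source text and hands it to compile(): the inner single quote ends the Python string early, leaving a bare backslash that the parser reads as a line continuation.

```python
# The failing statement, stored as raw text so this script itself still parses.
bad_source = r"""xp = '//p[translate(string(.),'\xa0','')]'"""

try:
    compile(bad_source, "<comment>", "exec")
    bad_compiles = True
except SyntaxError as exc:
    bad_compiles = False
    print("SyntaxError:", exc.msg)

# Fixed version: single quotes delimit the Python string,
# double quotes delimit the XPath string literals.
good_xpath = '//p[translate(string(.),"\xa0","")]'
print(repr(good_xpath))
```

In the fixed version Python turns \xa0 into the actual NO-BREAK SPACE character before the expression reaches the XPath engine, which is what translate() needs, since XPath 1.0 has no escape sequences of its own.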