How do I extract text together with hyperlink text in Scrapy?

I want to extract the text from the following HTML code:

<li> 
    <a test="test" href="abc.html" id="11">Click Here</a> 
    "for further reference" 
</li>

I tried the following extraction command:

response.css("article div#section-2 li::text").extract()

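But it only gives the "for further reference" line, while the expected output is "Click Here for further reference" as one string. How can I do this? And how should it be modified so that it also works for the following patterns:

Text Hyperlink Text
Hyperlink Text
Text Hyperlink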


2017-04-10 Shubham B.

You could try the XPath selector `response.xpath('//article/div[@id="section-2"]/li//text()').extract()` – vold

There are at least a couple of ways to do this.

Let's first build a test selector that mimics your response:

>>> import scrapy
>>> response = scrapy.Selector(text="""<li> 
...  <a test="test" href="abc.html" id="11">Click Here</a> 
...  "for further reference" 
... </li>""")

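First option: a small modification to your CSS selector. Select all text descendants, not only the direct text children (note the space between li and the ::text pseudo-element):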

# this is your CSS selector, 
# which only gives direct children text of your selected LI 
>>> response.css("li::text").extract()  
[u'\n ', u'\n "for further reference"\n'] 

# notice the extra space 
#     here 
#     | 
#     v 
>>> response.css("li ::text").extract() 
[u'\n ', u'Click Here', u'\n "for further reference"\n'] 

# using Python's join() to concatenate and build the full sentence 
>>> ''.join(response.css("li ::text").extract()) 
u'\n Click Here\n "for further reference"\n'
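
The joined string above still carries the newlines and indentation from the HTML source. Here is a minimal sketch of tidying it up with plain Python, assuming the input is the list of text nodes that `li ::text` returned above. The helper name `join_text_nodes` is made up for illustration and is not part of Scrapy:

```python
def join_text_nodes(nodes):
    """Concatenate text nodes and collapse every whitespace run to one space."""
    # str.split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace, so re-joining with " " normalizes the text.
    return " ".join("".join(nodes).split())


# Text nodes as returned by response.css("li ::text").extract() above
nodes = ['\n ', 'Click Here', '\n "for further reference"\n']
print(join_text_nodes(nodes))
```

This gives roughly the same result as the XPath normalize-space() function, but does the cleanup on the Python side.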

Another option is to chain your .css() call with a subsequent .xpath() call that uses the XPath 1.0 string() or normalize-space() function:

>>> response.css("li").xpath('string()').extract() 
[u'\n Click Here\n "for further reference"\n'] 
>>> response.css("li").xpath('normalize-space()').extract() 
[u'Click Here "for further reference"'] 

# calling `.extract_first()` gives you a string directly, not a list of 1 string 
>>> response.css("li").xpath('normalize-space()').extract_first() 
u'Click Here "for further reference"'


2017-04-10 13:20:39

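Thanks Paul, that is very helpful. But I have a question: the HTML page I am extracting from has many lists, like

....

...

so when I join them, everything gets concatenated instead of just the content of one particular list. How do I separate the list contents and join the text and hyperlink within a single list item only?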

Try this: `for list_item in response.css('li'): print(''.join(list_item.css('::text').extract()))`

I would use XPath. In that case, the selectors would be:

response.xpath('//article/div[@id="section-2"]/li/a/text()').extract()  # text of the hyperlink >> "Click Here"
response.xpath('//article/div[@id="section-2"]/li/a/@href').extract()  # target of the hyperlink >> "abc.html"
response.xpath('//article/div[@id="section-2"]/li/text()').extract()  # direct text nodes of the li >> "for further reference" (plus whitespace-only nodes)


2017-04-10 13:07:37 Mani
