How to select and extract the text between two elements?

I want to scrape this website using Scrapy. The page structure is as follows:

<div class="list"> 
    <a id="follows" name="follows"></a> 
<h4 class="li_group">Follows</h4> 
<div class="soda odd"><a href="...">Star Trek</a></div> 
<div class="soda even"><a href="...">Star Trek: The Animated Series</a></div> 
<div class="soda odd"><a href="..">Star Trek: The Motion Picture</a></div> 
<div class="soda even"><a href="..">Star Trek II: The Wrath of Khan</a></div> 
<div class="soda odd"><a href="..">Star Trek III: The Search for Spock</a></div> 
<div class="soda even"><a href="..">Star Trek IV: The Voyage Home</a></div> 
    <a id="followed_by" name="followed_by"></a> 
<h4 class="li_group">Followed by</h4> 
<div class="soda odd"><a href="..">Star Trek V: The Final Frontier</a></div> 
<div class="soda even"><a href="..">Star Trek VI: The Undiscovered Country</a></div> 
<div class="soda odd"><a href="..">Star Trek: Deep Space Nine</a></div> 
<div class="soda even"><a href="..">Star Trek: Generations</a></div> 
<div class="soda odd"><a href="..">Star Trek: Voyager</a></div> 
<div class="soda even"><a href="..">First Contact</a></div> 
    <a id="spin_off" name="spin_off"></a> 
<h4 class="li_group">Spin-off</h4> 
<div class="soda odd"><a href="..">Star Trek: The Next Generation - The Transinium Challenge</a></div> 
<div class="soda even"><a href="..">A Night with Troi</a></div> 
<div class="soda odd"><a href="..">Star Trek: Deep Space Nine</a></div> 
</div>

I want to extract the text between <h4 class="li_group">Follows</h4> and <h4 class="li_group">Followed by</h4>, and then the text between <h4 class="li_group">Followed by</h4> and <h4 class="li_group">Spin-off</h4>. I tried this code:

def parse(self, response):
    for sel in response.css("div.list"):
        item = ImdbcoItem()
        item['Follows'] = sel.css("a#follows+h4.li_group ~ div a::text").extract()
        item['Followed_by'] = sel.css("a#followed_by+h4.li_group ~ div a::text").extract()
        item['Spin_off'] = sel.css("a#spin_off+h4.li_group ~ div a::text").extract()
    return item

However, the first item extracts all the divs, not just the divs between <h4 class="li_group">Follows</h4> and <h4 class="li_group">Followed by</h4>.
Any help would be really appreciated!

Source

2017-08-30 haben

Just in case it helps: doesn't imdb.com have an (un)official API somewhere? If I remember correctly, you could get all this data from it cleanly. – Neil

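An extraction pattern I like to use for these cases is:

loop over the "boundaries" (here, the h4 elements)
while enumerating them starting from 1,
use XPath's following-sibling axis, like in @Andersson's answer, to get the elements that come after the current boundary,
and filter them by counting the number of preceding "boundary" elements, since we know from the enumeration where we are, which drops everything past the next boundary.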

This would be the loop:

$ scrapy shell 'http://www.imdb.com/title/tt0092455/trivia?tab=mc&ref_=tt_trv_cnn' 
(...) 
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1): 
...  print(cnt, h4.xpath('normalize-space()').get()) 
... 
1 Follows  
2 Followed by  
3 Edited into  
4 Spun-off from  
5 Spin-off  
6 Referenced in  
7 Featured in  
8 Spoofed in

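Here is an example that uses the enumeration to get the elements between boundaries (note the $cnt in the expression and the cnt=cnt passed to .xpath(); this uses XPath variables):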

>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1): 
...  print(cnt, h4.xpath('normalize-space()').get()) 
...  print(h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]', 
...      cnt=cnt).xpath(
...       'string(.//a)').getall()) 
... 
1 Follows  
['Star Trek', 'Star Trek: The Animated Series', 'Star Trek: The Motion Picture', 'Star Trek II: The Wrath of Khan', 'Star Trek III: The Search for Spock', 'Star Trek IV: The Voyage Home'] 
2 Followed by  
['Star Trek V: The Final Frontier', 'Star Trek VI: The Undiscovered Country', 'Star Trek: Deep Space Nine', 'Star Trek: Generations', 'Star Trek: Voyager', 'First Contact', 'Star Trek: Insurrection', 'Star Trek: Enterprise', 'Star Trek: Nemesis', 'Star Trek', 'Star Trek Into Darkness', 'Star Trek Beyond', 'Star Trek: Discovery', 'Untitled Star Trek Sequel'] 
3 Edited into  
['Reading Rainbow: The Bionic Bunny Show', 'The Unauthorized Hagiography of Vincent Price'] 
4 Spun-off from  
['Star Trek'] 
5 Spin-off  
['Star Trek: The Next Generation - The Transinium Challenge', 'A Night with Troi', 'Star Trek: Deep Space Nine', "Star Trek: The Next Generation - Future's Past", 'Star Trek: The Next Generation - A Final Unity', 'Star Trek: The Next Generation: Interactive VCR Board Game - A Klingon Challenge', 'Star Trek: Borg', 'Star Trek: Klingon', 'Star Trek: The Experience - The Klingon Encounter'] 
6 Referenced in  
(...)

Here is how you could use it to populate an item (here, I use a simple dict just for illustration):

>>> item = {} 
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1): 
...  key = h4.xpath('normalize-space()').get().strip() # there are some non-breaking spaces 
...  if key in ['Follows', 'Followed by', 'Spin-off']: 
...   values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]', 
...      cnt=cnt).xpath(
...       'string(.//a)').getall() 
...   item[key] = values 
... 

>>> from pprint import pprint 
>>> pprint(item) 
{'Followed by': ['Star Trek V: The Final Frontier', 
       'Star Trek VI: The Undiscovered Country', 
       'Star Trek: Deep Space Nine', 
       'Star Trek: Generations', 
       'Star Trek: Voyager', 
       'First Contact', 
       'Star Trek: Insurrection', 
       'Star Trek: Enterprise', 
       'Star Trek: Nemesis', 
       'Star Trek', 
       'Star Trek Into Darkness', 
       'Star Trek Beyond', 
       'Star Trek: Discovery', 
       'Untitled Star Trek Sequel'], 
'Follows': ['Star Trek', 
      'Star Trek: The Animated Series', 
      'Star Trek: The Motion Picture', 
      'Star Trek II: The Wrath of Khan', 
      'Star Trek III: The Search for Spock', 
      'Star Trek IV: The Voyage Home'], 
'Spin-off': ['Star Trek: The Next Generation - The Transinium Challenge', 
       'A Night with Troi', 
       'Star Trek: Deep Space Nine', 
       "Star Trek: The Next Generation - Future's Past", 
       'Star Trek: The Next Generation - A Final Unity', 
       'Star Trek: The Next Generation: Interactive VCR Board Game - A ' 
       'Klingon Challenge', 
       'Star Trek: Borg', 
       'Star Trek: Klingon', 
       'Star Trek: The Experience - The Klingon Encounter']} 
>>>

Source

2017-08-30 10:33:46

Thanks, works like a charm. But I can't figure out how to use it in my code. Could you give me a hint or provide some code I can use? – haben

See my edited answer. –

You can try the following XPath expressions to get

all text nodes of the "Follows" block:
```
//div[./preceding-sibling::h4[1]="Follows"]//text() 
```

all text nodes of the "Followed by" block:

//div[./preceding-sibling::h4[1]="Followed by"]//text()

all text nodes of the "Spin-off" block:

//div[./preceding-sibling::h4[1]="Spin-off"]//text()

Source

2017-08-30 10:14:03 Andersson

You can even simplify `[./preceding-sibling::h4[1][.="Follows"]]` to `[./preceding-sibling::h4[1]="Follows"]` –

Yes, that makes sense. Thanks – Andersson

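You are really awesome, Mr. Andersson. What an expression! Is it possible to write a CSS selector that targets the same thing? One example would be enough. Thanks. – SIM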
