How do I select and extract the text between two elements? I want to scrape this website using Scrapy. The page structure looks like this:
<div class="list">
<a id="follows" name="follows"></a>
<h4 class="li_group">Follows</h4>
<div class="soda odd"><a href="...">Star Trek</a></div>
<div class="soda even"><a href="...">...</a></div>
<div class="soda odd"><a href="..">Star Trek: The Motion Picture</a></div>
<div class="soda even"><a href="..">Star Trek II: The Wrath of Khan</a></div>
<div class="soda odd"><a href="..">Star Trek III: The Search for Spock</a></div>
<div class="soda even"><a href="..">Star Trek IV: The Voyage Home</a></div>
<a id="followed_by" name="followed_by"></a>
<h4 class="li_group">Followed by</h4>
<div class="soda odd"><a href="..">Star Trek V: The Final Frontier</a></div>
<div class="soda even"><a href="..">Star Trek VI: The Undiscovered Country</a></div>
<div class="soda odd"><a href="..">Star Trek: Deep Space Nine</a></div>
<div class="soda even"><a href="..">Star Trek: Generations</a></div>
<div class="soda odd"><a href="..">Star Trek: Voyager</a></div>
<div class="soda even"><a href="..">First Contact</a></div>
<a id="spin_off" name="spin_off"></a>
<h4 class="li_group">Spin-off</h4>
<div class="soda odd"><a href="..">Star Trek: The Next Generation - The Transinium Challenge</a></div>
<div class="soda even"><a href="..">A Night with Troi</a></div>
<div class="soda odd"><a href="..">Star Trek: Deep Space Nine</a></div>
</div>
I want to select and extract the text between <h4 class="li_group">Follows</h4>
and <h4 class="li_group">Followed by</h4>,
and then the text between <h4 class="li_group">Followed by</h4>
and <h4 class="li_group">Spin-off</h4>.
I tried this code:
def parse(self, response):
    for sel in response.css("div.list"):
        item = ImdbcoItem()
        item['Follows'] = sel.css("a#follows+h4.li_group ~ div a::text").extract()
        item['Followed_by'] = sel.css("a#followed_by+h4.li_group ~ div a::text").extract()
        item['Spin_off'] = sel.css("a#spin_off+h4.li_group ~ div a::text").extract()
        return item
But the first item extracts all of the divs, not just the ones between <h4 class="li_group">Follows</h4>
and <h4 class="li_group">Followed by</h4>.
Any help would be really appreciated!
Just in case it helps: imdb.com has an (un)official API somewhere. If I remember correctly, you can get all of this data in a clean form. – Neil
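The selectors in the question over-match because `~` is the general sibling combinator: it matches every later sibling div, not only the ones up to the next h4. One possible way around this is to walk the elements in document order and file each link under the most recent heading. The sketch below uses only Python's standard `html.parser` so that it runs on its own; it is not Scrapy code. The names `HeadingGrouper`, `group_links`, and `SAMPLE` are made up for illustration, and the sample markup is a shortened copy of the page structure above.

```python
from html.parser import HTMLParser


class HeadingGrouper(HTMLParser):
    """Collect link texts under the most recent <h4 class="li_group"> heading."""

    def __init__(self):
        super().__init__()
        self.groups = {}       # heading text -> list of link texts (insertion order)
        self._current = None   # heading whose group is currently being filled
        self._in_h4 = False
        self._in_a = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "h4" and dict(attrs).get("class") == "li_group":
            self._in_h4, self._buf = True, []
        elif tag == "a" and self._current is not None:
            # Only links that appear after some heading are collected.
            self._in_a, self._buf = True, []

    def handle_data(self, data):
        if self._in_h4 or self._in_a:
            self._buf.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buf).strip()
        if tag == "h4" and self._in_h4:
            # A new heading starts a new group; later links go there.
            self._in_h4 = False
            self._current = text
            self.groups[text] = []
        elif tag == "a" and self._in_a:
            self._in_a = False
            if text:  # skips the empty <a id="..."></a> anchors
                self.groups[self._current].append(text)


def group_links(html):
    """Return {heading: [link texts]} for the given HTML fragment."""
    parser = HeadingGrouper()
    parser.feed(html)
    parser.close()
    return parser.groups


# Shortened sample based on the structure shown in the question.
SAMPLE = """
<div class="list">
<a id="follows" name="follows"></a>
<h4 class="li_group">Follows</h4>
<div class="soda odd"><a href="..">Star Trek: The Motion Picture</a></div>
<div class="soda even"><a href="..">Star Trek II: The Wrath of Khan</a></div>
<a id="followed_by" name="followed_by"></a>
<h4 class="li_group">Followed by</h4>
<div class="soda odd"><a href="..">Star Trek V: The Final Frontier</a></div>
<a id="spin_off" name="spin_off"></a>
<h4 class="li_group">Spin-off</h4>
<div class="soda even"><a href="..">A Night with Troi</a></div>
</div>
"""

groups = group_links(SAMPLE)
for heading, titles in groups.items():
    print(heading, titles)
```

Inside a spider, the same restriction can probably be written as an XPath that counts preceding headings, for example `h4[.="Follows"]/following-sibling::div[count(preceding-sibling::h4)=1]/a/text()` relative to `div.list`, with the count raised to 2 and 3 for the later groups. This assumes the h4 elements are the only h4 siblings inside that div.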