2013-09-01 25 views
2

這裏是我的spider.py文件:Scrapy的Python聲明

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scinews_com.items import scinews_item 
import codecs 
import sys 

class sci_news_com(BaseSpider): 
    sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout, errors='replace') 
    name = "scinews" 
    allowed_domains = ["sci-news.com"] 
    start_urls = [ 
     "http://www.sci-news.com/news/astronomy" 
    ] 

    def parse(self, response): 
     hxs = HtmlXPathSelector(response) 
     bottomposts = hxs.select('//div[@class="bottom-recentpost-wrapper-cat"]')  
     items = [] 
     for bottompost in bottomposts: 
      item = scinews_item() 
      item['Article_Title'] = bottompost.select('div[@class="bottom-archive"]/div[@class="post-content-holder"]/div[@class="bottom-content-heading-0"]/h2/a/text()').extract() 
      item['Article_Desc'] = bottompost.select('div[@class="bottom-archive"]/div[@class="post-content-holder"]/div[@class="post-content"]/p/text()').extract() 
      item['Article_Date'] = bottompost.select('div[@class="bottom-archive"]/div[@class="post-content-holder"]/div[@class="recentpost-dateauthor"]/text()').extract() 
      item['Article_Author'] = bottompost.select('div[@class="bottom-archive"]/div[@class="post-content-holder"]/div[@class="recentpost-dateauthor"]/a/text()').extract() 
      item['Article_Link'] = bottompost.select('div[@class="bottom-archive"]/div[@class="post-content-holder"]/div[@class="bottom-content-heading-0"]/h2/a/@href').extract() 
      item['Article_Image'] = bottompost.select('div[@class="bottom-archive"]/div[@class="bottom-recentpost-image-0"]/a/img/@src').extract() 
      items.append(item) 
     return items 

而且我items.py:

from scrapy.item import Item, Field 

class scinews_item(Item): 
    Article_Title = Field() 
    Article_Desc = Field() 
    Article_Date = Field() 
    Article_Author = Field() 
    Article_Link = Field() 
    Article_Image = Field() 

出於某種原因,並沒有給每個select語句到它自己的「場()「。這是一起輸出所有的人(在CSV & JSON輸出):

Article_Title,Article_Image,Article_Desc,Article_Date,Article_Link,Article_Author 
" HIP 102152: Astronomers Find Oldest Solar Twin 250 Light-Years Away , ALMA Sees Spectacular Newborn Star 1400 Light-Years Away , Astronomers Discover New Earth-Sized Exoplanet Kepler 78b , Hubble Zooms in on Galaxies in Early Universe , Chandra Sees Multimillion-Degree Gas Cloud in NGC 1232 , Pulsar Helps Astronomers Measure Magnetic Field around Milky Way’s Central Black Hole , Astronomers Discover First Ever Six-Image Lensed Quasar , Astronomers See Bizarre Pair in Large Magellanic Cloud ","http://cdn4.sci-news.com/images/2013/08/image_1344_1-HIP102152-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1325-Herbig-Haro-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1324f-Kepler-78b-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1318-galaxies-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1314-NGC-1232-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1312-pulsar-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1304f-quasar-195x110.jpg,http://cdn4.sci-news.com/images/2013/08/image_1301-LMC-195x110.jpg","Astronomers using ESO’s Very Large Telescope have identified the oldest solar twin known to date. 
This image shows the Sun-like star HIP 102152. Credit:...,A team of astronomers using the Atacama Large Millimeter/submillimeter Array (ALMA) has captured a beautiful close-up view of an object named Herbig-Haro...,An international team of astronomers reporting in the Astrophysical Journal (arXiv.org) has discovered an Earth-sized exoplanet called Kepler 78b that...,Astronomers using NASA’s Hubble Space Telescope have established that mature-looking galaxies existed much earlier than previously known, when the...,U.S. astronomer using NASA’s Chandra X-ray Observatory has discovered a massive cloud of hot gas likely caused by a collision between a dwarf galaxy...,An international team of astronomers has used observations of the newly discovered pulsar PSR J1745-2900 to measure the magnetic field emanating from a...,A team of scientists at the University of Copenhagen’s Niels Bohr Institute, Denmark, has reported the discovery of a six-image lensed quasar named...,A team of astronomers using the Very Large Telescope (VLT) at ESO’s Paranal Observatory in Chile has captured an image of two distinctive glowing..."," Aug 29, 2013 by 
, Aug 21, 2013 by 
, Aug 21, 2013 by 
, Aug 16, 2013 by 
, Aug 15, 2013 by 
, Aug 15, 2013 by 
, Aug 12, 2013 by 
, Aug 9, 2013 by 
","http://www.sci-news.com/astronomy/science-oldest-solar-twin-01344.html,http://www.sci-news.com/astronomy/science-alma-newborn-star-01325.html,http://www.sci-news.com/astronomy/science-new-earth-sized-exoplanet-kepler78b-01324.html,http://www.sci-news.com/astronomy/science-hubble-galaxies-early-universe-01318.html,http://www.sci-news.com/astronomy/science-chandra-gas-cloud-ngc1232-01314.html,http://www.sci-news.com/astronomy/science-pulsar-magnetic-field-milky-way-black-hole-01312.html,http://www.sci-news.com/astronomy/science-first-ever-six-image-lensed-quasar-01304.html,http://www.sci-news.com/astronomy/science-large-magellanic-cloud-01301.html","Sci-News.com,Enrico de Lazaro,Sergio Prostak,Enrico de Lazaro,Sci-News.com,Sci-News.com,Sergio Prostak,Sci-News.com" 

預先感謝您!

+0

我想你應該包括'的div [@類=「底-archive「]''在'bottomposts'選擇器內部的步驟,即''bottomposts = hxs.select('// div [@ class =」bottom-recentpost-wrapper-cat「]/div [@ class =」bottom-archive「 ]')'然後是'item ['Article_Title'] = bottompost.select('div [@ class =「post-content-holder」]/div [ class =「bottom-content-heading-0」]/h2/a/text()').extra()'等。 –

+0

@pault。 - 應該發佈爲答案。我測試了OP的代碼(單個項目),然後測試了您的建議,並看到了單個項目。 – Talvalin

+0

感謝您的測試@Talvalin。我沒有時間昨天測試一個修改過的蜘蛛。但我今天早上做了。現在會發佈一個答案。 –

回答

3

縱觀HTML源http://www.sci-news.com/news/astronomy

<div id="content" class="single-wrapper" role="main"> 
    <section> 
     <div class="post-wrapper-archive"> 
      <div class="related-post-wrapper-block">... 
      <div class="bottom-recentpost-wrapper-cat"> 
       <div class="bottom-archive"> 
        <div class="bottom-recentpost-image-0">... 
        <div class="post-content-holder">... 
        <div class="cb"></div> 
       </div> 
       <div class="bottom-archive">... 
       <div class="bottom-archive">... 
       <div class="bottom-archive">... 
       <div class="bottom-archive">... 
       <div class="bottom-archive">... 
       <div class="bottom-archive">... 
       <div class="bottom-archive">... 
       <div class="cb"></div> 
      </div> 
      <div id="pag"> 
     </div> 
    </section> 
</div> 

我建議你移動div[@class="bottom-archive"]步驟bottomposts選擇內:

class sci_news_com(BaseSpider): 
    sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout, errors='replace') 
    name = "scinews" 
    allowed_domains = ["sci-news.com"] 
    start_urls = [ 
     "http://www.sci-news.com/news/astronomy" 
    ] 

    def parse(self, response): 
     hxs = HtmlXPathSelector(response) 
     bottomposts = hxs.select('//div[@class="bottom-recentpost-wrapper-cat"]/div[@class="bottom-archive"]')  
     items = [] 
     for bottompost in bottomposts: 
      item = scinews_item() 
      item['Article_Title'] = bottompost.select('div[@class="post-content-holder"]/div[@class="bottom-content-heading-0"]/h2/a/text()').extract() 
      item['Article_Desc'] = bottompost.select('div[@class="post-content-holder"]/div[@class="post-content"]/p/text()').extract() 
      item['Article_Date'] = bottompost.select('div[@class="post-content-holder"]/div[@class="recentpost-dateauthor"]/text()').extract() 
      item['Article_Author'] = bottompost.select('div[@class="post-content-holder"]/div[@class="recentpost-dateauthor"]/a/text()').extract() 
      item['Article_Link'] = bottompost.select('div[@class="post-content-holder"]/div[@class="bottom-content-heading-0"]/h2/a/@href').extract() 
      item['Article_Image'] = bottompost.select('div[@class="bottom-recentpost-image-0"]/a/img/@src').extract() 
      items.append(item) 
     return items