Scraping data that follows a form submission with Scrapy

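I am trying to scrape content from a listing's detail page, which can only be reached by clicking a "View" button that triggers a form submission. I am new to both Python and Scrapy.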

Example markup

<li><h3>Abc Widgets</h3> 
    <form action="/viewlisting?id=123" method="post"> 
     <input type="image" src="/images/view.png" value="submit" > 
    </form> 
</li>

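My approach in Scrapy is to extract the form action, then issue a Request with a callback that parses the returned page for the content I want. However, I have hit a couple of problems: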

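I get the following error: "Request url must be str or unicode"
Secondly, when I hardcode the URL to get around the problem above, my parse function appears to return something that looks like a list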

Here is my code, with the real URLs redacted

from scrapy.spiders import Spider 
from scrapy.selector import Selector 
from scrapy.http import Request 
from wfi2.items import Wfi2Item 

class ProfileSpider(Spider): 
    name = "profiles" 

    allowed_domains = ["wfi.com.au"] 
    start_urls = ["http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=WA", 
    "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=VIC", 
    "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=QLD", 
    "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NSW", 
    "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=TAS", 
    "http://example.com/wps/wcm/connect/internet/wfi/Contact+Us/Find+Your+Local+Office/findYourLocalOffice.jsp?state=NT" 
    ] 



    def parse(self, response): 

     hxs = Selector(response) 
     forms = hxs.xpath('//*[@id="area-managers"]//*/form') 

     for form in forms: 

      action = form.xpath('@action').extract() 
      print "ACTION: ", action 
      #request = Request(url=action,callback=self.parse_profile) 
      request = Request(url=action,callback=self.parse_profile) 
      yield request 

    def parse_profile(self, response): 
     hxs = Selector(response) 
     profile = hxs.xpath('//*[@class="contentContainer"]/*/text()') 

     print "PROFILE", profile

Source

2015-10-20 htmlr

I get the following error: "Request url must be str or unicode"

Please have a look at the Scrapy documentation for extract(). It says: "Serialize and return the matched nodes as a list of unicode strings" (emphasis added by me).

The first element of that list is probably what you want, so you could do:

request = Request(url=response.urljoin(action[0]), callback=self.parse_profile)
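To see why indexing the list helps, here is a minimal sketch that runs without Scrapy. It assumes extract() returned a one-element list for the sample markup above. The base_url value and the first_action_url helper are hypothetical, and the standard library's urljoin stands in for response.urljoin, which resolves a relative URL against the page in a similar way.

```python
from urllib.parse import urljoin


def first_action_url(base_url, actions):
    """Return the absolute URL for the first extracted action, or None."""
    # `actions` stands in for the list that extract() returns (assumption).
    if not actions:
        return None
    # Passing the whole list to Request raises the error from the question;
    # a single string joined onto the page URL is what Request expects.
    return urljoin(base_url, actions[0])


# Hypothetical page URL and the action taken from the sample markup.
base_url = "http://example.com/wps/findYourLocalOffice.jsp?state=WA"
actions = ["/viewlisting?id=123"]

url = first_action_url(base_url, actions)
print(url)
```

Returning None for an empty list is just one way to avoid an IndexError when a form has no action attribute; skipping such forms inside the loop would work as well.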

Secondly, when I hardcode the URL to get around the problem above, my parse function appears to return something that looks like a list

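According to the xpath documentation, this is a SelectorList. Add extract() to the xpath call and you will get a list of the text fragments. Eventually you will want to clean up and join the elements of that list before processing them further.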
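The clean-and-join step could look roughly like the sketch below. The fragments list is made-up sample data standing in for what extract() might return for the profile page, and clean_text is a hypothetical helper name, not part of Scrapy.

```python
def clean_text(fragments):
    """Strip each fragment, drop the empty ones, and join with single spaces."""
    stripped = (fragment.strip() for fragment in fragments)
    return " ".join(part for part in stripped if part)


# Made-up sample of extracted text nodes, including whitespace-only entries.
fragments = ["\n  Abc Widgets ", "\n", " 1 Example St  ", ""]

profile_text = clean_text(fragments)
print(profile_text)
```

In the spider this would be applied to the extracted list inside parse_profile, before the text is stored in an item.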

Source

2015-10-21 09:06:53

Thanks for the clear explanation and for pointing out the relevant sections of the documentation – htmlr
