
Exporting my web scraping results from Python

This may be very simple, but I am very new to Python and I simply cannot figure out where to start.

So I have written code that successfully scrapes the data I want from a web page. My problem now is that I don't know how to export it to csv. This is what my code looks like:

import requests
import csv
from bs4 import BeautifulSoup

for numb in range(1, 3):
    urls = "http://www.blocket.se/bostad/uthyres?cg_multi=3020&cg_multi=3100&cg_multi=3120&cg_multi=3060&cg_multi=3070&sort=&ss=&se=&ros=&roe=&bs=&be=&mre=&q=&q=&q=&save_search=1&l=0&md=th&o=" + str(numb) + "&f=p&f=c&f=b&ca=11&w=3"
    r = requests.get(urls)
    soup = BeautifulSoup(r.text, 'html.parser')
    data = soup.find_all("div", {"itemtype": "http://schema.org/Offer"})

    for item in data:
        try:
            print item.contents[3].find_all("span", {"class": "subject-param category"})[0].text
        except:
            pass
        try:
            print item.contents[3].find_all("span", {"class": "subject-param address separator"})[0].text
        except:
            pass
        try:
            print item.contents[3].find_all("span", {"class": "li_detail_params first rooms"})[0].text
        except:
            pass
        try:
            print item.contents[3].find_all("span", {"class": "li_detail_params monthly_rent"})[0].text
        except:
            pass
        try:
            print item.contents[3].find_all("span", {"class": "li_detail_params size"})[0].text
        except:
            pass
        try:
            print item.contents[3].find_all("span", {"class": "li_detail_params first weekly_rent_offseason"})[0].text
        except:
            pass

And it prints this:

lägenhet 

       Stockholms stad - Bromma 

1 rum 
4 000 kr/mån 

      villa 

       Linköping 

100 m² 

      lägenhet 

       Stockholms stad - Maria, Gamla Stan, Högalid 

1 rum 
8 000 kr/mån 
36 m² 

      lägenhet 

       Stockholms stad - Hägersten, Liljeholmen 

1 rum 
7 500 kr/mån 
26 m² 

Of course it is not the prettiest output, but I don't really care about that. Now, can someone point me to how I can export this to csv? As I said, I don't even know where to start.


Have you even tried Google? – PascalVKooten 2015-04-02 21:31:16


Don't catch every exception; 'except: pass' is never a good idea – 2015-04-02 21:37:06
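As an illustration of that comment's point, the lookup could be wrapped in a helper that catches only the exceptions the scraping code is expected to raise. A minimal sketch (the helper name is hypothetical, not from the original post):

def get_span_text(item, class_name):
    # Return the text of the first span with the given class, or None.
    # Catching only IndexError/AttributeError means an unexpected bug
    # (e.g. a typo that raises NameError) still surfaces instead of
    # being silently swallowed by a bare 'except: pass'.
    try:
        return item.contents[3].find_all("span", {"class": class_name})[0].text
    except (IndexError, AttributeError):
        return None

Each 'try/print/except: pass' block above could then shrink to a single call such as get_span_text(item, "subject-param category").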

Answer


Instead of the print statements, add your information to a list. At the end, use csv.writer to spit it out to the console:

import unicodecsv as csv
from bs4 import BeautifulSoup
import requests
import StringIO


for numb in range(1, 3):
    urls = "http://www.blocket.se/bostad/uthyres?cg_multi=3020&cg_multi=3100&cg_multi=3120&cg_multi=3060&cg_multi=3070&sort=&ss=&se=&ros=&roe=&bs=&be=&mre=&q=&q=&q=&save_search=1&l=0&md=th&o=" + str(numb) + "&f=p&f=c&f=b&ca=11&w=3"
    r = requests.get(urls)
    soup = BeautifulSoup(r.text, 'html.parser')

    # Every offer on the page becomes one row-to-be in the CSV.
    data = soup.find_all("div", {"itemtype": "http://schema.org/Offer"})

    data_list = []
    for item in data:
        data_item = {}
        try:
            data_item['category'] = item.contents[3].find_all("span", {"class": "subject-param category"})[0].text
        except:
            pass
        try:
            data_item['address separator'] = item.contents[3].find_all("span", {"class": "subject-param address separator"})[0].text
        except:
            pass
        try:
            data_item['first rooms'] = item.contents[3].find_all("span", {"class": "li_detail_params first rooms"})[0].text
        except:
            pass
        try:
            data_item['monthly_rent'] = item.contents[3].find_all("span", {"class": "li_detail_params monthly_rent"})[0].text
        except:
            pass
        try:
            data_item['size'] = item.contents[3].find_all("span", {"class": "li_detail_params size"})[0].text
        except:
            pass
        try:
            data_item['weekly_rent_offseason'] = item.contents[3].find_all("span", {"class": "li_detail_params first weekly_rent_offseason"})[0].text
        except:
            pass
        data_list.append(data_item)

    # Write this page's rows into an in-memory buffer and dump it to stdout.
    out = StringIO.StringIO()
    csv_writer = csv.writer(out)
    for row in data_list:
        csv_writer.writerow(row.values())
    print out.getvalue()

You will need to install the following libraries on top of the base system:

1. unicodecsv - for writing non-ASCII characters
2. beautifulsoup4 - for HTML parsing
3. requests - for HTTP access
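All three are available on PyPI, so something like 'pip install unicodecsv beautifulsoup4 requests' should pull them in.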

This does spit out CSV for me; let me know if it doesn't work for you.


Thanks! I suspected I had to store the data somehow, because before I tried to store it, for some reason I only got one item saved, the last one. I did run into a small problem though; it gives me an error message that looks like this: 'Traceback (most recent call last): File "/Users/fredrikkopsch/Documents/PythonPrograms/scapingblocket.py", line 83, in csv_writer.writerows(data_list) Error: sequence expected' Any suggestions as to what could be wrong? Should this code work for me, printing the row data for each variable? Thanks again @hd1 – FredrikKopsch 2015-04-03 07:17:18
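For what it's worth, that 'sequence expected' error comes from passing dicts to csv_writer.writerows, which expects lists or tuples. A minimal sketch of one way around it, using unicodecsv's DictWriter (the fixed fieldnames list and the sample row are assumptions for illustration):

import unicodecsv as csv
import StringIO

# Sample row standing in for the data_list built by the answer's loop.
data_list = [{'category': u'l\xe4genhet', 'monthly_rent': u'4 000 kr/m\xe5n'}]

fieldnames = ['category', 'address separator', 'first rooms',
              'monthly_rent', 'size', 'weekly_rent_offseason']

out = StringIO.StringIO()
writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
writer.writeheader()
# DictWriter accepts dicts directly and keeps the columns aligned even
# when some rows are missing keys (restval fills the gaps).
writer.writerows(data_list)
print out.getvalue()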


You're very welcome.. I just fixed the script and added more information about the dependencies. – hd1 2015-04-03 07:41:01


Thanks! Now it works without errors. One more question: how do I print it to a csv file that I can open in Excel? @hd1 – FredrikKopsch 2015-04-03 08:07:19
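To get an actual file rather than console output, the StringIO buffer in the answer can be swapped for a file opened in binary mode, since unicodecsv encodes each row to bytes (UTF-8 by default). A minimal sketch along the lines of the answer above, with a hypothetical output file name:

import unicodecsv as csv

# data_list as built by the answer's scraping loop; a sample row is
# included here so the sketch runs on its own.
data_list = [{'category': u'l\xe4genhet', 'size': u'36 m\xb2'}]

fieldnames = ['category', 'address separator', 'first rooms',
              'monthly_rent', 'size', 'weekly_rent_offseason']

# 'wb' because unicodecsv writes encoded bytes; 'offers.csv' is a
# hypothetical name. Excel can open the resulting file directly or via
# its text-import wizard.
with open('offers.csv', 'wb') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(data_list)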