2015-07-09 81 views

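Parsing a <br> tag with Python and BeautifulSoup

I'm trying to scrape golf course data from a website and write a CSV containing each course's name and address. The address on the site I'm pulling from contains a <br> tag that splits it in two. Is it possible to parse the two parts of the address, separated by the <br>, into two separate columns?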

The HTML looks like this:

<div class="location">10799 E 550 S<br>Zionsville, Indiana, United States</div> 

I would like this to be split into:

Column1:10799 E 550 S 
Column2:Zionsville, Indiana, United States 

Here is my code:

import csv 
import requests 
from bs4 import BeautifulSoup 

courses_list = [] 

with open('Garmin_GC.csv', 'w') as file: 
    writer = csv.writer(file) 
    for i in range(3):  # 893
        url = "http://sites.garmin.com/clsearch/courses/search?course=&location=&country=US&state=&holes=&radius=&lang=en&search_submitted=1&per_page={}".format(
            i * 20)
        r = requests.get(url)
        soup = BeautifulSoup(r.text)
        g_data2 = soup.find_all("div", {"class": "result"})
        for item in g_data2:
            try:
                name = item.find_all("div", {"class": "name"})[0].text
            except IndexError:
                name = ''
                print "No Name found!"
            try:
                address = item.find_all("div", {"class": "location"})[0].get_text(separator=' ')
                print address
            except IndexError:
                address = ''
                print "No Address found!"
            writer.writerow([name.encode("utf-8"), address.encode("utf-8")])

Answers

1

Use the .stripped_strings generator:

address = list(item.find('div', class_='location').stripped_strings) 

This produces a list of two strings:

>>> from bs4 import BeautifulSoup 
>>> markup = '''<div class="location">10799 E 550 S<br>Zionsville, Indiana, United States</div>''' 
>>> soup = BeautifulSoup(markup) 
>>> list(soup.find('div', class_='location').stripped_strings) 
[u'10799 E 550 S', u'Zionsville, Indiana, United States'] 
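Not every listing is guaranteed to have exactly one <br>, so the list may come back with fewer or more than two items. The following is a minimal sketch, assuming Python 3 and the built-in 'html.parser'; split_address is a hypothetical helper name, not part of BeautifulSoup, and the choice to fold extra lines into the last column is an assumption about what the CSV should look like.

```python
from bs4 import BeautifulSoup


def split_address(location_div, columns=2):
    """Return exactly `columns` address parts from a <br>-separated div.

    Hypothetical helper: pads short results and folds extra lines
    into the last column so every CSV row has the same width.
    """
    # stripped_strings yields each text fragment with whitespace trimmed
    parts = list(location_div.stripped_strings) if location_div is not None else []
    # Fold any extra fragments into the last column
    if len(parts) > columns:
        parts = parts[:columns - 1] + [', '.join(parts[columns - 1:])]
    # Pad short results with empty strings
    return parts + [''] * (columns - len(parts))


markup = '<div class="location">10799 E 550 S<br>Zionsville, Indiana, United States</div>'
soup = BeautifulSoup(markup, 'html.parser')
print(split_address(soup.find('div', class_='location')))
```

Passing None (which is what find() returns when the div is missing) gives a row of empty strings, so the caller does not need a try/except around the lookup.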

Put in the context of your code:

try: 
    name = item.find('div', class_='name').text 
except AttributeError: 
    name = u'' 
try: 
    address = list(item.find('div', class_='location').stripped_strings) 
except AttributeError: 
    address = [u'', u''] 
writer.writerow([v.encode("utf-8") for v in [name] + address]) 

Here the two address values are written to two separate columns.
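The code above targets Python 2, where each unicode string has to be encoded before csv writes it. On Python 3 the strings can be written as they are, and the row is built the same way, by concatenating the name with the list of address parts. A minimal sketch, assuming Python 3; build_row and the sample course name are hypothetical, and an in-memory buffer stands in for the real output file:

```python
import csv
import io


def build_row(name, address_parts):
    """Flatten a name and a list of address parts into one CSV row."""
    # Concatenate so each address part becomes its own column
    return [name] + list(address_parts)


# Hypothetical sample values standing in for scraped results
rows = [
    build_row('Example Golf Club', ['10799 E 550 S', 'Zionsville, Indiana, United States']),
    build_row('', ['', '']),  # a result with no name and no address
]

# In-memory stand-in for open('Garmin_GC.csv', 'w', newline='', encoding='utf-8')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
print(buffer.getvalue())
```

The second address part contains commas, so csv.writer quotes it automatically and it still lands in a single column.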

I get this error: AttributeError: 'list' object has no attribute 'encode'. Can you help me with that? – Gonzalo68

@Gonzalo68: You now have a list of unicode strings. Encode the contents of the list, not the list itself. –

How do I go about doing that? – Gonzalo68