2016-12-15 75 views

I'm scraping wave heights from a few websites, and my code fails whenever the wave height reaches the double-digit range. For example: the current code scrapes "12" as a separate "1" and "2" instead of "12". How do I scrape the whole integer in Python with Beautiful Soup?

#Author: David Owens 
#File name: soupScraper.py 
#Description: html scraper that takes surf reports from various websites 

import csv 
import requests 
from bs4 import BeautifulSoup 

NUM_SITES = 2 

reportsFinal = [] 

###################### SURFLINE URL STRINGS AND TAG ########################### 

slRootUrl = 'http://www.surfline.com/surf-report/' 
slSunsetCliffs = 'sunset-cliffs-southern-california_4254/' 
slScrippsUrl = 'scripps-southern-california_4246/' 
slBlacksUrl = 'blacks-southern-california_4245/' 
slCardiffUrl = 'cardiff-southern-california_4786/' 

slTagText = 'observed-wave-range' 
slTag = 'id' 

#list of surfline URL endings 
slUrls = [slSunsetCliffs, slScrippsUrl, slBlacksUrl] 

############################################################################### 


#################### MAGICSEAWEED URL STRINGS AND TAG ######################### 

msRootUrl = 'http://magicseaweed.com/' 
msSunsetCliffs = 'Sunset-Cliffs-Surf-Report/4211/' 
msScrippsUrl = 'Scripps-Pier-La-Jolla-Surf-Report/296/' 
msBlacksUrl = 'Torrey-Pines-Blacks-Beach-Surf-Report/295/' 

msTagText = 'rating-text' 
msTag = 'li' 

#list of magicseaweed URL endings 
msUrls = [msSunsetCliffs, msScrippsUrl, msBlacksUrl] 

############################################################################### 

''' 
This class represents a surf break. It contains all wave, wind, & tide data 
associated with that break relevant to the website 
''' 
class surfBreak: 
    def __init__(self, name,low, high, wind, tide): 
     self.name = name 
     self.low = low 
     self.high = high 
     self.wind = wind 
     self.tide = tide  

    #toString method  
    def __str__(self): 
     return '{0}: Wave height: {1}-{2} Wind: {3} Tide: {4}'.format(self.name, 
      self.low, self.high, self.wind, self.tide) 
#END CLASS 

''' 
This returns the proper attribute from the surf report sites 
''' 
def reportTagFilter(tag): 
    return (tag.has_attr('class') and 'rating-text' in tag['class']) \ 
     or (tag.has_attr('id') and tag['id'] == 'observed-wave-range') 
#END METHOD 

''' 
This method checks if the parameter is of type int 
''' 
def representsInt(s): 
    try: 
     int(s) 
     return True 

    except ValueError: 
     return False 
#END METHOD 

''' 
This method extracts all ints from a list of reports 

reports: The list of surf reports from a single website 

returns: reportNums - A list of ints of the wave heights 
''' 
def extractInts(reports): 
    print reports 
    reportNums = [] 
    afterDash = False 
    num = 0 
    tens = 0 
    ones = 0 

    #extract all ints from the reports and ditch the rest 
    for report in reports: 
     for char in report: 
      if representsInt(char) == True: 

       num = int(char)     
       reportNums.append(num) 

      else: 
       afterDash = True 

    return reportNums 
#END METHOD 

''' 
This method iterates through a list of urls and extracts the surf report from 
the webpage dependent upon its tag location 

rootUrl: The root url of each surf website 
urlList: A list of specific urls to be appended to the root url for each 
     break 

tag:  the html tag where the actual report lives on the page 

returns: a list of strings of each breaks surf report 
''' 
def extractReports(rootUrl, urlList, tag, tagText): 
    #empty list to hold reports 
    reports = [] 
    reportNums = [] 
    index = 0 

    #loop thru URLs 
    for url in urlList: 
     try: 
      index += 1 
      #request page 
      request = requests.get(rootUrl + url) 

      #turn into soup 
      soup = BeautifulSoup(request.content, 'lxml') 

      #get the tag where surflines report lives 
      reportTag = soup.findAll(reportTagFilter)[0] 

      reports.append(reportTag.text.strip())  

     #notify if fail 
     except: 
      print 'scrape failure at URL ', index 
      pass 

    reportNums = extractInts(reports) 

    return reportNums 
#END METHOD 

''' 
This method calculates the average of the wave heights 
''' 
def calcAverages(reportList): 
    #empty list to hold averages 
    finalAverages = [] 
    listIndex = 0 
    waveIndex = 0 

    #loop thru list of reports to calc each breaks ave low and high 
    for x in range(0, 6): 
      #get low ave 
      average = (reportList[listIndex][waveIndex] 
       + reportList[listIndex+1][waveIndex])/NUM_SITES 

      finalAverages.append(average) 

      waveIndex += 1 

    return finalAverages 
#END METHOD 

slReports = extractReports(slRootUrl, slUrls, slTag, slTagText) 
msReports = extractReports(msRootUrl, msUrls, msTag, msTagText) 

reportsFinal.append(slReports) 
reportsFinal.append(msReports) 

print 'Surfline:  ', slReports 
print 'Magicseaweed: ', msReports 

So what you're really asking is why your `extractInts` doesn't parse the values correctly? That would be your actual [MCVE](http://stackoverflow.com/help/mcve), since the data itself seems to be scraped from the websites correctly. –


@TeemuRisikko I think it scrapes the entire HTML element, which is why I have `extractInts` to parse it. –


Yes, it does scrape the whole element, but beyond that point (and because of that), it no longer has anything to do with Beautiful Soup, since bs can't parse the text any further in the way you describe. Just semantics, but it's always good to know the root of the problem. :) –

Answer


You aren't actually extracting integers but floats, it seems, since the reports values look like ['0.3-0.6 m']. Right now you just iterate over every single character and either convert it to an int or discard it, so no wonder you only get single digits.
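To see why, here is a minimal reproduction of the character-by-character approach (the report string is a made-up sample, not live site data):

```python
# Each character is tested individually, so multi-digit numbers fall apart.
report = "12-15 ft"
digits = [int(ch) for ch in report if ch.isdigit()]
print(digits)  # [1, 2, 1, 5] - the "12" and "15" are lost
```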

An (arguably) easier way to extract those numbers from the string is a regular expression:

import re 

FLOATEXPR = re.compile(r"(\d+\.\d)-(\d+\.\d) {0,1}m") 

def extractFloats(reports): 
    reportNums = [] 
    for report in reports: 
        #match the low-high range and pull out both capture groups 
        groups = re.match(FLOATEXPR, report).groups() 
        for group in groups: 
            reportNums.append(float(group)) 
    return reportNums 

This expression will match your floats and return them as a list.
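For example, running the pattern over some Magicseaweed-style report strings (sample inputs, not live data):

```python
import re

# Same pattern as in the answer above.
FLOATEXPR = re.compile(r"(\d+\.\d)-(\d+\.\d) {0,1}m")

reports = ['0.3-0.6 m', '1.2-1.8m']
heights = [float(g) for r in reports for g in FLOATEXPR.match(r).groups()]
print(heights)  # [0.3, 0.6, 1.2, 1.8]
```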

In detail, the expression matches any sequence that has at least one digit before a '.' and one digit after it, a '-' in between, another such float, and ends with 'm' or ' m'. It then groups the parts representing the floats into a tuple. For example, ['12.0-3.0m'] would return [12.0, 3.0]. If you expect more digits after the decimal point, you can add an extra '+' after the second '\d' in each group of the expression.
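Since Surfline reports whole-foot heights (including the two-digit "12" from the question) while Magicseaweed uses metric floats, a more permissive sketch is to use `re.findall` with a pattern accepting both ints and floats; the sample strings below are assumptions about the site formats:

```python
import re

# Matches "12", "1.5", etc.; keeps multi-digit numbers intact.
NUM = re.compile(r"\d+(?:\.\d+)?")

print([float(n) for n in NUM.findall('10-12 ft')])   # [10.0, 12.0]
print([float(n) for n in NUM.findall('0.3-0.6 m')])  # [0.3, 0.6]
```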