
Trouble merging scraped data using pandas and numpy in Python

I want to collect information from many different urls and then combine the data based on year and golfer name. Right now I'm trying to write the information to CSV and match rows up with pd.merge(), but that way I have to use a unique dataframe name for every stat I want to merge. I've also tried numpy arrays, but I got stuck at the final step of getting all of the data out and separated.

import csv 
from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import datetime 
import socket 
import urllib.error 
import pandas as pd 
import urllib 
import sqlalchemy 
import numpy as np 

base = 'http://www.pgatour.com/' 
inn = 'stats/stat' 
end = '.html' 
years = ['2017','2016','2015','2014','2013'] 

alpha = [] 
#all pages with links to tables 
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html'] 
for i in urls: 
    data = urlopen(i) 
    soup = BeautifulSoup(data, "html.parser") 
    for link in soup.find_all('a'): 
     if link.has_attr('href'): 
      alpha.append(base + link['href'][17:]) #may need adjusting 
#data links 
beta = [] 
for i in alpha: 
    if inn in i: 
     beta.append(i) 
#no repeats 
gamma= [] 
for i in beta: 
    if i not in gamma: 
     gamma.append(i) 

#making list of urls with Statistic labels 
jan = [] 
for i in gamma: 
    try: 
     data = urlopen(i) 
     soup = BeautifulSoup(data, "html.parser") 
     for table in soup.find_all('section',{'class':'module-statistics-off-the-tee-details'}): 
      for j in table.find_all('h3'): 
       # strip characters that would be awkward in a filename 
       y = j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(")","").replace("(","").replace("=","").replace("+","") 
       jan.append([i,str(y+'.csv')]) 
       print([i,str(y+'.csv')]) 
    except Exception as e: 
      print(e) 
      pass 

# practice url 
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']] 
#grabbing data 
#write to csv 
row_sp = [] 
rows_sp =[] 
title1 = [] 
title = [] 
for i in jan: 
    try: 
     with open(i[1], 'w+') as fp: 
      writer = csv.writer(fp) 
      for y in years: 
       data = urlopen(i[0][:-4] +y+ end) 
       soup = BeautifulSoup(data, "html.parser") 
       data1 = urlopen(i[0]) 
       soup1 = BeautifulSoup(data1, "html.parser") 
       for table in soup1.find_all('table',{'id':'statsTable'}): 
        title.append('year') 
        for k in table.find_all('tr'): 
         for n in k.find_all('th'): 
          title1.append(n.get_text()) 
          for l in title1: 
           if l not in title: 
            title.append(l) 
        rows_sp.append(title) 
       for table in soup.find_all('table',{'id':'statsTable'}): 
        for h in table.find_all('tr'): 
         row_sp = [y] 
         for j in h.find_all('td'): 
          # nb: replace("d","") strips every lowercase "d" from the scraped text (e.g. "Adam" becomes "Aam") 
          row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d","")) 
         rows_sp.append(row_sp) 
         print(row_sp) 
         writer.writerows([row_sp]) 
    except Exception as e: 
     print(e) 
     pass 

from functools import reduce # reduce is not a builtin in Python 3 

dfs = [df1,df2,df3] # store dataframes in one list (df1, df2, df3 stand in for one frame per stat) 
df_merge = reduce(lambda left,right: pd.merge(left,right,on=['v1'], how='outer'), dfs) # 'v1' stands in for the shared key column(s) 

The urls, the statistic categories, the desired format... yes, just all the stuff in between; I'm trying to end up with one row of data per player. The data urls are: ['http://www.pgatour.com/stats/stat.02356.html', 'http://www.pgatour.com/stats/stat.02568.html', ..., 'http://www.pgatour.com/stats/stat.111.html']

The statistic titles:

LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE 
year rankthisweek ranklastweek name   events rating rounds avg 
2017 2    3    Rickie Fowler 10  8.8  62 .614  
TOTAL SG:APP MEASURED ROUNDS .... %  # SAVES # BUNKERS TOTAL O/U PAR 
26.386   43    ....70.37 76   108   +7.00 

Where in your code do you use pandas? And where is the attempted merge? – Parfait


It's not in the attempt yet, but it would be something like dataframes = [df1, df2, df3] # store the frames in one list, then df_merge = reduce(lambda left, right: pd.merge(left, right, on=['column'], how='outer'), dataframes). That's the process I'm trying to get to, but I can't get things to the point where I can use it. –


Why doesn't the chained merge work? Errors? Undesired results? Are you reading the CSVs back into dataframes? – Parfait

Answer


UPDATE (per the comments)
This question is partly about a specific technique (pandas merge()), but it also seems like an opportunity to walk through a useful workflow for data collection and cleaning, so I've added more detail and explanation than the coding solution strictly requires.

You can essentially use the same approach as in my original answer to grab data from your different url categories. I'd suggest keeping a {url: data} dictionary as you iterate over the url list, and then building cleaned dataframes from that dictionary.

Setting up the cleaning takes a bit of work, since you have to adjust for a different set of columns in each url category. I've demonstrated a manual approach below, using only a few test urls. But if you have thousands of different url categories, you may need to think about how to collect and organize the column names programmatically (there's a rough sketch of that further down, after the manual example). That feels beyond the scope of this question.

As long as you're sure every url category contains a year and a PLAYER NAME field, the merge below should work. As before, let's assume you don't need to write out to CSV, and let's hold off on any optimizations to your scraping code for now:

First, define the url categories in urls. By url category I mean that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times, by splicing a series of years into the url itself, e.g. http://www.pgatour.com/stats/stat.02356.2017.html and http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the url category that holds multiple years of data for one statistic.
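
A quick illustration of that splice; base_url here is just one of the test urls, and end = '.html' comes from your original setup:

base_url = 'http://www.pgatour.com/stats/stat.02356.html' 
for y in ['2017', '2016']: 
    print(base_url[:-4] + y + end) 
# http://www.pgatour.com/stats/stat.02356.2017.html 
# http://www.pgatour.com/stats/stat.02356.2016.html 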

import pandas as pd 

# test urls given by OP 
# note: each url contains >= 1 data fields not shared by the others 
urls = ['http://www.pgatour.com/stats/stat.02356.html', 
     'http://www.pgatour.com/stats/stat.02568.html', 
     'http://www.pgatour.com/stats/stat.111.html'] 

# we'll store data from each url category in this dict. 
url_data = {} 

Now iterate over urls. Inside the urls loop, this code is exactly the same as in my original answer, which in turn came from the OP, with only a few variable names adjusted to reflect our new capture method.

# urlopen, BeautifulSoup, years, and end are as defined in the question's setup 
for url in urls: 
    print("url: ", url) 
    url_data[url] = {"row_sp": [], 
        "rows_sp": [], 
        "title1": [], 
        "title": []} 
    try: 
     #with open(i[1], 'w+') as fp: 
      #writer = csv.writer(fp) 
     for y in years: 
      current_url = url[:-4] +y+ end 
      print("current url is: ", current_url) 
      data = urlopen(current_url) 
      soup = BeautifulSoup(data, "html.parser") 
      data1 = urlopen(url) 
      soup1 = BeautifulSoup(data1, "html.parser") 
      for table in soup1.find_all('table',{'id':'statsTable'}): 
       url_data[url]["title"].append('year') 
       for k in table.find_all('tr'): 
        for n in k.find_all('th'): 
         url_data[url]["title1"].append(n.get_text()) 
         for l in url_data[url]["title1"]: 
          if l not in url_data[url]["title"]: 
           url_data[url]["title"].append(l) 
       url_data[url]["rows_sp"].append(url_data[url]["title"]) 
      for table in soup.find_all('table',{'id':'statsTable'}): 
       for h in table.find_all('tr'): 
        url_data[url]["row_sp"] = [y] 
        for j in h.find_all('td'): 
         url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d","")) 
        url_data[url]["rows_sp"].append(url_data[url]["row_sp"]) 
        #print(row_sp) 
        #writer.writerows([row_sp]) 
    except Exception as e: 
     print(e) 
     pass 

Each url key in url_data now holds rows_sp with the data you're interested in for that particular url category.
Note that rows_sp is really url_data[url]["rows_sp"] now that we iterate over url_data, but the next few code blocks come from my original answer and therefore still use the old rows_sp variable name.
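
For instance, to pull the captured rows for one of the test categories back out:

rows_sp = url_data['http://www.pgatour.com/stats/stat.02356.html']["rows_sp"] 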

# example rows_sp 
[['year', 
    'RANK THIS WEEK', 
    'RANK LAST WEEK', 
    'PLAYER NAME', 
    'EVENTS', 
    'RATING', 
    'year', 
    'year', 
    'year', 
    'year'], 
['2017'], 
['2017', '1', '1', 'Sam Burns', '1', '9.2'], 
['2017', '2', '3', 'Rickie Fowler', '10', '8.8'], 
['2017', '2', '2', 'Dustin Johnson', '10', '8.8'], 
['2017', '2', '3', 'Whee Kim', '2', '8.8'], 
['2017', '2', '3', 'Thomas Pieters', '3', '8.8'], 
... 
] 

Creating a dataframe directly from rows_sp shows that the data isn't in quite the right format:

pd.DataFrame(rows_sp).head() 
     0    1    2    3  4  5  6 \ 
0 year RANK THIS WEEK RANK LAST WEEK  PLAYER NAME EVENTS RATING year 
1 2017   None   None   None None None None 
2 2017    1    1  Sam Burns  1  9.2 None 
3 2017    2    3 Rickie Fowler  10  8.8 None 
4 2017    2    2 Dustin Johnson  10  8.8 None 

     7  8  9 
0 year year year 
1 None None None 
2 None None None 
3 None None None 
4 None None None 

pd.DataFrame(rows_sp).dtypes 
0 object 
1 object 
2 object 
3 object 
4 object 
5 object 
6 object 
7 object 
8 object 
9 object 
dtype: object 

With a little bit of cleaning, we can get rows_sp into a dataframe with the appropriate numeric dtypes:

df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0) # row 0 is the header row itself 
# name the duplicated 'year' headers so they can be dropped 
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK", 
       "PLAYER NAME","EVENTS","RATING", 
       "year1","year2","year3","year4"] 
df.drop(["year1","year2","year3","year4"], axis=1, inplace=True) 
df = df.loc[df["PLAYER NAME"].notnull()] # drop the year-only spacer rows 
df = df.loc[df.year != "year"]   # drop stray repeated header rows 
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"] 
df[num_cols] = df[num_cols].apply(pd.to_numeric) 

df.head() 
    year RANK THIS WEEK RANK LAST WEEK  PLAYER NAME EVENTS RATING 
2 2017    1    1.0  Sam Burns  1  9.2 
3 2017    2    3.0 Rickie Fowler  10  8.8 
4 2017    2    2.0 Dustin Johnson  10  8.8 
5 2017    2    3.0  Whee Kim  2  8.8 
6 2017    2    3.0 Thomas Pieters  3  8.8 

Revised cleaning
Now that we have a series of url categories to contend with, each with a different set of fields to clean up, the section above gets a bit more complicated. If you only have a handful of pages, it may be feasible to just eyeball the fields for each category and store them manually, like this:

cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
             'PLAYER NAME', 'ROUNDS', 'AVERAGE', 
             'TOTAL SG:APP', 'MEASURED ROUNDS', 
             'year1', 'year2', 'year3', 'year4'], 
          'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS', 
             'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',] 
          }, 
     'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
            'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS', 
            'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'], 
         'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS', 
            '%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR'] 
         }, 
     'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 
             'PLAYER NAME', 'EVENTS', 'RATING', 
             'year1', 'year2', 'year3', 'year4'], 
          'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 
             'EVENTS', 'RATING'] 
          } 
     } 
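
If eyeballing every category won't scale, here's a rough sketch of deriving these entries programmatically from each scraped header row instead. make_cols_entry is a hypothetical helper; it assumes, as in the example rows_sp above, that row 0 is the header and that the repeated 'year' headers should be numbered year1, year2, and so on:

def make_cols_entry(header_row): 
    # number the duplicated 'year' headers: year1, year2, ... 
    columns, extra_years = [], 0 
    for name in header_row: 
        if name == 'year' and 'year' in columns: 
            extra_years += 1 
            columns.append('year%d' % extra_years) 
        else: 
            columns.append(name) 
    # everything except the merge keys and the year placeholders is numeric 
    numeric = [c for c in columns 
               if c != 'PLAYER NAME' and not c.startswith('year')] 
    return {'columns': columns, 'numeric': numeric} 

# e.g., make_cols_entry(url_data[url]["rows_sp"][0]) reproduces the entries above 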

Then you can iterate over url_data again and store the results in a dfs collection:

dfs = {} 

for url in url_data: 
    page = url.split("/")[-1] 
    colnames = cols[page]["columns"] 
    num_cols = cols[page]["numeric"] 
    rows_sp = url_data[url]["rows_sp"] 
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0) 
    df.columns = colnames 
    df.drop(["year1","year2","year3","year4"], axis=1, inplace=True) 
    df = df.loc[df["PLAYER NAME"].notnull()] 
    df = df.loc[df.year != "year"] 
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators. 
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","") 
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","") 
    df[num_cols] = df[num_cols].apply(pd.to_numeric) 
    dfs[url] = df 

At this point we're ready to merge all the different data categories on year and PLAYER NAME. (You could actually merge iteratively inside the cleaning loop, but I've kept the steps separate here for demonstration purposes.)

master = pd.DataFrame() 
for url in dfs: 
    if master.empty: 
     master = dfs[url] 
    else: 
     master = master.merge(dfs[url], on=['year','PLAYER NAME']) 
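
Equivalently, and closer to the reduce() approach from the comments, you can fold the dict of frames together in one step. This is just a sketch, assuming every frame shares the year and PLAYER NAME keys:

from functools import reduce 

master = reduce(lambda left, right: left.merge(right, on=['year', 'PLAYER NAME']), 
                dfs.values()) 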

Now master contains the merged data for each player-year. Here's a view into the data, using groupby():

master.groupby(["PLAYER NAME", "year"]).first().head(4) 
        RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \ 
PLAYER NAME year              
Aam Hawin 2015    66    66.0  7  8.2 
      2016    80    80.0  12  8.1 
      2017    72    45.0  8  8.2 
Aam Scott 2013    45    45.0  10  8.2 

        RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \ 
PLAYER NAME year               
Aam Hawin 2015    136    136  95 -0.183 
      2016    122    122  93 -0.061 
      2017    56    52  84 0.296 
Aam Scott 2013    16    16  61 0.548 

        TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \ 
PLAYER NAME year             
Aam Hawin 2015  -14.805    81    86 
      2016  -5.285    87    39 
      2017  18.067    61    8 
Aam Scott 2013  24.125    44    57 

        RANK LAST WEEK ROUNDS_y  % # SAVES # BUNKERS \ 
PLAYER NAME year               
Aam Hawin 2015    86  95 50.96  80  157 
      2016    39  93 54.78  86  157 
      2017    6  84 61.90  91  147 
Aam Scott 2013    57  61 53.85  49   91 

        TOTAL O/U PAR 
PLAYER NAME year     
Aam Hawin 2015   47.0 
      2016   43.0 
      2017   27.0 
Aam Scott 2013   11.0 

You may want to do a bit more cleanup on the merged columns, since some data fields are duplicated across categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicated field names appear to carry exactly the same information, so you can just drop the _y version of each.
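
A cautious sketch of that cleanup: it only drops a _y column after verifying that it matches its _x partner, and then restores the unsuffixed name:

# drop a _y column only when it truly duplicates its _x partner 
for base in [c[:-2] for c in master.columns if c.endswith('_y')]: 
    if master[base + '_x'].equals(master[base + '_y']): 
        master = master.drop(base + '_y', axis=1) 
        master = master.rename(columns={base + '_x': base}) 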


Thank you, this is awesome. I'm not aggregating the data across years; I want to grab the data from all the other urls and add it to one master dataframe. –


You're welcome! Does this answer provide an adequate solution for your original question? If so, please consider marking it accepted by clicking the check mark to the left of the answer. If not, where are you stuck? –


Technically no, but it answered another question I had. Where I was stuck was building one big dataframe out of all the information in the data. I couldn't merge the data the way I wanted, which is for each url's data to go into its own df, then merge on name and year, so that each player's row contains all the information from every url in a single dataframe. –