2017-04-17

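Beautiful Soup can't find the first tag (XML)

I'm using BeautifulSoup 4 (with the lxml parser) to parse an XML file from the MLB API. The API generates a scoreboard for the games on a given date, and I can't get Beautiful Soup to recognize one particular tag.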

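For example, I'm looking at today's games and trying to extract a team's score and name based on its away_file_code or home_file_code. If we take the Baltimore Orioles vs. Toronto Blue Jays game, the scoreboard XML for that game looks like this: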

<games year="2017" month="04" day="16" modified_date="2017-04-17T01:42:57Z" next_day_date="2017-04-17"> 
<game id="2017/04/16/balmlb-tormlb-1" venue="Rogers Centre" game_pk="490271" time="1:07" time_date="2017/04/16 1:07" time_date_aw_lg="2017/04/16 1:07" time_date_hm_lg="2017/04/16 1:07" time_zone="ET" ampm="PM" first_pitch_et="" away_time="1:07" away_time_zone="ET" away_ampm="PM" home_time="1:07" home_time_zone="ET" home_ampm="PM" game_type="R" tiebreaker_sw="N" resume_date="" original_date="2017/04/16" time_zone_aw_lg="-4" time_zone_hm_lg="-4" time_aw_lg="1:07" aw_lg_ampm="PM" tz_aw_lg_gen="ET" time_hm_lg="1:07" hm_lg_ampm="PM" tz_hm_lg_gen="ET" venue_id="14" scheduled_innings="9" description="" away_name_abbrev="BAL" home_name_abbrev="TOR" away_code="bal" away_file_code="bal" away_team_id="110" away_team_city="Baltimore" away_team_name="Orioles" away_division="E" away_league_id="103" away_sport_code="mlb" home_code="tor" home_file_code="tor" home_team_id="141" home_team_city="Toronto" home_team_name="Blue Jays" home_division="E" home_league_id="103" home_sport_code="mlb" day="SUN" gameday_sw="P" double_header_sw="N" game_nbr="1" tbd_flag="N" away_games_back="-" home_games_back="6.5" away_games_back_wildcard="" home_games_back_wildcard="5.5" venue_w_chan_loc="CAXX0504" location="Toronto, Canada" gameday="2017_04_16_balmlb_tormlb_1" away_win="8" away_loss="3" home_win="2" home_loss="10" game_data_directory="/components/game/mlb/year_2017/month_04/day_16/gid_2017_04_16_balmlb_tormlb_1" league="AA"> 
<status status="Final" ind="F" reason="" inning="9" top_inning="N" b="0" s="0" o="3" inning_state="" note="" is_perfect_game="N" is_no_hitter="N"/> 
<linescore>...</linescore> 
<home_runs>...</home_runs> 
<winning_pitcher id="605164" last="Bundy" first="Dylan" name_display_roster="Bundy" number="37" era="1.86" wins="2" losses="1"/> 
<losing_pitcher id="457918" last="Happ" first="J.A." name_display_roster="Happ" number="33" era="4.50" wins="0" losses="3"/> 
<save_pitcher id="" last="" first="" number="" name_display_roster="" era="0" wins="0" losses="0" saves="0" svo="0"/> 
<links mlbtv="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'video'})" wrapup="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=wrap&c_id=mlb" home_audio="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'audio'})" away_audio="bam.media.launchPlayer({calendar_event_id:'14-490271-2017-04-16',media_type:'audio'})" home_preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" away_preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" preview="/mlb/gameday/index.jsp?gid=2017_04_16_balmlb_tormlb_1&mode=preview&c_id=mlb" tv_station="SNET-1"/> 
<broadcast>...</broadcast> 
<alerts text="Final score in Toronto: Baltimore 11, Toronto 4" brief_text="At TOR: Final - BAL 11, TOR 4" type="status"/> 
<game_media>...</game_media> 
<video_thumbnail>...</video_thumbnail> 
<video_thumbnails>...</video_thumbnails> 
</game> 
<game>...</game> (etc...) 

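Below is the code snippet I'm using to try to find the game (not games) tag and read its attributes. The problem is that when I ask for game, it returns None. However, I can retrieve any other tag without a problem; for example, status works perfectly.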

from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'xml') # webpage is the XML file for today's games
tags = soup.findAll('game', {'home_file_code': 'tor'}) # supposed to find the game tags whose home_file_code matches the home team's abbreviation
for games in tags:
    print(games.find('status')['status']) # works without an issue
    print(games.find('game')['home_file_code']) # throws the error below, because games.find('game') is None

TypeError: 'NoneType' object is not subscriptable

Also, if I print the list of children (print(list(games.children))), it returns everything except game.

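Is there something I'm missing about why it can't grab the first tag of the XML? I'm confused, because this worked for me not long ago, and I'm not sure what I changed to cause the error.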

Answer

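It seems I misunderstood the find function. find searches the descendants of a tag, and each item returned by findAll('game', ...) is already the game tag itself, so games.find('game') has nothing to match and returns None. To read an attribute of a tag you already have, index the tag by the attribute name. So, basically, what I should have done is the following: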

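from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'xml') # webpage is the XML file for today's games
tags = soup.findAll('game', {'home_file_code': 'tor'})
for games in tags:
    print(games.find('status')['status']) # status is a child of game, so find() locates it
    print(games['home_file_code']) # home_file_code is an attribute of the game tag itself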

Now print(games['home_file_code']) finds home_file_code as expected, because it is an attribute of the tag we have already looked up.
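
To make the difference concrete, here is a minimal, self-contained sketch. It assumes bs4 and lxml are installed, and it uses a heavily trimmed stand-in for the scoreboard XML (only a handful of the real attributes are kept). The variable names nested, home_code and status are mine, chosen for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the scoreboard XML; most attributes are omitted.
webpage = """
<games year="2017" month="04" day="16">
  <game id="2017/04/16/balmlb-tormlb-1" away_file_code="bal" home_file_code="tor"
        away_team_name="Orioles" home_team_name="Blue Jays">
    <status status="Final" ind="F" inning="9"/>
  </game>
</games>
""".strip()

soup = BeautifulSoup(webpage, "xml")  # the "xml" feature relies on lxml
game = soup.find("game", {"home_file_code": "tor"})

# find() only searches descendants, and <game> is not nested inside itself.
nested = game.find("game")
print(nested)  # None

# Attributes of the matched tag are read by indexing the tag directly.
home_code = game["home_file_code"]

# <status> is a child of <game>, so find() does locate it.
status = game.find("status")["status"]
print(home_code, status)
```

The same reasoning would explain why list(games.children) shows status, linescore and so on but never game: the loop variable is the game tag, and a tag is not one of its own children.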

我確定有人可以給出更徹底的答案,但這是我所遇到的根本性誤解。
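
Since the original goal was to look a team up by either away_file_code or home_file_code, one possible approach is to pass a filter function to find_all. This is only a sketch under the same assumptions (bs4 and lxml installed). The helper name games_for_team is hypothetical, and the second game in the sample, with codes "aaa" and "bbb", is a made-up placeholder rather than real scoreboard data.

```python
from bs4 import BeautifulSoup

# Trimmed sample: the first <game> uses values from the question;
# the second is a placeholder with invented codes, for illustration only.
webpage = """
<games year="2017" month="04" day="16">
  <game away_file_code="bal" home_file_code="tor"
        away_team_name="Orioles" home_team_name="Blue Jays"/>
  <game away_file_code="aaa" home_file_code="bbb"
        away_team_name="Placeholder A" home_team_name="Placeholder B"/>
</games>
""".strip()


def games_for_team(soup, file_code):
    """Return every <game> tag in which the team plays, home or away."""
    def involves_team(tag):
        # tag.get() returns None when the attribute is missing.
        return tag.name == "game" and file_code in (
            tag.get("home_file_code"),
            tag.get("away_file_code"),
        )
    return soup.find_all(involves_team)


soup = BeautifulSoup(webpage, "xml")
matches = games_for_team(soup, "bal")
for game in matches:
    # Team names are attributes of <game>, so index the tag directly.
    print(game["away_team_name"], "at", game["home_team_name"])
```

Using tag.get() instead of indexing inside the filter avoids a KeyError on tags that lack one of the attributes.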