2016-09-16 49 views
0

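I am trying to scrape a div from the MTA info page. When I fetch the HTML and parse it with BeautifulSoup, some of the data seems to be missing. Specifically, BeautifulSoup's output is missing a class.

Here is my code so far: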

from bs4 import BeautifulSoup 
import urllib # access the web 

# SUBWAY STATUS PROJECT 
userURL = "http://www.mta.info" # MTA SITE 

htmlfile = urllib.urlopen(userURL) #creates html file 
htmldoc = htmlfile.read() #creates html text 

soup = BeautifulSoup(htmldoc, 'html.parser')  

subChart = soup.find(id = 'subwayDiv') 

print subChart 

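I am using print only to check that I am getting all of the data. I found that some of the information I am trying to scrape is missing. If I inspect the page myself in the browser, I can see that my output lacks a class that displays the subway status.

I am very new to programming, so please excuse my ignorance.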

+0

Those elements are created by ajax rather than being part of the ordinary static HTML, so try another approach. – kiviak

Answers

0

In the subChart variable, look for the elements with the class subwayCategory and store the value of their id attribute. For example, from this data:

<div style="float: left; width: 220px; border-bottom: 1px solid #7B7B98; padding: 4px 0;"> 
<div class="span-11"><img alt="1 2 3 Subway" class="subwayIcon_123" src="http://www.mta.info/sites/all/modules/custom/servicestatus/images/img_trans.gif"/></div> 
<div class="subwayCategory" id="123" style="margin-top: 4px;"></div> 

The div with the class subwayCategory has the id 123. Now make a request to http://www.mta.info/status/subway/{ID}, replacing {ID} with the id number you want.
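The lookup described in this answer could be sketched roughly as follows. This is only an illustration written for Python 3: the markup is a trimmed copy of the excerpt above (with a closing tag added so it parses cleanly), the names line_ids and status_urls are made up for the example, and the URL pattern is taken from the answer as-is without being verified.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the excerpt above; the final </div> was added for the example.
html = """
<div style="float: left; width: 220px;">
<div class="span-11"><img alt="1 2 3 Subway" class="subwayIcon_123" src="img_trans.gif"/></div>
<div class="subwayCategory" id="123" style="margin-top: 4px;"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the id attribute of every div that carries the subwayCategory class.
line_ids = [
    div["id"]
    for div in soup.find_all("div", class_="subwayCategory")
    if div.has_attr("id")
]

# URL pattern as given in this answer (an assumption, not checked against the site).
status_urls = ["http://www.mta.info/status/subway/{}".format(i) for i in line_ids]
print(status_urls)
```

Note that the subwayCategory div itself is empty in the static HTML, which matches what the question observed: only the id is available without a further request.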

+0

This does not work. Try it in a browser or in code. –

0

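The data is retrieved through an ajax request, and you can get it in JSON format. You need to pass a timestamp, which you can get with time.time(), and then simply parse the response with the json library: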

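from time import time 
from json import load, loads 
from pprint import pprint as pp 
import urllib 

# The current Unix timestamp is passed as the last path segment. 
url = "http://www.mta.info/service_status_json/{}".format(int(time())) 

# load() returns a string here that itself holds JSON, so loads() decodes it again. 
json_dict = loads(load(urllib.urlopen(url))) 

pp(json_dict) 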

I won't include all of the output because there is far too much of it, but using "BT" we get:

{u'line': [{u'Date': {}, 
      u'Time': {}, 
      u'name': u'Bronx-Whitestone', 
      u'status': u'GOOD SERVICE', 
      u'text': {}}, 
      {u'Date': {}, 
      u'Time': {}, 
      u'name': u'Cross Bay', 
      u'status': u'GOOD SERVICE', 
      u'text': {}}, 
      {u'Date': {}, 
      u'Time': {}, 
      u'name': u'Henry Hudson', 
      u'status': u'GOOD SERVICE', 
      u'text': {}}, 
      {u'Date': u'09/16/2016', 
      u'Time': u' 5:57AM', 
      u'name': u'Hugh L. Carey', 
      u'status': u'SERVICE CHANGE', 
      u'text': u"     <span class='TitleServiceChange' >Service Change</span>     <span class='DateStyle'>     &nbsp;Posted:&nbsp;09/16/2016&nbsp; 5:57AM     </span><br/><br/>     HLC - HOV Lane Open 6 AM to 10 AM. Two-Way Operations in effect. Three (3) lanes Manhattan-bound. One (1) lane Brooklyn-bound.    <br/><br/>    "}, 
      {u'Date': {}, 
      u'Time': {}, 
      u'name': u'Marine Parkway', 
      u'status': u'GOOD SERVICE', 
      u'text': {}}, 
      {u'Date': u'09/16/2016', 
      u'Time': u' 5:57AM', 
      u'name': u'Queens Midtown', 
      u'status': u'SERVICE CHANGE', 
      u'text': u"     <span class='TitleServiceChange' >Service Change</span>     <span class='DateStyle'>     &nbsp;Posted:&nbsp;09/16/2016&nbsp; 5:57AM     </span><br/><br/>     QMT - HOV Lane Open 6 AM to 10 AM. Two-Way Operation in effect. Three (3) lanes Manhattan bound. One (1) lane Queens bound.    <br/><br/>         <span class='TitlePlannedWork' >Planned Work</span>     <br/>     <P style='MARGIN: 0in 0in 0pt'><SPAN style=''Times New Roman';2016; Queens-Midtown Tunnel downtown exit; One lane closed. Use 37<SUP>th</SUP></FONT><FONT size=3> St tunnel exit for access to 2</FONT><SUP><FONT size=3>nd</FONT></SUP><FONT size=3> Ave. Motorists should allow extra time and may wish to use an alternate route if possible' Drivers should expect delays and plan accordingly. Motorists can sign up for MTA e-mail or text alerts at </FONT><SPAN style='COLOR: blue'><A href='http://www.mta.info/'><SPAN style='COLOR: #0563c1'><FONT size=3>www.mta.info</FONT></SPAN></A><FONT size=3> </FONT></SPAN><FONT size=3>and check the Bridges and Tunnels homepage or Facebook page for the latest information on this planned work.</FONT></FONT></SPAN></P>    <br/><br/>         <span class='TitlePlannedWork' >Planned Work</span>     <br/>     QMT- MANHATTAN PLAZA WORK REQUIRES CLOSURE OF 'CROSSTOWN' LANES FOR 2 MONTHS. CUSTOMERS SEEKING A CROSSTOWN MANHATTAN ROUTE USE THE UPTOWN LANES; EXPECT DELAYS.    <br/><br/>    "}, 
      {u'Date': u'08/15/2016', 
      u'Time': u' 3:56PM', 
      u'name': u'Robert F. Kennedy', 
      u'status': u'PLANNED WORK', 
      u'text': u"     <span class='TitlePlannedWork' >Planned Work</span>     <br/>     <P style='MARGIN: 0in 0in 0pt'><SPAN style='COLOR: #1f497d'><FONT size=3 face=Calibri>Starting Monday, August 15, 2016 and through early 2018, one lane will be closed on the Queens-to-Manhattan ramp at the Robert F. Kennedy Bridge for roadway rehabilitation. In addition, overnight on Thursday, August 18 and Friday, August 19, there will be a series of intermittent FULL ramp closures, lasting 15-20 minutes each.</FONT></SPAN></P>    <br/><br/>    "}, 
      {u'Date': {}, 
      u'Time': {}, 
      u'name': u'Throgs Neck', 
      u'status': u'GOOD SERVICE', 
      u'text': {}}, 
      {u'Date': u'09/16/2016', 
      u'Time': u' 5:28AM', 
      u'name': u'Verrazano-Narrows', 
      u'status': u'PLANNED WORK', 
      u'text': u"     <span class='TitlePlannedWork' >Planned Work</span>     <br/>     VNB: PLANNED WORK; S. I. BOUND LOWER LEVEL - ONE LANE CLOSED; EXPECT DELAYS.    <br/><br/>    "}]} 

So you just need to walk through the dictionary and pick out what you want.
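Picking values out of that structure might look something like the sketch below, written for Python 3. The sample data is a hand-trimmed copy of three entries from the output above (the text field is left out), and the names bt and disrupted are hypothetical, chosen only for this example.

```python
# Hand-trimmed sample shaped like the "BT" output above (text field omitted).
bt = {
    "line": [
        {"name": "Bronx-Whitestone", "status": "GOOD SERVICE", "Date": {}, "Time": {}},
        {"name": "Hugh L. Carey", "status": "SERVICE CHANGE",
         "Date": "09/16/2016", "Time": " 5:57AM"},
        {"name": "Verrazano-Narrows", "status": "PLANNED WORK",
         "Date": "09/16/2016", "Time": " 5:28AM"},
    ]
}


def disrupted(lines):
    """Return (name, status, posted) for every entry that is not in good service."""
    result = []
    for entry in lines:
        if entry["status"] == "GOOD SERVICE":
            continue
        # Date and Time are empty dicts when nothing is posted, so keep only real strings.
        parts = [entry[k].strip() for k in ("Date", "Time") if isinstance(entry[k], str)]
        result.append((entry["name"], entry["status"], " ".join(parts)))
    return result


for name, status, posted in disrupted(bt["line"]):
    print("{}: {} (posted {})".format(name, status, posted))
```

The isinstance check is there because, in the output shown, Date and Time come back as empty dictionaries rather than empty strings when a line has no posting.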

+0

Thanks, I'll give this a try when I get home! –