2016-08-01 68 views
-1

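Trying to scrape data from HTML code using Python

I've been building a web scraper for a website, and I want to extract all the node numbers from an HTML table using .findall, or whatever else works. I'm struggling to get it: I keep getting errors, so I'm clearly not targeting the right tags.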

Can anyone help? The HTML code looks like this:

</div> 

<table class="dataTable" cellpadding="5" cellspacing="0" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dgNodes" style="border-collapse:collapse;"> 
    <tr class="header noBreak"> 
     <td>&nbsp;</td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl00','')">Node Name</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl01','')">Description</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl02','')">MAC Address</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl03','')"></a> 
       <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$liNodeRoleHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_liNodeRoleHeader">Node Role</a> 
      </td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl04','')">Firmware</a></td><td> 
       <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$lbUptimeHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_lbUptimeHeader">Uptime</a> 
      </td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl05','')">Users</a></td> 
    </tr><tr onmouseover="this.className = 'highlightedRow';" onmouseout="this.className = 'normalRow';" onclick="GoToNodePage('522');" style="height:18px;"> 

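I need to extract the number 522 from the last line of the code above, along with the numbers from all the other GoToNodePage calls, but I can't figure it out. Any help is appreciated. I'd also like to put the extracted numbers into a list for later use.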

r2 = s2.get(webpage) 
bsobjswap = BeautifulSoup(r2.content) 

listy = [] 
for link in bsobjswap.findall('tr'): 
    if 'onclick' in link.attrs: 
     listy.append(link) 
print (listy) 

Error: on the line for link in bsobjswap.findall('tr'): I get TypeError: 'NoneType' object is not callable

+0

Your code? The error message? – polku

+0

webpage = "mycompanywebsite.com" r2 = s2.get(webpage) bsobjswap = BeautifulSoup(r2.content) listy = [] for link in bsobjswap.findall('tr'): if 'onclick' in link.attrs: listy.append(link) print(listy) – ipmev12

+0

The error is the standard TypeError: 'NoneType' object is not callable. Obviously the code is wrong, and I'm not getting any data back – ipmev12

Answers

-1

Try something like this:

import re  # needed for the regular-expression step further down
from bs4 import BeautifulSoup

xml = """<table class="dataTable" cellpadding="5" cellspacing="0" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dgNodes" style="border-collapse:collapse;"> 
    <tr class="header noBreak"> 
     <td>&nbsp;</td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl00','')">Node Name</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl01','')">Description</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl02','')">MAC Address</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl03','')"></a> 
       <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$liNodeRoleHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_liNodeRoleHeader">Node Role</a> 
      </td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl04','')">Firmware</a></td><td> 
       <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$lbUptimeHeader','')" id="ctl00_ContentPlaceHolder1_dgNodes_ctl00_lbUptimeHeader">Uptime</a> 
      </td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgNodes$ctl00$ctl05','')">Users</a></td> 
    </tr><tr onmouseover="this.className = 'highlightedRow';" onmouseout="this.className = 'normalRow';" onclick="GoToNodePage('522');" style="height:18px;">""" 

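# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(xml, "html.parser")
# Keep only the <tr> tags that carry an onclick attribute and print its value
print([i.get('onclick') for i in soup.findAll('tr', attrs={'onclick':True})])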

This returns ["GoToNodePage('522');"]
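For context on why the original call failed: Beautiful Soup has no method named findall (all lowercase). As far as I know, it treats an unknown attribute name as a tag name, so bsobjswap.findall is shorthand for bsobjswap.find('findall'), which finds no such tag and returns None. Calling that None is what raises the TypeError. A minimal sketch, assuming a cut-down table with a second, made-up row (523) that is not in the question:

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the question's table; the 523 row is hypothetical.
html = """<table>
<tr class="header"><td>Node Name</td></tr>
<tr onclick="GoToNodePage('522');"><td>a</td></tr>
<tr onclick="GoToNodePage('523');"><td>b</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# There is no <findall> tag in the document, so this attribute lookup gives None.
missing = soup.findall
print(missing)

# Calling None reproduces the error from the question.
try:
    soup.findall("tr")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# find_all (or its older alias findAll, used in this answer) is the real method.
rows = soup.find_all("tr", attrs={"onclick": True})
onclick_values = [row["onclick"] for row in rows]
print(onclick_values)
```

The header row has no onclick attribute, so the attrs filter skips it without an explicit check.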

From here you can extract the number with a regular expression, for example:

# Raw string so that \d reaches the regex engine unchanged (requires: import re)
print([re.findall(r"\d+", i.get('onclick')) for i in soup.findAll('tr', attrs={'onclick':True})])

This returns [['522']]

+0

Thanks Gábor Erdős, that works perfectly. One question: how can I get the results into a list? I seem to be struggling with that – ipmev12

+0

Never mind, I figured it out – ipmev12

+0

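Actually, the result is a nested list. You get one list from 'BeautifulSoup''s 'findAll' and another from 're''s 'findall', so the result is a list inside a list. If each 'tr' tag's 'onclick' attribute can only contain one number, you can switch 're''s 'findall' to a single-match search ('re.search') and you end up with one flat list. It depends on the problem (the whole xml) –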
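The flat-list idea from the comment above could look roughly like this. It is only a sketch: the table is a cut-down stand-in for the question's HTML, the 523 row is made up for illustration, and it assumes every onclick value has the form GoToNodePage('<digits>').

```python
import re
from bs4 import BeautifulSoup

# Cut-down stand-in for the question's table; the 523 row is hypothetical.
html = """<table>
<tr class="header"><td>Node Name</td></tr>
<tr onclick="GoToNodePage('522');"><td>a</td></tr>
<tr onclick="GoToNodePage('523');"><td>b</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

node_ids = []
for row in soup.find_all("tr", attrs={"onclick": True}):
    # re.search returns the first match only, so each row adds one string
    # instead of a nested list.
    match = re.search(r"GoToNodePage\('(\d+)'\)", row["onclick"])
    if match:
        node_ids.append(match.group(1))

print(node_ids)
```

The numbers come back as strings; wrap match.group(1) in int() if you need integers later on.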