Python: scraping table elements

I want to extract all the elements of the Team Misc table from this web page (http://www.basketball-reference.com/teams/CHO/2017.html).

I want to extract all the values from the "Team" row (this row: 17 13 2.17 -0.51 1.66 106.9 104.7 96.5 .300 .319 .493 10.9 20.5 .228 .501 11.6 79.6 .148 Spectrum Center 269,471).

import urllib2 
from bs4 import BeautifulSoup 

htmla = urllib2.urlopen('http://www.basketball-reference.com/teams/CHO/2017.html') 
bsObja=BeautifulSoup(htmla,"html.parser") 
tables = bsObja.find_all("table")

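I tried the code above, hoping to get a list of all the tables and then pick the right one. But no matter what I try, I only get 1 table from this page.

Any ideas for another approach?

Source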

2016-12-26 Sogard N

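Please include the image directly in your question rather than as a link that may break at any time. – ettanany

This page contains all the data in its HTML, but it is hidden inside comments and displayed using JavaScript. You can use 'BeautifulSoup' to find the comment, strip '<!--' and '-->', and feed the result back into 'BeautifulSoup' to get the data. I think this was solved in some earlier questions. – furas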

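This page has all of its tables hidden inside comments, and JavaScript is used to display them, perhaps to sort or filter them before display.

Every comment comes right after a <div class='placeholder'>, so you can use that div to find the comment, take all the text from the comment, and parse it with BS.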

#!/usr/bin/env python3 

#import urllib.request 
import requests 
from bs4 import BeautifulSoup as BS 

url = 'http://www.basketball-reference.com/teams/CHO/2017.html' 

#html = urllib.request.urlopen(url) 
html = requests.get(url).text 

soup = BS(html, 'html.parser') 

placeholders = soup.find_all('div', {'class': 'placeholder'}) 

total_tables = 0 

for x in placeholders: 
    # get elements after placeholder and join in one string 
    comment = ''.join(x.next_siblings) 

    # parse comment 
    soup_comment = BS(comment, 'html.parser') 

    # search table in comment 
    tables = soup_comment.find_all('table') 

    # ... do something with table ... 

    #print(tables) 

    total_tables += len(tables) 

print('total tables:', total_tables)

This way I found 11 tables hidden in comments.
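To go from "all hidden tables" to the Team Misc row, something like the following might work. It is only a sketch run against a small inline HTML sample rather than the live page: the sample markup, the table id `team_misc`, and the helper name `tables_after_placeholders` are assumptions made for illustration, so check the real page source before relying on them.

```python
from bs4 import BeautifulSoup, Comment

# Tiny stand-in for the real page (assumed structure, not copied from the site).
SAMPLE_HTML = """
<div id="all_team_misc">
  <div class="placeholder"></div>
  <!--
  <div id="div_team_misc">
    <table id="team_misc">
      <tr><th>Team</th><td>17</td><td>13</td><td>Spectrum Center</td></tr>
    </table>
  </div>
  -->
</div>
"""


def tables_after_placeholders(html):
    """Return every <table> found in comments that follow a placeholder div."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for placeholder in soup.find_all("div", class_="placeholder"):
        for sibling in placeholder.next_siblings:
            # Only comment nodes carry the hidden markup; skip whitespace and tags.
            if isinstance(sibling, Comment):
                inner = BeautifulSoup(str(sibling), "html.parser")
                found.extend(inner.find_all("table"))
    return found


tables = tables_after_placeholders(SAMPLE_HTML)

# Pick the table by its (assumed) id and read the data cells of its row.
team_misc = next(t for t in tables if t.get("id") == "team_misc")
values = [td.get_text() for td in team_misc.find_all("td")]
print(values)
```

Filtering the siblings with `isinstance(sibling, Comment)` also avoids trouble if a tag ever appears after the placeholder, since tags cannot be joined as strings.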

Source

2016-12-27 00:14:16 furas

I think you want

tables = bsObja.findAll("table")

Source

2016-12-26 20:46:40 HenryM

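But I still only get one table :( –

I just took a look at the page. That is because the tables are loaded by JavaScript, so you need to use Selenium – HenryM

OK, I will look at how it works. Thanks for the suggestion. –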

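The data is in a BS Comment object, and a Comment object is just a special type of NavigableString. What you need to do is:

Find the string that contains the information
Use BeautifulSoup to convert the string into a BS object
Extract the data from the BS object

Code: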

import re 
table_string = soup.find(string=re.compile('div_team_misc'))

This returns a string containing the table's HTML code.

table = BeautifulSoup(table_string, 'lxml')

Build a BS object from the string (as above), then extract the data from that object:

for tr in table.find_all('tr', class_=False): 
    s = [td.string for td in tr('td')] 
    print(s)

Output:

['17', '13', '2.17', '-0.51', '1.66', '106.9', '104.7', '96.5', '.300', '.319', '.493', '10.9', '20.5', '.228', '.501', '11.6', '79.6', '.148', 'Spectrum Center', '269,471'] 
['10', '9', '8', '24', '10', '17', '5', '15', '4', '11', '22', '1', '27', '5', '12', '28', '3', '1', None, '15']
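The `None` in the second row is what `td.string` gives for a cell with no text in it. If an empty string is easier to work with, `get_text()` is one alternative. The snippet below is a small sketch on a made-up row, not the real table:

```python
from bs4 import BeautifulSoup

# Made-up row with one empty cell, mimicking the rank row above.
row_html = "<table><tr><td>3</td><td>1</td><td></td><td>15</td></tr></table>"
cells = BeautifulSoup(row_html, "html.parser").find_all("td")

# .string is None when a tag does not have exactly one child.
with_string = [td.string for td in cells]

# get_text() always returns a str, empty for an empty cell.
with_text = [td.get_text() for td in cells]

print(with_string)
print(with_text)
```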

More about Comment:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>" 
soup = BeautifulSoup(markup) 
comment = soup.b.string

A Comment object is just a special type of NavigableString. BS extracts the string from it for us, so we do not need to change or replace any HTML.

comment 
# 'Hey, buddy. Want to buy a used parser?'

Based on this, we can extract the comment with pure BS instead of re:

table_string = soup.find(id="all_team_misc").contents[-2]

If you want to find all the table strings, you can do this:

from bs4 import Comment 
tables = soup.find_all(string=lambda text: isinstance(text, Comment) and str(text).startswith(' \n'))
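The `startswith(' \n')` test depends on how this particular page formats its comments. A variation that may be less fragile is to keep only comments that contain `<table` and index the parsed tables by id. This is a sketch on an inline sample; the ids `team_misc` and `example_table` and the cell values are placeholders, not taken from the site.

```python
from bs4 import BeautifulSoup, Comment

# Inline sample: two comments hold tables, one is an ordinary note.
SAMPLE_HTML = """
<div id="all_team_misc"><!--
<table id="team_misc"><tr><td>17</td></tr></table>
--></div>
<div><!-- just a note, no table here --></div>
<div id="all_example"><!--
<table id="example_table"><tr><td>1</td></tr></table>
--></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only Comment nodes whose text contains table markup.
comments = soup.find_all(
    string=lambda text: isinstance(text, Comment) and "<table" in text
)

# Parse each comment separately and index its tables by id.
tables_by_id = {}
for comment in comments:
    parsed = BeautifulSoup(str(comment), "html.parser")
    for table in parsed.find_all("table"):
        tables_by_id[table.get("id")] = table

print(sorted(tables_by_id))
```

With the tables in a dict, the one you want can be looked up by id instead of by its position in a list.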

Source

2016-12-27 03:37:45
