Extracting data from HTML with a script

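I am writing a script that scans a set of links. On each link, the script searches a table for a row. Once the row is found, it adds the rank to the variable total_rank, which is the sum of the ranks found across all web pages. The rank equals the row number.

The code looks like this, and it outputs zero: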

import requests 
from bs4 import BeautifulSoup 
import time 

url_to_scrape = 'https://www.teamrankings.com/ncb/stats/' 
r = requests.get(url_to_scrape) 
soup = BeautifulSoup(r.text, "html.parser") 

stat_links = [] 

for a in soup.select(".chooser-list ul"): 
    list_entry = a.findAll('li') 
    relative_link = list_entry[0].find('a')['href'] 
    link = "https://www.teamrankings.com" + relative_link 
    stat_links.append(link) 

total_rank = 0 

for link in stat_links: 
    r = requests.get(link) 
    soup = BeautifulSoup(r.text, "html.parser") 

    team_rows = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer table") 

    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank

    # time.sleep(1) 

print total_rank

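While debugging, I found that team_rows is empty after the select() call. I have also tried different selectors, for example soup.select(".scroll-wrapper div") and soup.select("#DataTables_Table_0_wrapper div"). All of them return nothing.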

2016-01-07 kendall weihe

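I don't think 'string = str(a)' is what you want. It returns the text representation of an element. – mic4ael

@mic4ael Am I wrong that .get needs a string as input? Or what are you referring to? –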

The selector

".tr-table datatable scrollable dataTable no-footer tr"

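selects a <tr> element anywhere under a <no-footer> element, anywhere under a <dataTable> element, and so on.

I think "datatable scrollable dataTable no-footer" are really classes on your .tr-table element? In that case, they should be joined to the first class with periods. So I believe the correct selector is: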

".tr-table.datatable.scrollable.dataTable.no-footer tr"
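To illustrate the difference between spaces and periods in a selector, here is a minimal sketch. The HTML string is a made-up stand-in modeled on the table tag quoted in this thread, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only (not fetched from the real site).
html = """
<table class="tr-table datatable scrollable dataTable no-footer">
  <tr><td>1</td><td>Oklahoma</td></tr>
  <tr><td>2</td><td>Kansas</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Spaces mean "descendant of": this looks for <datatable>, <scrollable>,
# <dataTable> and <no-footer> tags nested inside each other, which do not exist.
spaced = soup.select(".tr-table datatable scrollable dataTable no-footer tr")

# Periods chain several classes on ONE element; " tr" then descends into its rows.
chained = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer tr")

print(len(spaced), len(chained))
```

With this sample markup, only the chained form finds the rows.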

Update: the new selector looks like this:

".tr-table.datatable.scrollable.dataTable.no-footer table"

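The problem here is that the first part, .tr-table.datatable..., refers to the table itself, so the trailing "table" asks for another table nested inside it. Assuming you are trying to get the rows of this table: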

<table class="tr-table datatable scrollable dataTable no-footer" id="DataTables_Table_0" role="grid">

The appropriate selector is still the one I originally suggested.
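A small sketch of why ending the selector in "table" comes back empty while ending it in "tr" does not. It assumes the parsed HTML really contains the table tag quoted above; the single row is invented sample data.

```python
from bs4 import BeautifulSoup

# The <table> tag is copied from this answer; the row content is hypothetical.
html = """
<table class="tr-table datatable scrollable dataTable no-footer" id="DataTables_Table_0" role="grid">
  <tr><td>1</td><td>Oklahoma</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

classes = ".tr-table.datatable.scrollable.dataTable.no-footer"

# "<classes> table" asks for a <table> nested INSIDE the element carrying those
# classes. That element is the table itself, so nothing matches.
inner_tables = soup.select(classes + " table")

# "<classes> tr" asks for the rows inside that table.
rows = soup.select(classes + " tr")

print(len(inner_tables), len(rows))
```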

2016-01-07 21:33:21 audiodude

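I think you are right, but that doesn't solve the underlying problem –

Please check my post. I have updated the code and found a new place where I think the error is –

I have updated my answer, PTAL – audiodude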

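Although the suggested selector did not work for me, @audiodude's answer is correct.

You don't need to check every class of the table element. Here is a working selector: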

team_rows = soup.select("table.datatable tr")
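One possible explanation for why the longer selector fails, which is an assumption and not something confirmed in this thread: if some of the classes visible in the browser's inspector are added by JavaScript after the page loads, they will be missing from the HTML that requests downloads. The sketch below models that situation with made-up markup that lacks "dataTable" and "no-footer".

```python
from bs4 import BeautifulSoup

# Hypothetical downloaded markup: "dataTable" and "no-footer" are deliberately
# left out to model classes that only appear once scripts run in a browser.
html = """
<table class="tr-table datatable scrollable">
  <tr><th>Rank</th><th>Team</th></tr>
  <tr><td>1</td><td>Oklahoma</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Requires all five classes, so it finds nothing in this markup.
strict = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer tr")

# Requires only one class on the <table>, so it still finds the rows.
loose = soup.select("table.datatable tr")

print(len(strict), len(loose))
```

Matching on a single stable class is less fragile than listing every class the element happens to carry.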

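Also, if you need to find Oklahoma inside the table, you don't have to iterate over every row and cell. Just search for the specific cell directly and get the previous cell, which contains the rank: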

rank = soup.find("td", {"data-sort": "Oklahoma"}).find_previous_sibling("td").get_text() 
total_rank += int(rank) # it is important to convert the row number to int
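Here is a self-contained sketch of the direct lookup on made-up sample markup. The data-sort attribute follows the snippet above; the rows themselves and the None check are my additions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical rows; only the data-sort attribute is taken from the answer.
html = """
<table class="datatable">
  <tr><td>1</td><td data-sort="Kansas"><a href="#">Kansas</a></td></tr>
  <tr><td>2</td><td data-sort="Oklahoma"><a href="#">Oklahoma</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

total_rank = 0

# Jump straight to the team's cell instead of looping over every row.
cell = soup.find("td", {"data-sort": "Oklahoma"})
if cell is not None:  # the team may be absent from some pages
    # The rank sits in the <td> just before the team cell.
    rank_text = cell.find_previous_sibling("td").get_text(strip=True)
    # get_text() returns a string, so convert before adding;
    # int + str would raise a TypeError.
    total_rank += int(rank_text)

print(total_rank)
```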

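Also note that you are extracting more stat links than you should. It looks like player stat links are included, which you should not follow since you are focusing on team stats. Here is one way to get the Team Stats links: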

links_list = soup.find("h2", text="Team Stats").find_next_sibling("ul") 
stat_links = ["https://www.teamrankings.com" + a["href"] 
       for a in links_list.select("ul.expand-content li a[href]")]
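The same idea on a made-up page fragment, as a rough sketch. The heading text and the expand-content class come from the snippet above; the surrounding structure and the link paths are hypothetical. It uses string= in find(), which is the current name in Beautiful Soup for the older text= argument.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the real page layout and link paths may differ.
html = """
<h2>Player Stats</h2>
<ul><li><ul class="expand-content"><li><a href="/ncb/player-stat/points">Points</a></li></ul></li></ul>
<h2>Team Stats</h2>
<ul>
  <li>Scoring
    <ul class="expand-content">
      <li><a href="/ncb/stat/points-per-game">Points per Game</a></li>
      <li><a href="/ncb/stat/average-scoring-margin">Average Scoring Margin</a></li>
    </ul>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Find the "Team Stats" heading, then the <ul> that follows it.
links_list = soup.find("h2", string="Team Stats").find_next_sibling("ul")

# Collect only the links under that list, skipping the Player Stats section.
stat_links = ["https://www.teamrankings.com" + a["href"]
              for a in links_list.select("ul.expand-content li a[href]")]

for link in stat_links:
    print(link)
```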

2016-01-08 01:35:59 alecxe
