2017-08-10 109 views
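Web scraping | Beautiful Soup | parsing a table

There are some good threads on this (a few of which helped me get this far), but I can't seem to figure out why my program isn't working.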

Problem: the program runs, but it seems to return only the first row when it should loop over all of the table rows.

I'm using Python 3.5.

import requests 
from bs4 import BeautifulSoup 
import pandas as pd 

url = "http://www.the-numbers.com/movies/year/2006" 

r = requests.get(url) 
soup = BeautifulSoup(r.content) 

data = [] 

for table_row in soup.select("table"): 
    cells = table_row.find_all(['td']) 
    release_date = cells[0].text.strip() 
    movie_name = cells[2].text.strip() 
    genre_name = cells[3].text.strip() 
    production_budget = cells[4].text.strip() 
    box_office = cells[5].text.strip() 
    movie = {"Release_Date": release_date,
             "Movie_Name": movie_name,
             "Genre": genre_name,
             "Production_Budget": production_budget,
             "Box_Office": box_office}
    data.append(movie)
    print(release_date, movie_name, genre_name, production_budget, box_office)

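This returns "January, 2006 BloodRayne Action $25,000,000 $2,405,420", which is correct, but I need all the other rows in the table as well.

If this turns out to be easy to fix, putting the data into a Pandas DataFrame would be the next step (but that isn't required in a response).

Any help would be greatly appreciated.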

Answers

You can use read_html together with some data cleaning:

df = pd.read_html('http://www.the-numbers.com/movies/year/2006', header=0)[0] 
df = df.dropna(how='all') 
df['Release Date'] = df['Release Date'].ffill() 
print (df.head()) 
    Release Date          Movie   Genre  ProductionBudget  \
0  January, 2006            NaN     NaN               NaN
1      January 6     BloodRayne  Action       $25,000,000
2      January 6       Fateless   Drama       $12,000,000
3      January 6  Grandma's Boy  Comedy        $5,000,000
4      January 6         Hostel  Horror        $4,800,000

   DomesticBox Officeto Date  Trailer
0                        NaN      NaN
1                 $2,405,420      NaN
2                   $196,857      NaN
3                 $6,090,172      NaN
4                $47,326,473      NaN
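The two cleaning steps above can be tried without hitting the site. The following is a minimal sketch on a hand-built frame; the rows are made up for illustration (only the column names and a few film titles echo the output above), and "Example Film" and "January 13" are hypothetical values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped table: one fully empty separator row,
# and a date that is only filled in on the first film of each day.
df = pd.DataFrame(
    {
        "Release Date": ["January 6", np.nan, np.nan, np.nan, "January 13"],
        "Movie": ["BloodRayne", "Fateless", np.nan, "Hostel", "Example Film"],
    }
)

# Drop only the rows in which every column is NaN (the separator row).
df = df.dropna(how="all")

# Carry the last seen date forward into the rows that left it blank.
df["Release Date"] = df["Release Date"].ffill()

print(df)
```

Note that dropna(how='all') keeps a row as long as at least one cell is filled, which is why the "January, 2006" month header survives in the real output.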

A solution based on your original code:

data = []
# take the first table on the page
tab = soup.select("table")[0]
# find all tr elements (one per table row)
rows = tab.find_all(['tr'])
# loop over the rows and collect the td cells of each one
for row in rows:
    cols = row.find_all('td')
    # keep only the stripped text of each cell
    cols = [ele.text.strip() for ele in cols]
    # [:-1] drops the last column
    data.append(cols[:-1])

cols = ['Release_Date','Movie_Name','Genre','Production_Budget','DomesticBox'] 
# [2:] skips the first 2 rows
df = pd.DataFrame(data[2:], columns = cols) 
print (df.head()) 
    Release_Date  Movie_Name Genre Production_Budget DomesticBox 
0 January 6  BloodRayne Action  $25,000,000 $2,405,420 
1     Fateless Drama  $12,000,000  $196,857 
2    Grandma's Boy Comedy  $5,000,000 $6,090,172 
3      Hostel Horror  $4,800,000 $47,326,473 
4    Kill the Poor          $0 
This is perfect, exactly what I wanted. Thank you very much. Out of curiosity, do you know why my original code returned only the first row? – AdrianC

I think you need to loop over the td elements rather than over the table, because there is only one table. – jezrael

That's perfect, thank you for helping me. It is much appreciated and it works perfectly. Problem solved :) – AdrianC
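To make the point from the comments concrete, here is a small offline sketch. The HTML is a made-up miniature of the page's table (an assumption, not the site's real markup), parsed with the built-in "html.parser". It shows that selecting "table" yields a single element, so a loop over it runs once, and that find_all("td") on the whole table flattens every cell into one list.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table, invented for illustration only.
HTML = """
<table>
  <tr><td colspan="3">January, 2006</td></tr>
  <tr><td>January 6</td><td>BloodRayne</td><td>Action</td></tr>
  <tr><td></td><td>Fateless</td><td>Drama</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Only one <table> matches, so "for x in soup.select('table')" iterates once.
tables = soup.select("table")
table_count = len(tables)

# find_all("td") on the table returns every cell of every row in one flat
# list, so fixed indexes such as cells[0] or cells[2] only reach the first few.
flat_cells = [td.get_text(strip=True) for td in tables[0].find_all("td")]

# Looping over the <tr> elements gives one list of cells per row instead.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in tables[0].find_all("tr")
]

print(table_count)
print(flat_cells)
print(rows)
```

This would also explain why the original output began with the month header "January, 2006": in the flat list, index 0 is the header cell and the first film's cells follow it.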