Removing extra tables from Python web-scraping results

My code produces extra tables that I want to get rid of. I want to keep only the table shown under the desired result below and remove all the others.

My code:

import csv
import os

import pandas as pd
import requests
from bs4 import BeautifulSoup

print(os.getcwd())
os.chdir(r'c:\Users\STaiwo\Desktop\My R code')

page = requests.get(
    "https://www.flyingblue.com/earn-and-spend-miles/airlines/partner/180/china-eastern.html",
    verify=False)
print(page.content)  # raw HTML content of the site
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())  # pretty-printed version of the HTML

# Every <tbody> on the page is printed, which is where the extra tables come from
for table in soup.findAll('tbody'):
    print('Table')
    list_of_rows = []
    for row in table.findAll('tr')[1:]:  # skip the header row
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&nbsp;', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
    print(list_of_rows)

The result I currently get:

Table [['First Class', 'F, U', '150%'], ['P', '125%'], ['Business Class', 'J, C, D, I', '125%'], ['Premium Economy Class', 'W', '110%'], ['Economy Class', 'Y, B', '100%'], ['E, H, M', '75%'], ['L, N, R, S, V, K', '50%'], ['T', '30%'], ['Not eligible for accrual', 'Z, Q, G', '0%']]
Table []
Table []
Table [['Distance in miles: 6,482', 'Total'], ['Booking sub-class: 125%', '8,103'], ['8,103']]
Table [['Distance in miles: 6,482', 'Total'], ['Booking sub-class: 125%', 'Elite bonus: 75%', '12,965'], ['8,103', '4,862']]
Table [['Distance in miles: 6,482', 'Total'], ['Booking sub-class: 50%', '3,241'], ['3,241']]
Table [['Distance in miles: 6,482', 'Total'], ['Booking sub-class: 50%', 'Elite bonus: N/A', '3241'], ['3241', '0']]

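The result I want:

Table [['First Class', 'F, U', '150%'], ['P', '125%'], ['Business Class', 'J, C, D, I', '125%'], ['Premium Economy Class', 'W', '110%'], ['Economy Class', 'Y, B', '100%'], ['E, H, M', '75%'], ['L, N, R, S, V, K', '50%'], ['T', '30%'], ['Not eligible for accrual', 'Z, Q, G', '0%']]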


2017-05-26 Ade

Try adding [:1] to soup.findAll('tbody'). The slice limits the results to the first table only.
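As a rough illustration of that suggestion, here is a minimal sketch that runs against a small inline HTML string instead of the live page. The `html` content is a made-up stand-in (its header row and second table are assumptions, not the real page markup), and `find_all` is the current spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: two tables, only the first is wanted.
html = """
<table><tbody>
<tr><th>Cabin class</th><th>Booking sub-class</th><th>Miles</th></tr>
<tr><td>First Class</td><td>F, U</td><td>150%</td></tr>
<tr><td>Business Class</td><td>J, C, D, I</td><td>125%</td></tr>
</tbody></table>
<table><tbody>
<tr><td>Distance in miles: 6,482</td><td>Total</td></tr>
<tr><td>Booking sub-class: 125%</td><td>8,103</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

tables = []
# The [:1] slice keeps only the first <tbody>; it yields an empty list if none exist.
for table in soup.find_all("tbody")[:1]:
    rows = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        rows.append([cell.get_text(strip=True) for cell in row.find_all("td")])
    tables.append(rows)

print(tables)
```

Note that this only works as long as the wanted table stays first on the page; if the site reorders its tables, the slice silently picks a different one.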


2017-05-26 17:18:02 varela

The page renders in French for me, so the table you want looks like this in my browser.

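Inspecting the HTML, I see several tables that share the same id, namely inlineTable. To select the right one even if the publisher moves the table elsewhere on the page, you need some other way to identify it. I noticed that the heading 'Classe de cabine' is unique to this table; in the English version it probably appears as 'Cabin class'. Let's use that.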

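First, get all the tables with that id. Then search the text of each table for 'Classe de cabine'. When you find it, output its rows, skipping the header row.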

>>> import requests 
>>> page = requests.get('https://www.flyingblue.com/earn-and-spend-miles/airlines/partner/180/china-eastern.html').text 
>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(page, 'lxml') 
>>> required_tables = soup.select('#inlineTable') 
>>> len(required_tables) 
7 
>>> for table in required_tables: 
...  if 'Classe de cabine' in table.text: 
...   rows = table.findAll('tr') 
...   for row in rows[1:]: 
...    row 
...    
<tr class="table-highlite-light"> 
<td rowspan="2" width="33%">Première Classe</td> 
<td width="33%">F, U</td> 
<td width="33%">150 %</td> 
</tr> 
<tr class="table-highlite-light"> 
<td>P</td> 
<td>125 %</td> 
</tr> 
<tr class="table-highlite-light"> 
<td>Classe Affaires</td> 
<td>J, C, D, I</td> 
<td>125 %</td> 
</tr> 
<tr class="table-highlite-light"> 
<td>Premium Economy Classe</td> 
<td>W</td> 
<td>110 %</td> 
</tr> 
<tr class="table-highlite-light"> 
<td rowspan="4">Classe Économique</td> 
<td>Y, B</td> 
<td>100 %</td> 
</tr> 
<tr class="table-highlite-light"> 
<td>E, H, M</td> 
<td>75 %</td> 
</tr> 
<tr class="table-highlite-light"> 
<td>L, N, R, S, V, K</td> 
<td>50 %</td> 
</tr> 
<tr class="table-highlite-light"> 
<td>T</td> 
<td>30%</td> 
</tr> 
<tr class="table-highlite-light"> 
<td>Non éligible pour l’accumulation</td> 
<td>Z, Q, G</td> 
<td>0 %</td> 
</tr>
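To get from those `<tr>` elements to the list-of-lists shape the question asks for, something along these lines might work. It is only a sketch: the inline `html` is a trimmed, made-up stand-in for the page (the header cells other than 'Classe de cabine' are assumptions), and `rows_of_table_with_heading` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed stand-in for the page: two tables sharing the id "inlineTable".
html = """
<table id="inlineTable">
<tr><th>Classe de cabine</th><th>Sous-classe</th><th>Miles</th></tr>
<tr><td rowspan="2">Première Classe</td><td>F, U</td><td>150 %</td></tr>
<tr><td>P</td><td>125 %</td></tr>
</table>
<table id="inlineTable">
<tr><td>Distance</td><td>Total</td></tr>
<tr><td>6,482</td><td>8,103</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def rows_of_table_with_heading(soup, heading):
    """Return the data rows of the first #inlineTable whose text contains `heading`."""
    for table in soup.select("#inlineTable"):
        if heading in table.get_text():
            # Skip the header row, then keep only the stripped text of each cell.
            return [
                [cell.get_text(strip=True) for cell in row.find_all("td")]
                for row in table.find_all("tr")[1:]
            ]
    return []  # no table carried the heading


list_of_rows = rows_of_table_with_heading(soup, "Classe de cabine")
print(list_of_rows)
```

Because of the `rowspan` cells, rows that continue a cabin class have two entries rather than three, which matches the shape of the output shown in the question. If the page is served in English, passing 'Cabin class' as the heading would presumably be the equivalent call.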


2017-05-26 19:51:24
