How to scrape a web page that lacks tags using BeautifulSoup

I want to scrape data from this web page: http://www.kitco.com/texten/texten.html

Here is the code I am using:

import requests 
from bs4 import BeautifulSoup 

url = "http://www.kitco.com/texten/texten.html" 
r = requests.get(url) 

# Doing this to force UTF-8 encoding. Not sure if this is needed...
r.encoding = "UTF-8"

soup = BeautifulSoup(r.content)
tag = soup.find_all("London Fix")
print(tag)

As you can see when viewing the page source, the term "London Fix" is not inside any tag. I don't know whether it is CDATA or something else...

Any ideas on how to parse these tables?


2014-08-29 Jeffrey Stilwell

If you are using r.content, there is really no need to set r.encoding. Using r.content is exactly right, by the way. – 2014-08-29 17:20:21

I think this is too broad, but I could also justify "unclear what you're asking", since you did not specify the output you expect. – 2014-08-29 17:21:45

I suggest you start by reading the [BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) a little more carefully, and also look at what `soup.find_all()` actually *does*. – 2014-08-29 17:22:26

As @shaktimaan pointed out in the comments, the "London Fix" table is not a real table: it sits inside a pre tag, and its rows are laid out with dashes.
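To see why the original find_all("London Fix") call comes back empty, here is a small offline sketch. The HTML string is a hypothetical, heavily simplified stand-in for the page (the real markup is much longer), and "html.parser" is chosen only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page's markup (an assumption,
# not the real source of the page).
html = """<html><body><pre>
<font size="4">Kitco Text Quotes</font>
--------------------
London Fix   GOLD
--------------------
Aug 29,2014  1285.75
</pre></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# A plain string passed to find_all() is compared against tag names,
# so this searches for elements named "London Fix" and finds none.
by_tag_name = soup.find_all("London Fix")
print(by_tag_name)

# The table is plain text: a string node that follows the <font> tag
# inside <pre>.
table_text = soup.pre.find("font", size="4").next_sibling
print(type(table_text).__name__)
print(table_text.strip())
```

The first print shows an empty list, while the sibling of the font tag is a text node that holds the whole dashed table.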

One option is to find the font tag that precedes the table and take its .next_sibling:

import requests
from bs4 import BeautifulSoup

url = "http://www.kitco.com/texten/texten.html"
r = requests.get(url)

soup = BeautifulSoup(r.content)
# The table is the text node right after the size-4 <font> tag inside <pre>.
print(soup.body.pre.find('font', size="4").next_sibling.strip())

Prints:

--------------------------------------------------------------------------------
London Fix        GOLD          SILVER     PLATINUM          PALLADIUM
                  AM       PM                     AM       PM       AM       PM
--------------------------------------------------------------------------------
Aug 29,2014  1285.75  1285.75  19.4700  1424.00  1424.00   895.00       NA
Aug 28,2014  1288.00  1292.00  19.7500  1425.00  1428.00   897.00   898.00
--------------------------------------------------------------------------------
...

Another option is to search by text (this produces the same output):

import re

# Find the first text node inside <pre> that contains "London Fix".
print(soup.body.pre.find(text=re.compile('London Fix')))
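Either approach returns the table as one block of text, so turning it into rows still takes some string handling. The following is a minimal sketch of that step using only the standard library. The sample rows are taken from the output shown above, while the function name parse_fix_table, the column names, and the row pattern are assumptions made for illustration, not part of the page or of BeautifulSoup.

```python
import re

# Sample text taken from the output shown above; on the live page it
# would come from the text node found inside <pre>.
table_text = """\
--------------------------------------------------------------------------------
London Fix   GOLD   SILVER  PLATINUM   PALLADIUM
       AM  PM     AM  PM   AM  PM
--------------------------------------------------------------------------------
Aug 29,2014 1285.75 1285.75 19.4700 1424.00 1424.00 895.00 NA
Aug 28,2014 1288.00 1292.00 19.7500 1425.00 1428.00 897.00 898.00
--------------------------------------------------------------------------------
"""

# Hypothetical column names, inferred from the two header lines.
COLUMNS = [
    "gold_am", "gold_pm", "silver",
    "platinum_am", "platinum_pm", "palladium_am", "palladium_pm",
]

# Assumed row shape: a date such as "Aug 29,2014" followed by the values.
ROW = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2},\d{4})\s+(.*)$")


def parse_fix_table(text):
    """Return one dict per data row, skipping separator and header lines."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # dashed separator or header line
        date, rest = match.groups()
        # "NA" marks a missing fix; everything else is parsed as a float.
        values = [None if v == "NA" else float(v) for v in rest.split()]
        if len(values) != len(COLUMNS):
            continue  # unexpected layout, skip rather than guess
        rows.append({"date": date, **dict(zip(COLUMNS, values))})
    return rows


rows = parse_fix_table(table_text)
for row in rows:
    print(row["date"], row["gold_am"], row["palladium_pm"])
```

Matching on the leading date keeps the parser independent of how many dashed or header lines surround the data, at the cost of assuming the date format stays the same.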


2014-08-29 17:46:17 alecxe
