Scraping content in JSON format - Python

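I want to scrape a page like this one with Python 3.5. I am using BeautifulSoup to scrape its content, but I am having trouble getting the number of sizes. On this particular page there are 9 sizes (FR 80 A, FR 80 B, FR 80 C, etc.). I believe this information is embedded in the page as JSON. I tried to use the json package, but I cannot work out the 'start' and 'end' of the JSON text. My code looks like this: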

import requests 
import json 

page = requests.get('https://www.laperla.com/fr/en/cfiplm000566-bgw532.html') 
content = page.text  
start = content.find('spConfig') + ... 
end = ...  
data = json.loads(content[start:end]) 
sizes = data['attributes']['179']['options'] 
print(len(sizes))

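The correct output should be "9", because there are 9 sizes. I do not want to use Selenium or similar packages. So, what are the correct 'start' and 'end'? Is there a better way to scrape this data than the one I am attempting?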

Source

2017-10-17 nesi

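1. Iterate over all script tags and look for the one that contains the target JSON.

2. Use a regex to capture the start and end of the JSON.

3. Parse the captured text with the json module.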

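import json
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')  # content = page.text from the question
for i in soup.select('script'):
    if 'Product.Config' in str(i):
        # group(2) is the argument passed to Product.Config(...)
        data = re.search(r'(?is)(Product\.Config\()(.*?)(\))', str(i)).group(2)

json_data = json.loads(data)
print(len(json_data['attributes']['179']['options']))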
9
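
The lazy regex above stops at the first `)` it meets, so it would presumably cut the JSON short if any value contained a closing parenthesis. A possible alternative, sketched here under assumptions, is to locate only the 'start' and let `json.JSONDecoder().raw_decode` find the 'end'. `SAMPLE_HTML` and `extract_config` are hypothetical names made up for this illustration, and the sample markup is a stand-in, not the real page source.

```python
import json
import re

# Hypothetical stand-in for the page source; the real page's markup may differ.
SAMPLE_HTML = """
<script type="text/javascript">
    var spConfig = new Product.Config({"attributes": {"179": {"label": "Size",
        "options": [{"id": "1", "label": "FR 80 A"},
                    {"id": "2", "label": "FR 80 B (limited)"}]}}});
</script>
"""


def extract_config(html, marker="Product.Config("):
    """Return the JSON value that directly follows `marker` in `html`."""
    # Skip any whitespace after the marker, since raw_decode does not.
    match = re.search(re.escape(marker) + r"\s*", html)
    if match is None:
        raise ValueError("marker not found: " + marker)
    # raw_decode parses one JSON value starting at the given index and
    # returns (value, index_after_value), so no explicit 'end' is needed.
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


config = extract_config(SAMPLE_HTML)
sizes = config["attributes"]["179"]["options"]
print(len(sizes))
```

On the real page, `SAMPLE_HTML` would be replaced by `page.text`. The attribute key `'179'` is taken from the question and may differ on other product pages.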

Source

2017-10-17 11:44:47
