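How about using the following to capture every article on the home page, including its link, author, and publish date? You can store the results in a dictionary, or in a pandas DataFrame for easy manipulation.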
from bs4 import BeautifulSoup
import requests

baseurl = 'http://www.fiercepharma.com'
response = requests.get(baseurl)
# Name the parser explicitly so bs4 does not pick one implicitly and warn
soup = BeautifulSoup(response.content, 'html.parser')

cdict = {}
# Each article card on the home page is a div with these three classes
for group in soup.find_all('div', {'class' : 'card horizontal views-row'}):
    try:
        title = group.find('h2', {'class' : 'field-content list-title'}).text
        # hrefs on the page are relative, so prepend the base URL
        link = baseurl + group.find('h2', {'class' : 'field-content list-title'}).find('a', href=True)['href']
        author = group.find('span', {'class' : 'field-content'}).find('a').text
        time = group.find('span', {'class' : 'field-content'}).find('time').text
        content = group.find('p', {'class' : 'field-content card-text'}).text
        cdict[link] = {'title' : title, 'author' : author, 'time' : time, 'content' : content}
    except AttributeError as e:
        # find() returned None for a missing element; skip this card
        print('[-] Unable to parse {}'.format(e))

print(cdict)
# Example output (truncated):
#{'http://www.fiercepharma.com/manufacturing/lonza-bulks-up-5-5b-deal-for-capsugel': {'author': u'Eric Palmer',
# 'content': u'Swiss CDMO Lonza has pulled the trigger on a $5.5 billion deal to acquire the U.S.-based contract capsule and drug producer Capsugel to create another sizable\u2026',
# 'time': u'Dec 15, 2016 8:45am',
# 'title': u'Lonza bulks up with $5.5B deal for Capsugel'},
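
As a rough sketch of the pandas option mentioned above, the dictionary can be turned into a DataFrame with one row per article. The `cdict` literal here is just the single sample entry from the output above, standing in for whatever the scraping loop actually collects, and using `link` as the name of the URL column is my own assumption.

```python
import pandas as pd

# Sample entry copied from the output above; in practice `cdict` would be
# the dictionary built by the scraping loop.
cdict = {
    'http://www.fiercepharma.com/manufacturing/lonza-bulks-up-5-5b-deal-for-capsugel': {
        'title': 'Lonza bulks up with $5.5B deal for Capsugel',
        'author': 'Eric Palmer',
        'time': 'Dec 15, 2016 8:45am',
        'content': 'Swiss CDMO Lonza has pulled the trigger on a $5.5 billion deal '
                   'to acquire the U.S.-based contract capsule and drug producer '
                   'Capsugel to create another sizable\u2026',
    },
}

# orient='index' makes each outer key (the link) a row label; the index is
# then named 'link' (an assumed column name) and moved into a regular column.
df = pd.DataFrame.from_dict(cdict, orient='index').rename_axis('link').reset_index()

# Ordinary DataFrame operations then apply, e.g. filtering by author.
by_author = df[df['author'] == 'Eric Palmer']
print(by_author[['title', 'time']])
```

Keeping the link as a column rather than the index is only a convenience for filtering and exporting; leaving it as the index works just as well.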
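That's because selecting div.region.region-content from the soup gives you the whole data as a single element. Let me try the code and get back to you. – rrmerugu
Replace your 'div_sub' part with 'div_sub = main_div.select(".card.horizontal.views-row")' and it works fine. – rrmerugu
Thanks - can you tell me why it hangs on the 'div' in 'div.card.horizontal.views-row'? ... I had tried using that selector with the div in front. –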
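
For anyone who wants to try the `select` call suggested in the comments without hitting the live site, here is a minimal sketch on an inline snippet. The HTML below is made up for illustration and only borrows the class names used in this thread; the real page markup may differ, and this does not attempt to explain the behaviour asked about in the last comment.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from this thread
html = """
<div class="region region-content">
  <div class="card horizontal views-row">
    <h2 class="field-content list-title"><a href="/a/first">First</a></h2>
  </div>
  <div class="card horizontal views-row">
    <h2 class="field-content list-title"><a href="/a/second">Second</a></h2>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# The outer container is a single element...
main_div = soup.select_one('div.region.region-content')

# ...and the CSS class selector returns one element per article card.
div_sub = main_div.select('.card.horizontal.views-row')

titles = [card.select_one('h2.list-title a').get_text(strip=True) for card in div_sub]
links = [card.select_one('h2.list-title a')['href'] for card in div_sub]
print(titles, links)
```

A selector such as `.card.horizontal.views-row` matches elements that carry all three classes, in any order.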