2016-12-16

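BeautifulSoup and Python: scraping around JS tags, maybe?

I'm trying to get multiple headlines, links, and dates, but I only get the first one. I'm not sure why BS4 isn't fetching all of the items. Is this a JavaScript issue?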

from bs4 import BeautifulSoup 
from urllib import urlopen 

html = urlopen("http://www.fiercepharma.com/news") 
soup = BeautifulSoup(html.read().decode('utf-8'),"lxml") 
main_div = soup.select_one("div#content") 
div_sub = main_div.select("div.region.region-content") 

for d in div_sub: 
    date = d.time.get_text() 
    headline = d.h2.a.get_text() 
    url = d.a["href"] 
    print headline, url, date 

That's because selecting div.region.region-content gives you all of the data as a single element. Let me try the code and get back to you. – rrmerugu

Replace your 'div_sub' line with 'div_sub = main_div.select(".card.horizontal.views-row")' and it works fine. – rrmerugu

Thanks. Can you tell me why it gets hung up on the 'div' in 'div.card.horizontal.views-row'? I had tried using that selector with the div in front. –

Answers

Both div.card.horizontal.views-row and .card.horizontal.views-row should work, @citra_amarillo. I ran this and it works with either selector:

from bs4 import BeautifulSoup 
from urllib import urlopen 


html = urlopen("http://www.fiercepharma.com/news") 
soup = BeautifulSoup(html.read().decode('utf-8'),"lxml") 
main_div = soup.select_one("div#content") 
div_sub = main_div.select(".card.horizontal.views-row") 
#div_sub = main_div.select("div.card.horizontal.views-row") 

for d in div_sub: 
    date = d.time.get_text() 
    headline = d.h2.a.get_text() 
    url = d.a["href"] 
    print headline, url, date 
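
To illustrate why the original selector returned only one item, here is a minimal offline sketch written for Python 3. The inline HTML is a simplified, made-up stand-in for the real page (the class names come from this thread, everything else is an assumption), and it uses the built-in "html.parser" instead of "lxml" so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real listing page.
SAMPLE_HTML = """
<div id="content">
  <div class="region region-content">
    <div class="card horizontal views-row">
      <h2><a href="/story-one">Story one</a></h2>
      <time>Dec 15, 2016</time>
    </div>
    <div class="card horizontal views-row">
      <h2><a href="/story-two">Story two</a></h2>
      <time>Dec 16, 2016</time>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
main_div = soup.select_one("div#content")

# The wrapper matches only once, so a loop over it runs a single time,
# and d.h2 / d.time return just the first matching descendant.
wrappers = main_div.select("div.region.region-content")

# Each article card matches separately, with or without the tag name.
cards = main_div.select(".card.horizontal.views-row")
cards_with_tag = main_div.select("div.card.horizontal.views-row")

# One (headline, url, date) tuple per card.
rows = [(c.h2.a.get_text(), c.a["href"], c.time.get_text()) for c in cards]

print(len(wrappers), len(cards), len(cards_with_tag))
for row in rows:
    print(row)
```

In this sample the wrapper selector yields one element while both card selectors yield two, which matches the behaviour described in the comments above.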

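How about using the following to capture every article on the home page, including its link, author, and publish date? You can store the results in a dictionary, or load them into a pandas DataFrame for easy manipulation.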

from bs4 import BeautifulSoup 
import requests 

baseurl = 'http://www.fiercepharma.com' 
response = requests.get(baseurl) 

soup = BeautifulSoup(response.content, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning

cdict = {} 

for group in soup.find_all('div', {'class' : 'card horizontal views-row'}):
    try:
        title = group.find('h2', {'class' : 'field-content list-title'}).text
        link = baseurl + group.find('h2', {'class' : 'field-content list-title'}).find('a', href=True)['href']
        author = group.find('span', {'class' : 'field-content'}).find('a').text
        time = group.find('span', {'class' : 'field-content'}).find('time').text
        content = group.find('p', {'class' : 'field-content card-text'}).text
        cdict[link] = {'title' : title, 'author' : author, 'time' : time, 'content' : content}
    except AttributeError as e:
        print('[-] Unable to parse {}'.format(e))

print(cdict) 
#{'http://www.fiercepharma.com/manufacturing/lonza-bulks-up-5-5b-deal-for-capsugel': {'author': u'Eric Palmer', 
# 'content': u'Swiss CDMO Lonza has pulled the trigger on a $5.5 billion deal to acquire the U.S.-based contract capsule and drug producer Capsugel to create another sizable\u2026', 
# 'time': u'Dec 15, 2016 8:45am', 
# 'title': u'Lonza bulks up with $5.5B deal for Capsugel'},
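
As a rough sketch of the pandas option mentioned above, a dictionary shaped like cdict can be loaded into a DataFrame with one row per article. This assumes Python 3 and that pandas is installed. The first entry is copied from the sample output; the second is a hypothetical placeholder added only so the frame has more than one row.

```python
import pandas as pd

# Same shape as cdict above: link -> {title, author, time, content}.
cdict = {
    "http://www.fiercepharma.com/manufacturing/lonza-bulks-up-5-5b-deal-for-capsugel": {
        "title": "Lonza bulks up with $5.5B deal for Capsugel",
        "author": "Eric Palmer",
        "time": "Dec 15, 2016 8:45am",
        "content": "Swiss CDMO Lonza has pulled the trigger on a $5.5 billion "
                   "deal to acquire the U.S.-based contract capsule and drug "
                   "producer Capsugel to create another sizable\u2026",
    },
    # Hypothetical placeholder entry, not real scraped data.
    "http://www.fiercepharma.com/example/placeholder-article": {
        "title": "Placeholder title",
        "author": "Placeholder Author",
        "time": "Dec 16, 2016 9:00am",
        "content": "Placeholder summary text.",
    },
}

# orient="index" turns each outer key (the link) into a row label.
df = pd.DataFrame.from_dict(cdict, orient="index")

# Move the link out of the index into an ordinary column and fix the column order.
df = df.rename_axis("link").reset_index()
df = df[["link", "title", "author", "time", "content"]]

print(df[["title", "author", "time"]])
```

From here the usual DataFrame operations apply, for example filtering by author with df[df["author"] == "Eric Palmer"].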