2017-04-20 68 views
1

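Extracting image title and image URL using BeautifulSoup

I want to extract the image URL and the image title from an article using BeautifulSoup. I can isolate the element that holds the image URL and title from the HTML before and after it, but I can't figure out how to pull those two values out of the tag. Here is my code: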

from bs4 import BeautifulSoup 
import requests 
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url) 
html = r.text 
soup = BeautifulSoup(html, 'lxml') 
links = soup.find_all('div', {'class': 'image'}) 

The two pieces I am trying to extract are the src= and title= attributes. Any ideas on how to parse these two out would be much appreciated.

Answers

1
from bs4 import BeautifulSoup 
import requests 
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html' 
r = requests.get(url) 
html = r.text 
soup = BeautifulSoup(html, 'lxml') 
links = soup.find_all('div', {'class': 'image'}) 
# Each matched div wraps one <img>; read its src and title attributes
print([i.find('img')['src'] for i in links])
print([i.find('img')['title'] for i in links])

@Bill If this solved your problem, please accept the answer.


This works perfectly. Thank you very much.


The correct parser is 'html5lib' rather than 'lxml', which is meant for 'xml'.

0

Try the following to extract all image tags:

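imgs = soup.find_all('img')
# find_all returns a list of tags, so loop over it to read each image's attributes
for img in imgs:
    src = img.get('src')      # None if the attribute is missing
    title = img.get('title')  # None if the attribute is missing
    print(src, title)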
0

A late answer, but you can use:

from bs4 import BeautifulSoup 
import requests 
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html' 
r = requests.get(url) 
html = r.text 
soup = BeautifulSoup(html, "html5lib") 
links = soup.find_all('div', {'class': 'image'}) 
if links: 
    print(links[0].find('img')['src']) 
    print(links[0].find('img')['title']) 

Output:

http://mma.prnewswire.com/media/491859/Koert_van_Mensvoort.jpg?w=950

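Dutch philosopher Koert van Mensvoort, founder of the Next Nature Network and Next Nature fellow at the Eindhoven University of Technology, wrote a "Letter to Humanity" in support of International Earth Day. (PRNewsfoto/Next Nature Network)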
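
If the page may contain several images, or some images lack a title, a variant along these lines might be more robust. This is only a sketch: the `SAMPLE_HTML` string and the `extract_images` helper are made up for illustration (they stand in for the downloaded page and are not part of the answers above), and it uses the built-in `html.parser` so that it runs without extra parser packages.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page (an assumption,
# not the real prnewswire HTML).
SAMPLE_HTML = """
<div class="image"><img src="http://example.com/a.jpg" title="First caption"></div>
<div class="image"><img src="http://example.com/b.jpg"></div>
<div class="image"><p>No image here</p></div>
"""


def extract_images(html):
    """Return a list of (src, title) tuples for every <img> inside a div.image."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # The CSS selector matches only <img> tags nested in <div class="image">,
    # so divs without an image are skipped automatically.
    for img in soup.select("div.image img"):
        # .get() returns None instead of raising KeyError for a missing attribute
        results.append((img.get("src"), img.get("title")))
    return results


if __name__ == "__main__":
    for src, title in extract_images(SAMPLE_HTML):
        print(src, title)
```

Compared with `i.find('img')['title']`, this avoids an exception when a div has no `<img>` or the image has no `title`; the missing value simply comes back as `None`.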