2016-05-14 69 views

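How to get rid of the blank lines above the text, using bs4

Okay, so I'm using bs4 (BeautifulSoup) to parse a website and find the specific headlines I'm looking for. My code looks like this: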

import requests
from bs4 import BeautifulSoup
url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r)
for i in soup.find_all(class_='article-short'):
    if i.a:
        print(i.a.text.replace('\n', '').strip())
    else:
        print(i.contents[0].strip())

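This code works, but its output starts with about 20 blank lines before it prints the headlines from the website. Is there something wrong with my code, or is there something I can do to get rid of the blank lines?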


With the strip function you can remove whitespace from a string (https://docs.python.org/3/library/stdtypes.html#str.strip) – Querenker

Answer


That's because the page contains markup like this:

<article class="article-short"> 
<div class="thumb"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather"><img alt="FILE: Boys who have undergone a circumcision ceremony walk near Qunu in the Eastern Cape in 2013. Picture: AFP." height="147" src="http://ewn.co.za/cdn/-%2fmedia%2f3C37CB28056746CD95FC913757AAD41C.ashx%3fas%3d1%26h%3d147%26w%3d234%26crop%3d1;waeb9b8157b3e310df" width="234"/></a></div> 
<h6 class="h6-mega"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather">Contralesa against scrapping initiation due to cold weather</a></h6> 
</article> 

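Here the first link in the article wraps an image and has no text, so i.a.text is an empty string and print() outputs a blank line for it.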
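To see this without hitting the site, here is a minimal sketch that parses a trimmed copy of the markup above. The shortened href/src values and the variable names are made up for illustration; the only assumption about bs4 is that `tag.a` behaves like `tag.find('a')`, that is, it returns the first matching descendant.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one article from the page; href/src values are placeholders.
html = """
<article class="article-short">
<div class="thumb"><a href="/story"><img alt="FILE: photo" src="/thumb.jpg"/></a></div>
<h6 class="h6-mega"><a href="/story">Contralesa against scrapping initiation due to cold weather</a></h6>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find(class_="article-short")

# article.a is the first <a> descendant: the thumbnail link, which has no text.
first_link_text = article.a.text.strip()
# The headline text sits in the <a> inside the <h6>.
headline = article.h6.text.strip()

print(repr(first_link_text))  # empty string, which print() would show as a blank line
print(repr(headline))
```

So strip() is doing its job; the problem is that the element being printed has no text to begin with.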

You should look for the h6 tag instead. So something like this should work:

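import requests
from bs4 import BeautifulSoup

url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
for i in soup.find_all(class_='article-short'):
    # Use the <h6> headline if present; otherwise fall back to the first child node
    title = (i.h6.text.replace('\n', '') if i.h6 else i.contents[0]).strip()
    if title:  # skip entries whose title is empty
        print(title)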
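If you want to check the loop without a network request, the same idea can be wrapped in a function that takes an HTML string. This is only a sketch under a few assumptions: `extract_titles` and the sample markup are hypothetical, and it swaps in `get_text(" ", strip=True)` for the `replace`/`strip` chain, with a fallback to the article's whole text instead of `contents[0]`.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the non-empty headline of every 'article-short' element."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for article in soup.find_all(class_="article-short"):
        # Prefer the <h6> headline; otherwise use all text inside the article.
        node = article.h6 if article.h6 else article
        title = node.get_text(" ", strip=True)
        if title:  # skip articles that yield no text at all
            titles.append(title)
    return titles


# Hypothetical sample: one article with a thumbnail and headline,
# one with plain text only, and one with only a thumbnail.
sample = """
<article class="article-short">
<div class="thumb"><a href="/a"><img src="/a.jpg"/></a></div>
<h6 class="h6-mega"><a href="/a">First headline</a></h6>
</article>
<article class="article-short">Plain text headline</article>
<article class="article-short"><div class="thumb"><a href="/c"><img src="/c.jpg"/></a></div></article>
"""

for title in extract_titles(sample):
    print(title)
```

For the live page you would pass `requests.get(url).text` to the function instead of `sample`.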

Thanks! @aldanor It works much better now! – raid3r