2016-07-27 87 views
0

Python BeautifulSoup: removing self-closing tags

I am trying to remove the br tags from some HTML using BeautifulSoup.

The HTML looks like this:

<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;"> 
Doctor of Philosophy (Software Engineering), Universiti Teknologi Petronas 
<br> 
Master of Science (Computer Science), Government College University Lahore 
<br> 
Master of Science (Computer Science), University of Agriculture Faisalabad 
<br> 
Bachelor of Science (Hons) (Agriculture),University of Agriculture Faisalabad 
<br></span> 

My Python code:

for link2 in soup.find_all('br'): 
     link2.extract() 
for link2 in soup.findAll('span',{'class':'qualification'}): 
     print(link2.string) 

The problem is that this code only retrieves the first qualification.

Answers

1

Because none of these <br>s has a closing counterpart, Beautiful Soup adds them automatically, producing the following HTML:

In [23]: soup = BeautifulSoup(html) 

In [24]: soup.br 
Out[24]: 
<br> 
Master of Science (Computer Science), Government College University Lahore 
<br> 
Master of Science (Computer Science), University of Agriculture Faisalabad 
<br> 
Bachelor of Science (Hons) (Agriculture),University of Agriculture Faisalabad 
<br/></br></br></br> 

When you call Tag.extract on the first <br> tag, you remove all of its descendants, along with the strings those descendants contain:

In [27]: soup 
Out[27]: 
<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;"> 
Doctor of Philosophy (Software Engineering), Universiti Teknologi Petronas 
</span> 
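
To see this subtree removal in isolation, here is a minimal sketch that uses explicitly nested markup, so the result does not depend on how a particular parser treats <br>. The sample HTML and the variable names are made up for illustration, and html.parser is an assumed parser choice:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with explicit nesting (not the OP's HTML).
html = "<div><p>first <b>second</b></p>third</div>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag together with everything inside it
# and returns the detached tag.
removed = soup.p.extract()

print(removed.get_text())   # text that left the tree along with <p>
print(soup.div.get_text())  # text that is still in the tree
```

Anything nested under the extracted tag, here the <b> element and its string, leaves the tree with it.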

It seems you just need to extract all the text from the span element. If so, don't bother removing anything:

In [28]: soup.span.text 
Out[28]: '\nDoctor of Philosophy (Software Engineering), Universiti Teknologi Petronas\n\nMaster of Science (Computer Science), Government College University Lahore\n\nMaster of Science (Computer Science), University of Agriculture Faisalabad\n\nBachelor of Science (Hons) (Agriculture),University of Agriculture Faisalabad\n' 

The Tag.text attribute extracts all strings from the given tag.
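
If one qualification per line is wanted instead of the raw text with its blank lines, get_text() accepts a separator and a strip flag. A minimal sketch, assuming a shortened version of the markup from the question and the html.parser parser (both are illustrative choices, not the OP's setup):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the OP's span.
html = """<span class="qualification">
Doctor of Philosophy (Software Engineering), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br></span>"""

soup = BeautifulSoup(html, "html.parser")

# Strip each string, skip the ones that become empty,
# and join the rest with newlines.
text = soup.span.get_text("\n", strip=True)
print(text)
```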

+0

So, if beautifulsoup automatically adds the '</br>' closing tag, could this problem be avoided by using the XHTML-compatible '<br/>'? – HolyDanna

+0

@HolyDanna: Yes. Even so, the OP would still need to use 'Tag.text' or 'Tag.stripped_strings' to get the contents of the 'span'. – vaultah
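
The reason '.string' is not enough can be sketched as follows: it is only defined when a tag has a single child, so a span that mixes text and '<br/>' elements yields None. The markup below is a shortened, assumed variant of the OP's that uses XHTML-style '<br/>', parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup using XHTML-style <br/>.
html = '<span class="qualification">PhD<br/>MSc<br/>BSc<br/></span>'
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="qualification")

# The span has several children (strings and <br/> tags),
# so .string is None.
print(span.string)

# .text concatenates every string inside the span.
print(span.text)
```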

0

Using unwrap should work:

soup = BeautifulSoup(html) 
for match in soup.findAll('br'): 
    match.unwrap() 
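
A fuller sketch of this approach, assuming html.parser and a shortened stand-in for the OP's markup. Once the <br> tags are gone the span still holds several separate strings, so the text is read here with get_text() rather than .string:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup.
html = '<span class="qualification">PhD<br>MSc<br>BSc<br></span>'
soup = BeautifulSoup(html, "html.parser")

# unwrap() replaces each <br> with its own children (if any),
# so no text is lost even if a parser nested content under <br>.
for match in soup.find_all("br"):
    match.unwrap()

span = soup.find("span", class_="qualification")
remaining = span.find_all("br")  # <br> tags left inside the span
text = span.get_text()           # all text inside the span
print(remaining, text)
```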

0

Here's one way to do it:

for link2 in soup.findAll('span',{'class':'qualification'}):
    for s in link2.stripped_strings:
        print(s)

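There's no need to remove the <br> tags unless you need them gone for later processing. Here link2.stripped_strings is a generator that yields each string in the tag with leading and trailing whitespace removed. The print loop can be written more concisely as: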

for link2 in soup.findAll('span',{'class':'qualification'}): 
    print(*link2.stripped_strings, sep='\n') 
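
If the strings are needed for further processing rather than printing, the same generator can be collected into lists, one per span. A minimal sketch, assuming html.parser and made-up sample markup with two spans:

```python
from bs4 import BeautifulSoup

# Two hypothetical spans; the qualification texts are placeholders.
html = """
<span class="qualification">PhD, University A<br>MSc, University B<br></span>
<span class="qualification">BSc, University C<br></span>
"""
soup = BeautifulSoup(html, "html.parser")

# One list of stripped strings per matching span.
qualifications = [
    list(span.stripped_strings)
    for span in soup.find_all("span", class_="qualification")
]
print(qualifications)
```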

+0

Thanks, it works. – Aaron