2013-05-12 106 views

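Removing special characters from tag content with BeautifulSoup

I want to compare a string with the content of an HTML page, but the special characters in the page make the comparison harder. So before comparing, I want to remove all special characters and whitespace from the HTML page, while leaving every tag unchanged.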

<div class="abc bcd"> 
     <div class="inner1"> Hai ! this is first inner div;</div> 
     <div class="inner2"> "this is second div... " </div> 
</div> 

This should be converted to:

<div class="abc bcd"> 
      <div class="inner1">Haithisisfirstinnerdiv</div> 
      <div class="inner2">thisisseconddiv</div> 
</div> 

How can this be done?

Find out how to replace text with BeautifulSoup. – Blender 2013-05-12 03:57:47

Answers

Find all the leaf tags and change their strings.

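from bs4 import BeautifulSoup, Tag

# ASCII letters only; every other character is dropped.
alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

def replace(soup):
    for child in soup.children:
        if not isinstance(child, Tag):
            continue  # skip the whitespace text nodes between tags
        if child.string:
            # Leaf tag: keep only the letters of its text.
            child.string = ''.join(ch for ch in child.string if ch in alphabet)
        else:
            replace(child)

orig_string = """
<div class="abc bcd">
     <div class="inner1"> Hai ! this is first inner div;</div>
     <div class="inner2"> "this is second div... " </div>
</div> """

# 'lxml' wraps the fragment in <html><body>; 'html.parser' would not.
soup = BeautifulSoup(orig_string, 'lxml')
print(soup.prettify())  # original HTML
replace(soup)
print()
print(soup.prettify())  # new HTML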

Output (the original HTML, followed by the HTML after replace()):

<html>
 <body>
  <div class="abc bcd">
   <div class="inner1">
    Hai ! this is first inner div;
   </div>
   <div class="inner2">
    "this is second div... "
   </div>
  </div>
 </body>
</html>

<html>
 <body>
  <div class="abc bcd">
   <div class="inner1">
    Haithisisfirstinnerdiv
   </div>
   <div class="inner2">
    thisisseconddiv
   </div>
  </div>
 </body>
</html>

Just a small thing: 'import string; string.letters' gives you the lowercase and uppercase letters :) – TerryA 2013-05-12 04:34:41
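
A note for readers on Python 3: `string.letters` exists only in Python 2, and the ASCII-only constant in Python 3 is `string.ascii_letters`. A minimal sketch of the same letter filter built on it follows; the helper name `keep_ascii_letters` is made up for illustration and does not appear in the answer.

```python
import string


def keep_ascii_letters(text):
    """Return text with everything except ASCII letters removed."""
    allowed = set(string.ascii_letters)  # a-z followed by A-Z
    return ''.join(ch for ch in text if ch in allowed)


print(keep_ascii_letters(' Hai ! this is first inner div;'))
```

Building a set first is only a small convenience: membership tests on a set do not scan the whole alphabet for each character.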

For Unicode awareness, don't enumerate all the letters. Instead, import 'unicodedata' and, rather than 'ch in alphabet', use 'unicodedata.category(ch)[0] == 'L''. – icktoofay 2013-05-12 05:04:42
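
The suggestion in this comment could look roughly like the sketch below. The function name `keep_letters` is hypothetical; the only library call used is `unicodedata.category`, which returns a two-character general category such as 'Lu' or 'Ll' for letters.

```python
import unicodedata


def keep_letters(text):
    """Keep only characters whose Unicode general category starts with 'L'."""
    return ''.join(ch for ch in text if unicodedata.category(ch)[0] == 'L')


# '\u00e9' is a precomposed e-acute, which counts as a lowercase letter.
print(keep_letters('caf\u00e9 au lait!'))
```

Note that this test drops digits (category 'Nd') and standalone combining marks (categories starting with 'M'), so text in decomposed form would lose its accents.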

Also, the 'child =' in your 'child = replace(child)' serves no purpose. – icktoofay 2013-05-12 05:05:01

First, calling BeautifulSoup() already repairs some broken HTML. To get rid of the whitespace and special characters:

>>> from bs4 import BeautifulSoup
>>> html = """<div class="abc bcd">
...     <div class="inner1"> Hai ! this is first inner div;</div>
...     <div class="inner2"> "this is second div... " </div>
... </div>"""
>>> soup = BeautifulSoup(html, 'lxml')  # lxml adds the <html><body> wrapper
>>> for divtag in soup.find_all('div'):
...     if 'inner' in divtag['class'][0]:
...         divtag.string = ''.join(i for i in divtag.string if i.isalnum())
...
>>> print(soup)
<html><body><div class="abc bcd"> 
<div class="inner1">Haithisisfirstinnerdiv</div> 
<div class="inner2">thisisseconddiv</div> 
</div></body></html>
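
Both answers assign to `.string`, which works for leaf tags but would replace any nested tags inside the element being rewritten. As an alternative, here is a hedged sketch that rewrites the text nodes themselves, so every tag stays in place. It is not taken from either answer: the function name `strip_text_nodes` is hypothetical, and it assumes the built-in 'html.parser', which does not add an `<html><body>` wrapper.

```python
from bs4 import BeautifulSoup, NavigableString


def strip_text_nodes(html):
    """Remove non-alphanumeric characters from every text node, keeping tags."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all returns a list, so the tree can be modified while looping.
    for node in soup.find_all(string=True):
        if type(node) is not NavigableString:
            continue  # leave comments, doctypes and similar nodes alone
        cleaned = ''.join(ch for ch in node if ch.isalnum())
        if cleaned:
            node.replace_with(cleaned)
        else:
            node.extract()  # drop whitespace-only text nodes entirely
    return str(soup)


print(strip_text_nodes('<p>Hi, <b>there!</b> you.</p>'))
```

Because only text nodes are touched, markup such as the `<b>` element above survives. Like `isalnum()` in the answer, this keeps digits as well as letters.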