2014-09-20

How to remove all "document.write('');" calls from <table> </table> using BeautifulSoup

I have the following raw HTML. How do I remove all the "document.write('');" wrappers with BeautifulSoup?

document.write('<table>'); 
document.write(' 
<tr> 
    <td> 
    <span class="prod"> 
    some text 
    </span> 
    </td> 
    '); 
document.write(' 
    <td> 
    <span class="prod"> 
    7.70.022 
    </span> 
    </td> 
</tr> 
'); 
document.write('</table>'); 

I need the following result with BeautifulSoup:

<table> 
<tr> 
    <td> 
    <span class="prod"> 
    some text 
    </span> 
    </td> 
    <td> 
    <span class="prod"> 
    7.70 
    </span> 
    </td> 
</tr> 
</table> 

Answer

Why not just use a regex to remove the parts you don't want, and then parse the result with BeautifulSoup?

import re 

data = """document.write('<table>'); 
document.write(' 
<tr> 
    <td> 
    <span class="prod"> 
    some text 
    </span> 
    </td> 
    '); 
document.write(' 
    <td> 
    <span class="prod"> 
    7.70.022 
    </span> 
    </td> 
</tr> 
'); 
document.write('</table>');""" 

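# Capture the text between document.write(' and '); without the newline padding.
pattern = re.compile(r"document\.write\('\n?([^']*?)(?:\n\s*)?'\);")
# A raw string avoids the invalid-escape warning that '\g<1>' triggers on Python 3.
data = pattern.sub(r'\g<1>', data)
print(data)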

Output:

<table> 
<tr> 
    <td> 
    <span class="prod"> 
    some text 
    </span> 
    </td> 
    <td> 
    <span class="prod"> 
    7.70.022 
    </span> 
    </td> 
</tr> 
</table>
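
The answer stops at the cleaned string, so here is a rough sketch of the second half of the suggestion: handing that string to BeautifulSoup. The helper names `strip_document_write` and `prod_values` and the shortened `SAMPLE` input are made up for illustration. The sketch assumes the `bs4` package is installed and that the HTML inside each `document.write('...')` call contains no single quotes, because the pattern stops at the first `'` it sees.

```python
import re

# Same pattern as in the answer: capture what sits between
# document.write(' and '); and drop the newline padding around it.
DOCUMENT_WRITE = re.compile(r"document\.write\('\n?([^']*?)(?:\n\s*)?'\);")

# Shortened version of the question's input (illustrative only).
SAMPLE = """document.write('<table>');
document.write('
<tr>
    <td><span class="prod">some text</span></td>
    ');
document.write('
    <td><span class="prod">7.70.022</span></td>
</tr>
');
document.write('</table>');"""


def strip_document_write(js_text):
    """Return js_text with every document.write('...'); wrapper removed."""
    return DOCUMENT_WRITE.sub(r"\g<1>", js_text)


def prod_values(html):
    """Parse the cleaned markup and return the text of each span.prod."""
    # Imported here so the regex helper above runs without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True)
            for span in soup.find_all("span", class_="prod")]


cleaned = strip_document_write(SAMPLE)
print(cleaned)
```

With `bs4` available, `prod_values(cleaned)` returns the stripped text of the two `prod` spans. If the real page escapes quotes inside the written strings (for example `\'`), this pattern will cut the match short and would need to be adjusted.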