2017-02-11 29 views
-5

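I am trying to replace <td> tags: with a bare <td> if they carry no background color information, and with <td style='background:color;'> if they do. In both cases, everything else inside the td tag should be stripped. How can I replace HTML tags using regular expressions in R, without using a replace function?

Reproducible example: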

<table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=513 style='width:385.0pt;margin-left:-.15pt;border-collapse:collapse'> 
<tr style='height:15.0pt'> 
<td width=411 nowrap style='width:308.0pt;border:solid windowtext 1.0pt;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>hdinka</td> 
<td width=103 nowrap valign=bottom style='width:77.0pt;border:solid windowtext 1.0pt;border-left:none;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>kya</td> 
</tr> 
<tr style='height:15.0pt'> 
<td width=411 nowrap style='width:308.0pt;border:solid windowtext 1.0pt;border-top:none;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>chika</td> 
<td width=103 nowrap valign=bottom style='width:77.0pt;border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;background:red;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>&nbsp</td> 
</tr> 
<tr style='height:15.0pt'> 
<td width=411 nowrap style='width:308.0pt;border:solid windowtext 1.0pt;border-top:none;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>pongal</td> 
<td width=103 nowrap valign=bottom style='width:77.0pt;border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>hawk</td> 
</tr> 
</table> 

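So, if a <td> tag has any background, the result of the regex should look like this: <td style='background:red;'>, and if there is no background, the result should be just <td>

Can this be done without using a replace function? If not, please explain how to do it.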

+0

So if there is any background in the <td> tag, the result of the regex should look like this: ''? – shove

+0

@shove. If there is no background, then the result should be just <td> – Pratham

+0

What? "_something like this:_" Where is **this**? – MYGz

Answers

2

You could try something like this with BeautifulSoup:

import re 
from bs4 import BeautifulSoup 

html="""<table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=513 style='width:385.0pt;margin-left:-.15pt;border-collapse:collapse'> 
<tr style='height:15.0pt'> 
<td width=411 nowrap style='width:308.0pt;border:solid windowtext 1.0pt;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>hdinka</td> 
<td width=103 nowrap valign=bottom style='width:77.0pt;border:solid windowtext 1.0pt;border-left:none;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>kya</td> 
</tr> 
<tr style='height:15.0pt'> 
<td width=411 nowrap style='width:308.0pt;border:solid windowtext 1.0pt;border-top:none;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>chika</td> 
<td width=103 nowrap valign=bottom style='width:77.0pt;border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;background:red;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>&nbsp</td> 
</tr> 
<tr style='height:15.0pt'> 
<td width=411 nowrap style='width:308.0pt;border:solid windowtext 1.0pt;border-top:none;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>pongal</td> 
<td width=103 nowrap valign=bottom style='width:77.0pt;border-top:none;border-left:none;border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;padding:0in 5.4pt 0in 5.4pt;height:15.0pt'>hawk</td> 
</tr> 
</table>""" 

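soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('td'):
    # Look for a background declaration in the inline style, if there is one.
    # a.get() avoids a KeyError on cells that have no style attribute.
    match = re.search(r'background:\w+', a.get('style', ''))
    # Drop every attribute, then restore only the background style.
    a.attrs = {}
    if match:
        a.attrs['style'] = match.group(0)
print(soup)

Output: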

<table border="0" cellpadding="0" cellspacing="0" class="MsoNormalTable" style="width:385.0pt;margin-left:-.15pt;border-collapse:collapse" width="513"> 
<tr style="height:15.0pt"> 
<td>hdinka</td> 
<td>kya</td> 
</tr> 
<tr style="height:15.0pt"> 
<td>chika</td> 
<td style="background:red"> </td> 
</tr> 
<tr style="height:15.0pt"> 
<td>pongal</td> 
<td>hawk</td> 
</tr> 
</table> 
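
The `background:\w+` pattern only captures values made of word characters, so a value such as `background: #ff0000` (a space and a `#`) would not be picked up in full. Below is a minimal sketch of one way to read the declaration without a regex, assuming the inline style is a plain semicolon-separated list of `name:value` pairs. The helper name `background_of` is made up for this illustration and is not part of BeautifulSoup.

```python
def background_of(style):
    """Return the 'background' declaration of an inline style, or None.

    Assumption: declarations are separated by ';' and no value contains
    a ';' of its own (e.g. no data: URLs).
    """
    for declaration in style.split(';'):
        name, sep, value = declaration.partition(':')
        # Keep only the exact 'background' property; 'background-color'
        # and similar names are deliberately ignored in this sketch.
        if sep and name.strip().lower() == 'background':
            return 'background:' + value.strip()
    return None


sample = 'width:77.0pt;border-left:none;background:red;height:15.0pt'
print(background_of(sample))
print(background_of('width:77.0pt;height:15.0pt'))
```

The returned string could be assigned to a cell's style attribute in place of the regex match; a cell whose style yields `None` would simply be left with no attributes.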

Or, using only the re module without BeautifulSoup, you could do it like this:

import re 

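# Pass 1: strip the attributes of every <td> whose line has no "background" after it.
res = re.sub(r'(<td)(?!.*background).*?(>)', r'\1\2', html)
# Pass 2: reduce the remaining <td> tags to their background style only.
res = re.sub(r'<td.*(background:\w+).*?>', r'<td style="\1">', res)
print(res)

Output: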

<table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0 width=513 style='width:385.0pt;margin-left:-.15pt;border-collapse:collapse'> 
<tr style='height:15.0pt'> 
<td>hdinka</td> 
<td>kya</td> 
</tr> 
<tr style='height:15.0pt'> 
<td>chika</td> 
<td style="background:red">&nbsp</td> 
</tr> 
<tr style='height:15.0pt'> 
<td>pongal</td> 
<td>hawk</td> 
</tr> 
</table> 
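
Because `.` does not match a newline, the two substitutions above rely on every `<td>` sitting on its own line, as in the sample. If several cells can share a line, one alternative is to match each opening tag separately and pass a function as the replacement to `re.sub`. This is only a sketch under the assumption that no attribute value contains a `>`; the names `TD_TAG`, `simplify_td` and `strip_td_attributes` are hypothetical.

```python
import re

# One opening <td ...> tag; [^>]* cannot run past the tag's own '>'.
TD_TAG = re.compile(r'<td\b[^>]*>', re.IGNORECASE)
# Same value pattern as in the answer: word characters only.
BACKGROUND = re.compile(r'background:\w+')


def simplify_td(match):
    """Rebuild a single <td> tag, keeping only its background style."""
    found = BACKGROUND.search(match.group(0))
    if found:
        return '<td style="{}">'.format(found.group(0))
    return '<td>'


def strip_td_attributes(html):
    """Apply simplify_td to every opening <td> tag in html."""
    return TD_TAG.sub(simplify_td, html)


row = ("<tr><td width=411 style='width:308.0pt;background:red;height:15.0pt'>a</td>"
       "<td width=103 nowrap>b</td></tr>")
print(strip_td_attributes(row))
```

Each tag is inspected on its own, so a background in one cell does not affect its neighbours on the same line.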

If you want to do this for all tags, not just <td>, you could try something like this:

res = re.sub(r'(<\w+)(?!.*background).*?(>)', r'\1\2', html) 
res = re.sub(r'(<\w+).*(background:\w+).*?>', r'\1 style="\2">', res) 
print(res)

Output:

<table> 
<tr> 
<td>hdinka</td> 
<td>kya</td> 
</tr> 
<tr> 
<td>chika</td> 
<td style="background:red">&nbsp</td> 
</tr> 
<tr> 
<td>pongal</td> 
<td>hawk</td> 
</tr> 
</table> 