2017-06-01 53 views
1

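Replace all <img> tags with a single word in an XML file

I have XML files that contain many tweets with HTML markup. Among other tasks, I need to replace every <img> tag with the word @emoji. I wrote the following code: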

for word in re.findall(r"&lt;img[\w\W]+?/&gt;",line): 
    print(word) 
    line = line.replace(word,'@emoji') 

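This works perfectly for a single line. But when I try to run it in a loop over the whole file, it never enters the inner loop. Here is the code: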

import re 
import xml.etree.ElementTree as ET  # XML library 
filename = 'da0d0e3527b931bb0bc6f5435003ea2a.xml' 
tree = ET.parse(filename) 
root = tree.getroot() 
twits = [] 
for child in root: 
    for grandchild in child: 
        twits.append(grandchild.text) 
for line in twits: 
    for word in re.findall(r"&lt;img[\w\W]+?&gt;",line): 
        line = line.replace(word,'@img') 
    print(line) 
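
One likely reason the inner loop never runs: ElementTree decodes XML entities while parsing, so `grandchild.text` holds a literal `<img ...>` rather than the escaped `&lt;img ...&gt;` that appears in the raw file. The sketch below illustrates this on a small sample. The sample string and the `<img[^>]*>` pattern are assumptions made for illustration, not part of the original post, and the pattern assumes no attribute value contains a `>` character.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-document sample in the same shape as the real file.
SAMPLE = (
    '<author><documents>'
    '<document id="1">beat me to it '
    '&lt;img class="Emoji" src="x.png" title="Winking face"&gt;</document>'
    '<document id="2"></document>'
    '</documents></author>'
)

root = ET.fromstring(SAMPLE)
# .text is None for empty elements, so fall back to an empty string.
texts = [doc.text or "" for doc in root.iter("document")]

# Pattern from the question: looks for the escaped form, which is gone after parsing.
escaped_hits = re.findall(r"&lt;img[\w\W]+?&gt;", texts[0])
# Pattern written against the decoded text.
decoded_hits = re.findall(r"<img[^>]*>", texts[0])

# re.sub replaces every match in one call, so no inner loop is needed.
cleaned = [re.sub(r"<img[^>]*>", "@emoji", t) for t in texts]
```

Here `escaped_hits` is empty while `decoded_hits` contains the tag, which matches the behaviour described in the question.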

I also tried an HTML parser like this, but I could not put a string in place of the tag:

imgs = soup.find_all('img') 
for img in imgs: 
    print(img) 
    emo = str(img) 
    twit.replace(emo,'@emoji') 
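
A note on the last line of that attempt: Python strings are immutable, so `str.replace` returns a new string and leaves the original untouched. The sketch below uses plain hypothetical strings in place of `twit` and `str(img)` to show the difference. Separately, the string a parser serializes for a tag may not be character-for-character identical to the source text, in which case a literal replace would find nothing.

```python
# Hypothetical tweet text and tag string, standing in for `twit` and `str(img)`.
twit = 'lazy day <img class="Emoji" src="x.png">'
emo = '<img class="Emoji" src="x.png">'

# The return value is discarded here, so twit still contains the tag.
twit.replace(emo, "@emoji")
unchanged = twit

# Rebinding the name keeps the replaced text.
twit = twit.replace(emo, "@emoji")
```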

The XML file is too large to post in full, but it looks like this:

<author> 
    <documents> 
     <document id="396228853267714048" url="https://twitter.com/ReissSudden/status/396228853267714048">Sooooo many slutty cats knocking around last night</document> 
     <document id="396229373554360320" url="https://twitter.com/ReissSudden/status/396229373554360320">&lt;a href="/AndyLee666" class="twitter-atreply pretty-link js-nav" dir="ltr" data-mentioned-user-id="259958055" &gt;&lt;s&gt;@&lt;/s&gt;&lt;b&gt;AndyLee666&lt;/b&gt;&lt;/a&gt; yep, eye hurts but doesn&amp;#39;t look bad ha ha</document> 
     <document id="396326071467270144" url="https://twitter.com/ReissSudden/status/396326071467270144">Time to start saving for a Skyline</document> 
     <document id="396326916372054016" url="https://twitter.com/ReissSudden/status/396326916372054016">@LaurenWeale where were your halo and wings then?</document> 
     <document id="396327202260017152" url="https://twitter.com/ReissSudden/status/396327202260017152">@LaurenWeale I didn&amp;#39;t see them, and besides, it&amp;#39;s not a scary costume</document> 
     <document id="396327842252075008" url="https://twitter.com/ReissSudden/status/396327842252075008">@LaurenWeale ahh beat me to it &lt;img class="Emoji Emoji--forText" src="https://abs.twimg.com/emoji/v2/72x72/1f609.png" draggable="false" alt="&#128521;" title="Winking face" aria-label="Emoji: Winking face"&gt;</document> 
     <document id="396328213074677763" url="https://twitter.com/ReissSudden/status/396328213074677763">The best chair ever! &lt;a href="/hashtag/halloween?src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" &gt;&lt;s&gt;#&lt;/s&gt;&lt;b&gt;halloween&lt;/b&gt;&lt;/a&gt; &lt;a href="/hashtag/Throne?src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" &gt;&lt;s&gt;#&lt;/s&gt;&lt;b&gt;Throne&lt;/b&gt;&lt;/a&gt; &lt;a href="/hashtag/Devil?src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" &gt;&lt;s&gt;#&lt;/s&gt;&lt;b&gt;Devil&lt;/b&gt;&lt;/a&gt; &lt;a href="/hashtag/anyforty?src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" &gt;&lt;s&gt;#&lt;/s&gt;&lt;b&gt;anyforty&lt;/b&gt;&lt;/a&gt; &lt;a href="/hashtag/wasted?src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" &gt;&lt;s&gt;#&lt;/s&gt;&lt;b&gt;wasted&lt;/b&gt;&lt;/a&gt; &lt;a href="/hashtag/king?src=hash" data-query-source="hashtag_click" class="twitter-hashtag pretty-link js-nav" dir="ltr" &gt;&lt;s&gt;#&lt;/s&gt;&lt;b&gt;king&lt;/b&gt;&lt;/a&gt; &lt;a href="http://somelink" rel="nofollow noopener" dir="ltr" data-expanded-url="http://instagram.com/p/gLi_B9EfOp/" class="twitter-timeline-link" target="_blank" title="http://instagram.com/p/gLi_B9EfOp/" &gt;&lt;span class="tco-ellipsis"&gt;&lt;/span&gt;&lt;span class="invisible"&gt;http://&lt;/span&gt;&lt;span class="js-display-url"&gt;instagram.com/p/gLi_B9EfOp/&lt;/span&gt;&lt;span class="invisible"&gt;&lt;/span&gt;&lt;span class="tco-ellipsis"&gt;&lt;span class="invisible"&gt;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;</document> 
     <document id="396328831285735424" url="https://twitter.com/ReissSudden/status/396328831285735424">@LaurenWeale sorry, that was mean</document> 
     <document id="396337843909713920" url="https://twitter.com/ReissSudden/status/396337843909713920">@LaurenWeale :(don&amp;#39;t be like that</document> 
     <document id="396342701568040960" url="https://twitter.com/ReissSudden/status/396342701568040960">@LaurenWeale be like that then &lt;img class="Emoji Emoji--forText" src="https://abs.twimg.com/emoji/v2/72x72/1f624.png" draggable="false" alt="&#128548;" title="Face with look of triumph" aria-label="Emoji: Face with look of triumph"&gt;</document> 
     <document id="396345875360129024" url="https://twitter.com/ReissSudden/status/396345875360129024">Been a pure lazy day today &lt;img class="Emoji Emoji--forText" src="https://abs.twimg.com/emoji/v2/72x72/1f44c.png" draggable="false" alt="&#128076;" title="Ok hand sign" aria-label="Emoji: Ok hand sign"&gt;</document> 
    </documents> 
</author> 

Thanks for your help!

+0

Have you tried `re.sub(pattern, '@emoji', line)` yet? That would make the loop obsolete. – Boldewyn

+0

@Boldewyn I just tried it, and it still doesn't work :( –

+0

Why not just use `sed`? –

Answers

1

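You can read the file before parsing it, run re.sub on that raw data to replace the img tags with @emoji, and then parse the result with ET.fromstring. You could do it like this: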

import re 
import xml.etree.ElementTree as ET  # XML library 
filename = 'da0d0e3527b931bb0bc6f5435003ea2a.xml' 
# Read the raw text first, while the escaped &lt;img is still present 
with open(filename) as f: 
    data = re.sub(r"&lt;img[\w\W]", "&lt;@emoji", f.read()) 
tree = ET.fromstring(data) 

Now the data will contain &lt;@emoji in place of &lt;img everywhere. You can then parse the resulting data however you like.
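
The pattern in this answer matches only the start of each escaped tag, so the tag's attributes remain in the text. If the goal is to drop the whole tag, as the question asks, a minimal sketch could look like the following. The helper name `replace_img_in_raw_xml` and the sample string are hypothetical, and the pattern assumes every escaped `&lt;img` is closed by an `&gt;` in the same document.

```python
import re
import xml.etree.ElementTree as ET

def replace_img_in_raw_xml(raw, word="@emoji"):
    """Replace each escaped <img ...> tag in raw XML text with `word`."""
    # Non-greedy match from "&lt;img" up to the first "&gt;" that follows it.
    return re.sub(r"&lt;img\b.*?&gt;", word, raw, flags=re.DOTALL)

# Hypothetical sample in the same shape as the file from the question.
RAW = (
    '<author><documents>'
    '<document id="1">lazy day &lt;img class="Emoji" alt="&#128076;" '
    'title="Ok hand sign"&gt;</document>'
    '<document id="2">no emoji here</document>'
    '</documents></author>'
)

# Substitute on the raw text, then parse the cleaned XML.
root = ET.fromstring(replace_img_in_raw_xml(RAW))
texts = [doc.text for doc in root.iter("document")]
```

For a real file, the raw text would come from reading the file (for example with `open(filename, encoding="utf-8")`, assuming the file is UTF-8) before passing it to the helper.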

+0

Thanks, it's working! –
