2013-03-03 69 views

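Cleaning a string (possibly an encoded string) in Python

I am trying to read some data that is supposed to be tab-delimited, but I see a lot of #F0# markers in it.

How can I clean up the text?

Sample snippet: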

title=#F0#Sometimes#F0#the#F0#Grave#F0#Is#F0#a#F0#Fine#F0#and#F0#Public#F0#Place.#F0#|url=http://query.nytimes.com/gst/fullpage.html? 
res=940DEFD71230F93BA15750C0A9629C8B63#F0#|quote=New#F0#Jersey#F0#is,#F0#indeed,#F0#a#F0#hom 
e#F0#of#F0#poets.#F0#Walt#F0#Whitman's#F0#tomb#F0#is#F0#nestled#F0#in#F0#a#F0#wooded#F0#grov 
e#F0#in#F0#the#F0#Harleigh#F0#Cemetery#F0#in#F0#Camden.#F0#Joyce#F0#Kilmer#F0#is#F0#buried#F 
0#in#F0#Elmwood#F0#Cemetery#F0#in#F0#New#F0#Brunswick,#F0#not#F0#far#F0#from#F0#the#F0#New#F 
0#Jersey#F0#Turnpike#F0#rest#F0#stop#F0#named#F0#in#F0#his#F0#honor.#F0#Allen#F0#Ginsberg#F0 
#may#F0#not#F0#yet#F0#have#F0#a#F0#rest#F0#stop,#F0#but#F0#the#F0#Beat#F0#Generation#F0#auth 
or#F0#of#F0#"Howl"#F0#is#F0#resting#F0#at#F0#B'Nai#F0#Israel#F0#Cemetery#F0#in#F0#Newark.#F0 
#|work=The#F0#New#F0#York#F0#Times#F0#|date=March#F0#28,#F0#2004#F0#|accessdate=August#F0#21 

Where does this data come from? (Just doing s.replace('#F0#', ' ') is easy enough; I am only wondering what other encodings are in there.) – 2013-03-03 01:43:54


It is an XML dataset, processed with Hadoop. – user2052251 2013-03-03 01:50:16

Answers


Make title and res strings, then use s.replace(old, new):

title="#F0#Sometimes#F0#the#F0#Grave#F0#Is#F0#a#F0#Fine#F0#and#F0#Public#F0#Place.#F0#|url=http://query.nytimes.com/gst/fullpage.html?" 
res="""940DEFD71230F93BA15750C0A9629C8B63#F0#|quote=New#F0#Jersey#F0#is,#F0#indeed,#F0#a#F0#hom 
e#F0#of#F0#poets.#F0#Walt#F0#Whitman's#F0#tomb#F0#is#F0#nestled#F0#in#F0#a#F0#wooded#F0#grov 
e#F0#in#F0#the#F0#Harleigh#F0#Cemetery#F0#in#F0#Camden.#F0#Joyce#F0#Kilmer#F0#is#F0#buried#F 
0#in#F0#Elmwood#F0#Cemetery#F0#in#F0#New#F0#Brunswick,#F0#not#F0#far#F0#from#F0#the#F0#New#F 
0#Jersey#F0#Turnpike#F0#rest#F0#stop#F0#named#F0#in#F0#his#F0#honor.#F0#Allen#F0#Ginsberg#F0 
#may#F0#not#F0#yet#F0#have#F0#a#F0#rest#F0#stop,#F0#but#F0#the#F0#Beat#F0#Generation#F0#auth 
or#F0#of#F0#"Howl"#F0#is#F0#resting#F0#at#F0#B'Nai#F0#Israel#F0#Cemetery#F0#in#F0#Newark.#F0 
#|work=The#F0#New#F0#York#F0#Times#F0#|date=March#F0#28,#F0#2004#F0#|accessdate=August#F0#21""" 

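# The marker contains the digit zero ('#F0#'), not the letter O, and it stands in for a space.
title = title.replace('#F0#', ' ').strip()
# res was pasted with hard line wraps that split some markers ('#F' / '0#'), so rejoin the lines first.
res = ''.join(line.rstrip() for line in res.splitlines())
res = res.replace('#F0#', ' ').strip()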
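
To address the comment about other encodings, here is a minimal sketch that first counts every marker of the same shape and then cleans a whole record. It rests on several assumptions that the question does not confirm: that markers are always `#` plus two hex digits plus `#`, that real spaces are always encoded as `#F0#` (so a space before a line break is only a wrapping artifact), and that fields are separated by `|` with `key=value` pairs. The names `MARKER`, `join_wrapped`, `count_markers`, and `clean_record` are hypothetical helpers, not part of any library.

```python
import re
from collections import Counter

# Assumption: every marker looks like '#' + two hex digits + '#', e.g. '#F0#'.
MARKER = re.compile(r"#[0-9A-Fa-f]{2}#")


def join_wrapped(text):
    """Undo hard line wraps: drop each line break and the whitespace before it."""
    return "".join(line.rstrip() for line in text.splitlines())


def count_markers(text):
    """Count each distinct marker so unexpected ones (besides '#F0#') show up."""
    return Counter(MARKER.findall(join_wrapped(text)))


def clean_record(text, marker="#F0#"):
    """Replace the marker with a space and split the record into a dict.

    Assumes '|' separates fields and the first '=' separates key from value.
    """
    joined = join_wrapped(text).replace(marker, " ")
    fields = {}
    for part in joined.split("|"):
        key, sep, value = part.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            # Collapse runs of whitespace left behind by adjacent markers.
            fields[key.strip()] = " ".join(value.split())
    return fields


# Shortened, made-up sample in the same shape as the data in the question.
sample = (
    "title=#F0#Fine#F0#Place.#F0#|quote=New#F0#Jersey#F \n"
    "0#Turnpike#F0#|date=March#F0#28,#F0#2004"
)
print(count_markers(sample))
print(clean_record(sample))
```

If `count_markers` reports anything other than `#F0#`, those markers probably need their own replacement rules before the record can be trusted.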