2016-01-23 223 views

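Regex works on pythex but not in Python 2.7: finding Unicode escapes with a regex

I have a strange regex problem: my regex works on pythex but not in Python itself. I am using 2.7 right now. I want to remove every instance of a Unicode escape like \x92, and there are many of them (for example 'Thomas Bradley \x93Brad\x94 Garza').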

import re, requests 

def purify(string): 
    strange_issue = r"""\\t<td><font size=2>G<td><a href="http://facebook.com/KilledByPolice/posts/625590984135709" target=new><font size=2><center>facebook.com/KilledByPolice/posts/625590984135709\t</a><td><a href="http://www.orlandosentinel.com/news/local/lake/os-leesburg-officer-involved-shooting-20130507""" 
    unicode_chars_rgx = r"[\\][x]\d+" 
    unicode_matches = re.findall(unicode_chars_rgx, string) 
    bad_list = [strange_issue] 
    bad_list.extend(unicode_matches) 
    for item in bad_list: 
        string = string.replace(item, "")
    return string 

name_rgx = r"(?:[<][TDtd][>])|(?:target[=]new[>])(?P<the_deceased>[A-Z].*?)[,]" 

urls = {2013: "http://www.killedbypolice.net/kbp2013.html",
        2014: "http://www.killedbypolice.net/kbp2014.html",
        2015: "http://www.killedbypolice.net/"}

names_of_the_dead = [] 

for url in urls.values(): 
    response = requests.get(url) 
    content = response.content 
    people_killed_by_police_that_year_alone = re.findall(name_rgx, content) 
    for dead_person in people_killed_by_police_that_year_alone: 
        names_of_the_dead.append(purify(dead_person))

dead_americans_as_string = ", ".join(names_of_the_dead) 
print("RIP, {} since 2013:\n".format(len(names_of_the_dead))) # 3085! :) 
print(dead_americans_as_string) 



In [95]: unicode_chars_rgx = r"[\\][x]\d+" 

In [96]: testcase = "Myron De\x92Shawn May" 

In [97]: x = purify(testcase) 

In [98]: x 
Out[98]: 'Myron De\x92Shawn May' 

In [103]: match = re.match(unicode_chars_rgx, testcase) 

In [104]: match 

How can I match these \x00 characters? Any help is appreciated, thanks.

Answers

1

Certainly not by trying to find things that look like "\\x00".
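
To spell out why the pattern finds nothing: in an ordinary Python string literal, "\x92" is a single character with code 0x92, not the four characters backslash, x, 9, 2, so a pattern that looks for a literal backslash has nothing to match. The raw string below stands in for text where the escape is spelled out character by character, which is presumably what was pasted into pythex. A minimal sketch; the variable names are only illustrative:

```python
import re

pattern = r"[\\][x]\d+"               # the pattern from the question

testcase = "Myron De\x92Shawn May"    # \x92 is ONE character here
literal = r"Myron De\x92Shawn May"    # raw string: backslash, x, 9, 2

one_char = len("\x92")                # 1
four_chars = len(r"\x92")             # 4

no_hits = re.findall(pattern, testcase)   # nothing to match
hits = re.findall(pattern, literal)       # matches the spelled-out escape

print(no_hits)  # []
print(hits)     # ['\\x92']
```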

If you want to destroy the data:

>>> re.sub('[\x7f-\xff]', '', "Myron De\x92Shawn May") 
'Myron DeShawn May' 

More work, but this will try to preserve the text as well as possible:

>>> import unidecode 
>>> unidecode.unidecode("Myron De\x92Shawn May".decode('cp1251')) 
"Myron De'Shawn May" 
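
A standard-library-only variant of the same idea, sketched under the assumption that the bytes are Windows-1252 (cp1252) encoded. The mapping table and its name are illustrative and are not part of unidecode:

```python
# Assumption: the scraped bytes are Windows-1252 (cp1252) encoded.
raw = b"Thomas Bradley \x93Brad\x94 Garza"

# Decode the bytes into Unicode text; \x93 and \x94 become curly quotes.
text = raw.decode("cp1252")

# Hypothetical mapping from a few typographic characters to ASCII.
ASCII_EQUIVALENTS = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}

cleaned = text.translate(ASCII_EQUIVALENTS)
print(cleaned)  # Thomas Bradley "Brad" Garza
```

Unlike the re.sub approach, this keeps the quotation marks instead of dropping them; characters missing from the table pass through unchanged.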

Definitely, fixing the Unicode properly is what I want to do – codyc4321


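Then you should probably be asking for the text rather than the content of the request to begin with (response.text instead of response.content). –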


Why 7f to ff? I see some Unicode escapes with only 2 digits, like 92 or 93 – codyc4321
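
On the range: 7f and ff are hexadecimal bounds, not digit counts. The class [\x7f-\xff] matches any character whose code lies between 0x7f (127) and 0xff (255), and 0x92 (146) and 0x93 (147) both fall inside it. A small check, with illustrative variable names:

```python
import re

low, high = 0x7f, 0xff   # 127 and 255 in decimal

# ord() returns the numeric code of a one-character string.
codes = [ord(ch) for ch in ("\x92", "\x93", "\x94")]
in_range = all(low <= code <= high for code in codes)

# Every character in that range is removed from the string.
cleaned = re.sub("[\x7f-\xff]", "", "Thomas Bradley \x93Brad\x94 Garza")
print(cleaned)  # Thomas Bradley Brad Garza
```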