Python Unicode Regular Expression

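I am using Python 2.4 and am having some problems with Unicode regular expressions. I have tried to put together a clear and concise example of my problem. It looks as though there is either a problem with how Python recognizes the different character encodings or a problem with my understanding. Thanks a lot for taking a look!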

#!/usr/bin/python 
# 
# This is a simple python program designed to show my problems with regular expressions and character encoding in python 
# Written by Brian J. Stinar 
# Thanks for the help! 

import urllib # To get files off the Internet 
import chardet # To identify character encodings
import re # Python Regular Expressions 
#import ponyguruma # Python Oniguruma Regular Expressions - this can be uncommented if you feel like messing with it, but I have the same issue no matter which REs I'm using

rawdata = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read() 
print (chardet.detect(rawdata)) 
#print (rawdata) 

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2') # Let's grab this as text 
UTF_8_encoded = ISO_8859_2_encoded.encode('utf-8') # and encode the text as UTF-8 
print(chardet.detect(UTF_8_encoded)) # Looks good 

# This totally doesn't work, even though you can see UNSUBSCRIBE in the HTML 
# Eventually, I want to recognize the entire physical address and UNSUBSCRIBE above it 
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE) 
print (str(re_UNSUB_amsterdam.match(UTF_8_encoded)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on UTF-8") 
print (str(re_UNSUB_amsterdam.match(rawdata)) + "\t\t\t\t\t--- RE for UNSUBSCRIBE on raw data") 

re_amsterdam = re.compile(".*Adobe.*", re.UNICODE) 
print (str(re_amsterdam.match(rawdata)) + "\t--- RE for 'Adobe' on raw data") # However, this works?!?
print (str(re_amsterdam.match(UTF_8_encoded)) + "\t--- RE for 'Adobe' on UTF-8") 

''' 
# In addition, I tried this regular expression library, with much the same unsatisfactory result
new_re = ponyguruma.Regexp(".*UNSUBSCRIBE.*") 
if new_re.match(UTF_8_encoded) != None: 
    print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on UTF-8") 
else: 
    print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on UTF-8") 

if new_re.match(rawdata) != None: 
    print("Ponyguruma RE matched! \t\t\t--- RE for UNSUBSCRIBE on raw data") 
else: 
    print("Ponyguruma RE did not match\t\t--- RE for UNSUBSCRIBE on raw data") 

new_re = ponyguruma.Regexp(".*Adobe.*") 
if new_re.match(UTF_8_encoded) != None: 
    print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on UTF-8") 
else: 
    print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on UTF-8") 

new_re = ponyguruma.Regexp(".*Adobe.*") 
if new_re.match(rawdata) != None: 
    print("Ponyguruma RE matched! \t\t\t--- RE for Adobe on raw data") 
else: 
    print("Ponyguruma RE did not match\t\t\t--- RE for Adobe on raw data") 
'''

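I am working on a substitution project and am having a difficult time with non-ASCII encoded files. This problem is part of a bigger project: eventually I would like to replace the text with other text (I have this working for ASCII, but I cannot yet identify the occurrences in other encodings). Thanks again.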

http://brian-stinar.blogspot.com

-Brian J. Stinar-


2009-07-22 Brian Stinar

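What is completely missing from your description is the way in which your code fails. You write *"# This totally doesn't work"* in your code, but you give no hint as to how it doesn't work. Is the printed string empty? Do you get an error message or a stack trace? – ThomasH 2009-07-23 12:15:31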

You probably want to either enable the DOTALL flag or use the search method instead of the match method. That is:

# DOTALL makes . match newlines 
re_UNSUB_amsterdam = re.compile(".*UNSUBSCRIBE.*", re.UNICODE | re.DOTALL)

or:

# search will find matches even if they aren't at the start of the string 
... re_UNSUB_amsterdam.search(foo) ...

These will give you different results, but both should give you matches. (Check which one is the kind of match you want.)
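To make the difference concrete, here is a minimal sketch on a made-up two-line string. The sample text is an assumption standing in for the page from the question, and the snippet is written for Python 3, where str is already Unicode text.

```python
import re

# Hypothetical sample: the keyword sits on the second line, as on the question's page
text = "Adobe legal notices\nClick UNSUBSCRIBE to opt out\n"

pattern = re.compile(".*UNSUBSCRIBE.*")
dotall_pattern = re.compile(".*UNSUBSCRIBE.*", re.DOTALL)

# match() anchors at position 0 and '.' stops at the first newline, so this is None
no_match = pattern.match(text)

# search() may start anywhere, so it matches only the line holding the keyword
line_match = pattern.search(text)

# With DOTALL, '.' also matches newlines, so match() succeeds and spans the whole text
full_match = dotall_pattern.match(text)

print(no_match)
print(repr(line_match.group(0)))
print(full_match.span())
```

search() without DOTALL returns just the line containing the keyword, while match() with DOTALL returns everything, which is why the two fixes are not interchangeable.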

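As an aside: you seem to be confusing encoded text (which is bytes) with decoded text (which is characters). This is not uncommon, especially in pre-3.x Python. In particular, this is very suspicious: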

ISO_8859_2_encoded = rawdata.decode('ISO-8859-2')

You are de-coding with ISO-8859-2, not en-coding, so call this variable "decoded". (Why not "ISO_8859_2_decoded"? Because ISO_8859_2 is an encoding, and a decoded string no longer has an encoding.)

The rest of your code tries to match on rawdata and on UTF_8_encoded (both encoded strings) when it should probably be using the decoded Unicode string instead.
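The bytes-versus-text distinction can be sketched as follows. This is Python 3, where the two types are bytes and str (in Python 2 they are str and unicode), and the sample bytes are made up for illustration rather than taken from the page in the question.

```python
# Hypothetical ISO-8859-2 bytes: the keyword followed by two Polish letters
raw = b"UNSUBSCRIBE \xb1\xe6"

# Decoding turns encoded bytes into text; the result no longer "has" an encoding
decoded = raw.decode("iso-8859-2")

# Encoding turns text back into bytes, here in a different encoding
utf8_bytes = decoded.encode("utf-8")

print(type(raw).__name__, type(decoded).__name__, type(utf8_bytes).__name__)

# Text length counts characters; encoded length counts bytes
print(len(decoded), len(utf8_bytes))

# Searching belongs on the decoded text
print("UNSUBSCRIBE" in decoded)
```

The decoded string has 14 characters while its UTF-8 form has 16 bytes, which is one reason to do all matching on the decoded text and encode only when writing output.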


2009-07-23 00:14:40

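Thanks a lot. After adding the re.DOTALL flag, it behaves exactly as I expected. It looked as though .* behaved differently on ASCII, where it matched newlines for me, than on decoded non-ASCII text, where it did not, but I may just be unclear on this. Thank you for clarifying encoded versus decoded text. This is my first project dealing with different encodings, and I appreciate the clarification. – 2009-07-24 14:37:48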

This might help: http://www.daa.com.au/pipermail/pygtk/2009-July/017299.html


2009-07-23 00:02:25 b3rx

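With the default flag settings, .* does not match newlines. UNSUBSCRIBE appears only once, after the first newline. Adobe occurs before the first newline. You could fix this by using re.DOTALL.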

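However, you have not inspected what you got with the Adobe match: it is 1478 bytes wide! Turn on re.DOTALL and it (and the corresponding UNSUBSCRIBE pattern) will match the whole text!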

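You definitely need to lose the trailing .* since you are not interested in what it matches and it slows the match down. You should also lose the leading .* and use search() instead of match().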
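A minimal sketch of the difference in match width, using a made-up string in place of the decoded page (Python 3; the sample text is an assumption):

```python
import re

# Hypothetical stand-in for the decoded page text
text = "Adobe Systems\nsome footer text\nUNSUBSCRIBE here\n"

# Leading and trailing .* with DOTALL: the match swallows the whole text
greedy = re.compile(".*UNSUBSCRIBE.*", re.DOTALL).match(text)

# Bare keyword with search(): the match covers only the keyword itself
tight = re.compile("UNSUBSCRIBE").search(text)

print(greedy.span())
print(tight.span())
print(text[tight.start():tight.end()])
```

The tight match gives the exact position of the keyword, which is what a later substitution step needs.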

The re.UNICODE flag is of no use to you in this case. Read the manual and see what it does.
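As the re documentation describes it, the flag only changes how class escapes such as \w, \d, \s and \b classify characters; it has no effect on literal text or on '.'. The sketch below is for Python 3, where str patterns already behave as if re.UNICODE were set, so re.ASCII is used to show the "flag off" behaviour. The sample strings are made up.

```python
import re

word = "Stra\u00dfe"  # "Straße": one non-ASCII letter in the middle

# Unicode-aware \w (the Python 3 default for str patterns) treats ß as a word character
unicode_words = re.findall(r"\w+", word)

# ASCII-only \w does not, so the word is split around it
ascii_words = re.findall(r"\w+", word, re.ASCII)

# A pattern made only of literal characters is unaffected by the flag
literal_default = re.search("UNSUBSCRIBE", "please UNSUBSCRIBE")
literal_ascii = re.search("UNSUBSCRIBE", "please UNSUBSCRIBE", re.ASCII)

print(unicode_words, ascii_words)
print(literal_default.span() == literal_ascii.span())
```

Because the patterns in the question contain only literals and '.', the flag changes nothing for them.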

Why are you transcoding your data into UTF-8 and searching on that? Just leave it in Unicode.

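Someone else pointed out that, in general, you need to decode things like &#1234; before doing any serious work on your data... but did not mention the &laquo; and similar entities that your data is peppered with :-)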
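One way to handle both kinds of reference is sketched below. It assumes Python 3.4 or later, where the standard library provides html.unescape; that function is not available in the Python 2.4 used in the question, and the sample fragment is made up.

```python
import html

# Hypothetical fragment mixing named entities and a numeric character reference
fragment = "&laquo;UNSUBSCRIBE&raquo; &#1234;"

# html.unescape converts both forms into the characters they stand for
text = html.unescape(fragment)

print(text)
print(len(fragment), len(text))
```

Running this step after decoding the bytes and before any matching means patterns see the real characters rather than their markup spellings.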


2009-07-23 02:11:17


Your question is about regular expressions, but your problem can possibly be solved without them. Just use the standard string replace method instead.

import urllib 
raw = urllib.urlopen('http://www.cs.unm.edu/~brian.stinar/legal.html').read() 
decoded = raw.decode('iso-8859-2') 
type(decoded) # decoded is now <type 'unicode'> 
substituted = decoded.replace(u'UNSUBSCRIBE', u'whatever you prefer')

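If nothing else, the above shows how to handle the encoding: simply decode into a Unicode string and work with that. Note, however, that this only works well when you have one or a very small number of replacements (and those replacements are not pattern-based), because replace() handles only one replacement at a time.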

For both string-based and pattern-based substitutions, you can do something like this to perform multiple replacements at once:

import re 
REPLACEMENTS = ((u'[aA]dobe', u'!twiddle!'), 
       (u'UNS.*IBE', u'@[email protected]'), 
       (u'Dublin', u'Sydney')) 

def replacer(m): 
    # Only one alternative's group is non-None; its position selects the replacement
    return REPLACEMENTS[list(m.groups()).index(m.group(0))][1] 

r = re.compile('|'.join('(%s)' % t[0] for t in REPLACEMENTS)) 
substituted = r.sub(replacer, decoded)
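For readers who want to try the idea without fetching the page, here is a self-contained variation for Python 3. It is a sketch, not the code above: the sample text and replacement strings are made up, it looks up the matching pattern with m.lastindex instead of list.index, and it uses \w* in place of .* so that the UNS...IBE pattern cannot run across several words.

```python
import re

# Hypothetical replacement table: (pattern, replacement) pairs
REPLACEMENTS = (
    (r"[aA]dobe", "!twiddle!"),
    (r"UNS\w*IBE", "@opt-out@"),
    (r"Dublin", "Sydney"),
)

# One capturing group per pattern, so the group number identifies the pattern.
# This relies on the individual patterns containing no capturing groups themselves.
combined = re.compile("|".join("(%s)" % pattern for pattern, _ in REPLACEMENTS))

def replacer(m):
    # m.lastindex is the 1-based index of the group that matched
    return REPLACEMENTS[m.lastindex - 1][1]

decoded = "adobe in Dublin: UNSUBSCRIBE"  # made-up, already-decoded text
substituted = combined.sub(replacer, decoded)
print(substituted)
```

A single pass over the text applies every replacement, so earlier substitutions are never re-scanned by later patterns.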


2009-07-23 04:01:02 mhawke
