2017-10-21

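How to preserve ASCII hex code points when writing an ElementTree in Python?

I've loaded an XML file (Rhythmbox's database file) into Python 3 via the ElementTree parser. After modifying the tree and writing it to disk with ElementTree.write() using the ascii encoding, every character that was stored as a hexadecimal character reference comes out as a decimal character reference. For example, here is a diff showing the copyright symbol: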

<  <copyright>&#xA9; WNYC</copyright> 
--- 
>  <copyright>&#169; WNYC</copyright> 

Is there any way to tell Python/ElementTree not to do this? I'd like all the hex code points to remain hex code points.
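The behaviour described in the question can be reproduced with a short script. This is only a minimal sketch: the one-element document stands in for the Rhythmbox database, and an in-memory buffer replaces the file on disk. It also shows why the hex form is lost: the parser stores the decoded character, so the tree has no record of how the reference was originally spelled.

```python
import io
from xml.etree import ElementTree

# A one-element stand-in for the real database file (assumption).
root = ElementTree.fromstring('<copyright>&#xA9; WNYC</copyright>')
tree = ElementTree.ElementTree(root)

# The parser decodes the reference, so the tree holds the real character.
assert root.text == '\xa9 WNYC'

# Writing as ASCII re-escapes the character, this time in decimal form.
buf = io.BytesIO()
tree.write(buf, encoding='ascii')
output = buf.getvalue()
print(output)
```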


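How annoying. Sorry, I don't know ElementTree well enough to answer your question. (FWIW, my e-reader handles decimal better than hex, so I have the opposite problem.) If you don't find a way to force hex, it's easy to convert the decimal entities to hex with a regex. OTOH, most devices these days have good UTF-8 support, so you could convert those entities to Unicode and encode the output file as UTF-8. –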
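The regex fallback mentioned in this comment might look like the sketch below. The function name `decimal_refs_to_hex` is made up for illustration, and the code assumes the text is post-processed after ElementTree has written it. Note that it rewrites every decimal reference, including any that were decimal in the original file, so it is not an exact round trip.

```python
import re

# Matches decimal character references such as &#169; (hex ones are skipped,
# because the 'x' in &#xA9; is not a digit).
_DECIMAL_REF = re.compile(r'&#(\d+);')


def decimal_refs_to_hex(text):
    """Rewrite decimal character references as uppercase hex references."""
    return _DECIMAL_REF.sub(lambda m: '&#x%X;' % int(m.group(1)), text)


converted = decimal_refs_to_hex('<copyright>&#169; WNYC</copyright>')
untouched = decimal_refs_to_hex('<copyright>&#xA9; WNYC</copyright>')
print(converted)
```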


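I don't want to change the format of the database file by using a different encoding or different code points. I want it to stay fully compatible with Rhythmbox's format. – moorepants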


That makes sense. OTOH, I'd be surprised if Rhythmbox didn't use UTF-8 for its XML files. Of course, ASCII is a subset of UTF-8, so it's fine to make your XML strict ASCII even if Rhythmbox supports UTF-8. –

Answers

1

I found a solution. First I created a new codec error handler, then I monkey patched ElementTree._get_writer() to use the new error handler. It looks like this:

from xml.etree import ElementTree 
import io 
import contextlib 
import codecs 


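def lower_first(s):
    # Lowercase only the first character, e.g. 'XA9' -> 'xA9'.
    return s[:1].lower() + s[1:] if s else ''


def html_replace(exc):
    # Codec error handler that emits hex character references (&#xA9;)
    # instead of the decimal ones produced by xmlcharrefreplace.
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        s = []
        for c in exc.object[exc.start:exc.end]:
            # hex(169) is '0xa9': drop the leading '0', uppercase the
            # digits, then lowercase the 'x' again.
            s.append('&#%s;' % lower_first(hex(ord(c))[1:].upper()))
        return ''.join(s), exc.end
    else:
        # exc is an instance, so the class name comes from its type.
        raise TypeError("can't handle %s" % type(exc).__name__)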

codecs.register_error('html_replace', html_replace) 


# Monkey patch this Python function to prevent it from using
# xmlcharrefreplace. _get_writer() is a private helper, so its signature and
# return value may change between Python versions.
@contextlib.contextmanager
def _get_writer(file_or_filename, encoding):
    # returns text write method and release all resources after using
    try:
        write = file_or_filename.write
    except AttributeError:
        # file_or_filename is a file name
        if encoding == "unicode":
            file = open(file_or_filename, "w")
        else:
            file = open(file_or_filename, "w", encoding=encoding,
                        errors="html_replace")
        with file:
            yield file.write
    else:
        # file_or_filename is a file-like object
        # encoding determines if it is a text or binary writer
        if encoding == "unicode":
            # use a text writer as is
            yield write
        else:
            # wrap a binary writer with TextIOWrapper
            with contextlib.ExitStack() as stack:
                if isinstance(file_or_filename, io.BufferedIOBase):
                    file = file_or_filename
                elif isinstance(file_or_filename, io.RawIOBase):
                    file = io.BufferedWriter(file_or_filename)
                    # Keep the original file open when the BufferedWriter is
                    # destroyed
                    stack.callback(file.detach)
                else:
                    # This is to handle passed objects that aren't in the
                    # IOBase hierarchy, but just have a write method
                    file = io.BufferedIOBase()
                    file.writable = lambda: True
                    file.write = write
                    try:
                        # TextIOWrapper uses these methods to determine
                        # if BOM (for UTF-16, etc) should be added
                        file.seekable = file_or_filename.seekable
                        file.tell = file_or_filename.tell
                    except AttributeError:
                        pass
                file = io.TextIOWrapper(file,
                                        encoding=encoding,
                                        errors='html_replace',
                                        newline="\n")
                # Keep the original file open when the TextIOWrapper is
                # destroyed
                stack.callback(file.detach)
                yield file.write

ElementTree._get_writer = _get_writer

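I haven't studied your code closely (I'd need to learn more about ElementTree to fully understand it), but you could simplify the core of 'html_replace' to: 's.append('&#x%X;' % ord(c))', which is both more compact and faster. –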
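A self-contained sketch of an error handler using that simplified format string is shown below. The handler name `hex_charref` is hypothetical, chosen here so it does not clash with the answer's `html_replace`. The second half is an alternative that is not part of the answer: it serializes the tree to a `str` first and then encodes it with the handler, which avoids patching private ElementTree functions. It assumes the output does not need an XML declaration; whether Rhythmbox accepts that has not been checked.

```python
import codecs
from xml.etree import ElementTree


def hex_charref(exc):
    """Replace unencodable characters with uppercase hex references."""
    if isinstance(exc, (UnicodeEncodeError, UnicodeTranslateError)):
        chunk = exc.object[exc.start:exc.end]
        return ''.join('&#x%X;' % ord(c) for c in chunk), exc.end
    raise TypeError("can't handle %s" % type(exc).__name__)


# 'hex_charref' is a hypothetical handler name chosen for this sketch.
codecs.register_error('hex_charref', hex_charref)

# Direct use of the handler on a plain string.
encoded = '\xa9 WNYC'.encode('ascii', errors='hex_charref')

# Alternative route (an assumption, not the answer's method): serialize to
# text, then encode the text to ASCII with the custom handler.
root = ElementTree.fromstring('<copyright>&#xA9; WNYC</copyright>')
text = ElementTree.tostring(root, encoding='unicode')
data = text.encode('ascii', errors='hex_charref')
print(data)
```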
