2011-10-05 86 views
0

Malformed UTF-8 character (fatal error) when parsing XML with XML::LibXML

I am parsing an XML file with XML::LibXML. For the XML entry below I get this error:

Malformed UTF-8 character (fatal) at C:/Perl64/site/lib/XML/LibXML/Error.pm line 217 

That line is:

$context=~s/[^\t]/ /g; 

The entry in the XML is below:

<MedlineCitation Owner="NLM" Status="MEDLINE"> 
<PMID Version="1">15177811</PMID> 
<DateCreated> 
<Year>2004</Year> 
<Month>06</Month> 
<Day>04</Day> 
</DateCreated> 
<DateCompleted> 
<Year>2004</Year> 
<Month>08</Month> 
<Day>11</Day> 
</DateCompleted> 
<DateRevised> 
<Year>2011</Year> 
<Month>04</Month> 
<Day>07</Day> 
</DateRevised> 
<Article PubModel="Print"> 
<Journal> 
<ISSN IssnType="Print">0278-2626</ISSN> 
<JournalIssue CitedMedium="Print"> 
<Volume>55</Volume> 
<Issue>2</Issue> 
<PubDate> 
<Year>2004</Year> 
<Month>Jul</Month> 
</PubDate> 
</JournalIssue> 
<Title>Brain and cognition</Title> 
<ISOAbbreviation>Brain Cogn</ISOAbbreviation> 
</Journal> 
<ArticleTitle>Efficiency of orientation channels in the striate cortex for distributed categorization process.</ArticleTitle> 
<Pagination> 
<MedlinePgn>352-4</MedlinePgn> 
</Pagination> 
<Affiliation>Cognitive Science Department, Université de Liège, Belgium. [email protected]</Affiliation> 
<AuthorList CompleteYN="Y"> 
<Author ValidYN="Y"> 
<LastName>Mermillod</LastName> 
<ForeName>Martial</ForeName> 
<Initials>M</Initials> 
</Author> 
<Author ValidYN="Y"> 
<LastName>Chauvin</LastName> 
<ForeName>Alan</ForeName> 
<Initials>A</Initials> 
</Author> 
<Author ValidYN="Y"> 
<LastName>Guyader</LastName> 
<ForeName>Nathalie</ForeName> 
<Initials>N</Initials> 
</Author> 
</AuthorList> 
<Language>eng</Language> 
<PublicationTypeList> 
<PublicationType>Journal Article</PublicationType> 
</PublicationTypeList> 
</Article> 
<MedlineJournalInfo> 
<Country>United States</Country> 
<MedlineTA>Brain Cogn</MedlineTA> 
<NlmUniqueID>8218014</NlmUniqueID> 
<ISSNLinking>0278-2626</ISSNLinking> 
</MedlineJournalInfo> 
<CitationSubset>IM</CitationSubset> 
<CommentsCorrectionsList> 
<CommentsCorrections RefType="ErratumIn"> 
<RefSource>Brain Cogn. 2005 Jul;58(2):245</RefSource> 
</CommentsCorrections> 
<CommentsCorrections RefType="RepublishedIn"> 
<RefSource>Brain Cogn. 2005 Jul;58(2):246-8</RefSource> 
<PMID Version="1">16044513</PMID> 
</CommentsCorrections> 
</CommentsCorrectionsList> 
<MeshHeadingList> 
<MeshHeading> 
<DescriptorName MajorTopicYN="Y">Neural Networks (Computer)</DescriptorName> 
</MeshHeading> 
<MeshHeading> 
<DescriptorName MajorTopicYN="N">Neurons</DescriptorName> 
<QualifierName MajorTopicYN="N">physiology</QualifierName> 
</MeshHeading> 
<MeshHeading> 
<DescriptorName MajorTopicYN="N">Orientation</DescriptorName> 
<QualifierName MajorTopicYN="Y">physiology</QualifierName> 
</MeshHeading> 
<MeshHeading> 
<DescriptorName MajorTopicYN="N">Pattern Recognition, Visual</DescriptorName> 
<QualifierName MajorTopicYN="Y">physiology</QualifierName> 
</MeshHeading> 
<MeshHeading> 
<DescriptorName MajorTopicYN="N">Visual Cortex</DescriptorName> 
<QualifierName MajorTopicYN="Y">physiology</QualifierName> 
</MeshHeading> 
</MeshHeadingList> 
</MedlineCitation> 

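From this entry I want to extract PMID, DateRevised, PubDate, ArticleTitle, CommentsCorrectionsList and MeshHeadingList. If I remove the Affiliation element, which contains the non-ASCII characters, the error no longer occurs. How should I resolve this error?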

+0

Is your file actually saved as UTF-8? I suspect it isn't, but LibXML thinks it is, and it chokes when it hits "Université de Liège". –

+0

@XavierHolt Do you mean the `<?xml version="1.0" encoding="UTF-8"?>` at the beginning of the file? If so, the file has that line. I'm sorry if this is a silly question; I'm not from this field. – smandape

+1

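That's half of it. That declaration tells your XML parser which character encoding to expect. The other half is the encoding the file is actually saved in on disk. For example, if you save the file as UTF-8, the character "é" is represented by the two-byte sequence 0xC3 0xA9, but if you save it as Windows-1252, it is represented by the single byte 0xE9. If LibXML expects UTF-8 but encounters bytes that are not valid UTF-8, it throws an error. –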
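To make the byte-level difference in the comment above concrete, here is a minimal sketch using only the core Encode module. The helper names `hex_bytes` and `is_valid_utf8` are made up for this illustration and are not part of XML::LibXML or Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "\x{E9}";    # the character "é" (U+00E9)

# The same character serialised in two different encodings.
my $utf8_bytes   = encode('UTF-8',  $char);
my $cp1252_bytes = encode('cp1252', $char);

# Render a byte string as space-separated hex (hypothetical helper).
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# Return 1 if the byte string decodes cleanly as strict UTF-8, else 0
# (hypothetical helper; FB_CROAK makes decode() die on malformed input).
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

printf "UTF-8:  %s valid=%d\n", hex_bytes($utf8_bytes),   is_valid_utf8($utf8_bytes);
printf "cp1252: %s valid=%d\n", hex_bytes($cp1252_bytes), is_valid_utf8($cp1252_bytes);
```

The lone 0xE9 byte fails the strict UTF-8 check, which is the same kind of input that makes a parser expecting UTF-8 give up. Running a check like `is_valid_utf8` over the raw bytes of the XML file is one way to confirm whether the file really matches its declaration.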

Answers

4

You can either convert the file to the declared encoding (UTF-8), or declare the encoding the file actually uses (<?xml version="1.0" encoding="cp1252"?>).

Notepad can be used to convert to UTF-8, and so can Perl:

perl -pe"
    BEGIN {
        binmode STDIN, ':encoding(cp1252)';
        binmode STDOUT, ':encoding(UTF-8)';
    }
" < file.cp1252 > file.UTF-8

(You have to remove the line breaks, which I added for readability.)
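If a one-liner is awkward to type, the same conversion can be sketched as a small script. This is only an illustration under assumptions: the sub name `convert_to_utf8`, the script name `convert.pl`, and the file names are made up here, and cp1252 is assumed to be the file's real encoding.

```perl
use strict;
use warnings;

# Re-encode a file: decode from $from_enc while reading, encode as UTF-8
# while writing. Perl's I/O layers do the actual conversion.
sub convert_to_utf8 {
    my ($in_path, $out_path, $from_enc) = @_;

    open my $in,  "<:encoding($from_enc)", $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";

    print {$out} $_ while <$in>;

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Run only when both paths are given on the command line, e.g.:
#   perl convert.pl file.cp1252 file.UTF-8
convert_to_utf8($ARGV[0], $ARGV[1], 'cp1252') if @ARGV == 2;
```

After converting, the existing encoding="UTF-8" declaration in the XML matches the bytes on disk, so no change to the declaration should be needed.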

+2

[piconv](http://p3rl.org/piconv) ships with Perl. `piconv -f cp1252 -t UTF-8 < file.cp1252 > file.UTF-8` – daxim

+0

@daxim, Cool, I'd never heard of it. – ikegami