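Find and replace based on unknown characters

I have been stumped trying to find a way to do a position-based find and replace in Python. Basically, what I am looking to do is go into a document and replace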

<gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>

with

<gco:DateTime>2016-04-20T11:27:34</gco:DateTime>

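Everything after the decimal point must be removed. The catch is that this applies to multiple timestamps in an XML file, and every timestamp is different. I have read a little about regular expressions, and they seem like one possible approach. Any help would be greatly appreciated.

Edited example of the XML file format: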

<?xml version="1.0" encoding="utf-8"?> 
<?xml-stylesheet type='text/xsl' href='http://ngis/ngis/metadata/StyleSheet/xslt/nGIS_Metadata.xslt'?> 
<gmd:MD_Metadata xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:gmx="http://www.isotc211.org/2005/gmx" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:gfc="http://www.isotc211.org/2005/gfc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:gss="http://www.isotc211.org/2005/gss" xmlns:gsr="http://www.isotc211.org/2005/gsr" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:gmi="http://www.isotc211.org/2005/gmi" xmlns:gmd="http://www.isotc211.org/2005/gmd"> 
    <gmd:fileIdentifier> 
     <gco:CharacterString>BF244A7CB62491BC74B001BE5DEAA213AAFB9DBA</gco:CharacterString> 
    </gmd:fileIdentifier> 
    <gmd:language> 
     <gco:CharacterString>English</gco:CharacterString> 
       <gmd:date> 
       <gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime> 
       </gmd:date>

@Parfait


2016-06-07 MapZombie

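Regular expressions will solve this and other similar problems, and you should keep reading about them. In this specific case, parsing and formatting the date is also a good approach. –

I would further caution you against doing much XML processing without a library such as 'lxml' or 'ElementTree' that actually parses the document into a proper tree, although you may get away with it if all your transformations are as uncomplicated as this one. – holdenweb

It cannot be stressed enough (and it is perhaps the highest-voted SO answer): do not regex HTML/XML files (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). – Parfait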

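Consider XSLT, a special-purpose declarative language for transforming XML documents. It has a very handy function (shared with its sibling, XPath), substring-before(), which lets you extract the data that precedes the period in each timestamp. Python's lxml module can run XSLT 1.0 scripts.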
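For readers less familiar with XPath, here is a minimal Python sketch of what substring-before() computes. The helper name `substring_before` is hypothetical and only mimics the XPath 1.0 function; it is not part of lxml. One behaviour worth knowing: when the delimiter does not occur in the string, the XPath function returns an empty string, so a timestamp with no fractional seconds would come out empty.

```python
def substring_before(value, delimiter):
    """Mimic XPath 1.0 substring-before(): return the text preceding the
    first occurrence of delimiter, or '' if delimiter is not found."""
    if not delimiter:
        # XPath returns the empty string for an empty delimiter.
        return ""
    head, sep, _tail = value.partition(delimiter)
    # partition() leaves sep empty when the delimiter is missing.
    return head if sep else ""


print(substring_before("2016-04-20T11:27:34.8677919-06:00", "."))
# prints 2016-04-20T11:27:34
print(repr(substring_before("2016-04-20T11:27:34", ".")))
# prints ''
```

If some of your timestamps may lack a fractional part, it is worth checking how the transform treats them before running it over every file.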

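The script below parses both the XML and the XSLT files. Specifically, the XSLT runs the identity transform to copy the document as is, and then extracts the date and time from every <gco:DateTime> element. Note that only the gco namespace needs to be defined in the XSLT header:

XSLT script (save externally as an .xsl file to be referenced in Python)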

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" 
       xmlns:gco="http://www.isotc211.org/2005/gco"> 
<xsl:output version="1.0" encoding="UTF-8" indent="yes" /> 
<xsl:strip-space elements="*"/> 

    <!-- Identity Transform --> 
    <xsl:template match="@*|node()"> 
    <xsl:copy> 
     <xsl:apply-templates select="@*|node()"/> 
    </xsl:copy> 
    </xsl:template> 

    <xsl:template match="gco:DateTime"> 
    <xsl:copy> 
     <xsl:copy-of select="substring-before(., '.')"/>     
    </xsl:copy> 
    </xsl:template> 

</xsl:transform>

Python script

import lxml.etree as ET 

# LOAD XML AND XSL 
dom = ET.parse('Input.xml') 
xslt = ET.parse('XSLTScript.xsl') 

# TRANSFORM XML 
transform = ET.XSLT(xslt) 
newdom = transform(dom) 

# CONVERT TO STRING 
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True) 

# OUTPUT TREE TO FILE (binary mode: tostring() returns bytes when an encoding is given)
with open('Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)

Output

<?xml version="1.0"?> 
<?xml-stylesheet type='text/xsl' href='http://ngis/ngis/metadata/StyleSheet/xslt/nGIS_Metadata.xslt'?><gmd:MD_Metadata xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:gmx="http://www.isotc211.org/2005/gmx" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:gfc="http://www.isotc211.org/2005/gfc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:gss="http://www.isotc211.org/2005/gss" xmlns:gsr="http://www.isotc211.org/2005/gsr" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:gmi="http://www.isotc211.org/2005/gmi" xmlns:gmd="http://www.isotc211.org/2005/gmd"> 
    <gmd:fileIdentifier> 
    <gco:CharacterString>BF244A7CB62491BC74B001BE5DEAA213AAFB9DBA</gco:CharacterString> 
    </gmd:fileIdentifier> 
    <gmd:language> 
    <gco:CharacterString>English</gco:CharacterString> 
    <gmd:date> 
     <gco:DateTime>2016-04-20T11:27:34</gco:DateTime> 
    </gmd:date> 
    </gmd:language> 
</gmd:MD_Metadata>


2016-06-08 00:42:34 Parfait

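Thanks Parfait, this is great. Really appreciate it! – MapZombie

My files all start with <?xml version="1.0" encoding="utf-8"?> <?xml-stylesheet type='text/xsl' href='http://xxxxx.com'?> – MapZombie

Please post a snippet of the actual XML (with all of its headers, since you have a namespace 'gco' that should be defined). You should not start from the third line. – Parfait

One way: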

s = "<gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>" 
split_on_dot = s.split('.') 
split_on_angle = split_on_dot[1].split('<') 
new_s = "".join([split_on_dot[0], "<", split_on_angle[1]]) 

>>> new_s 
'<gco:DateTime>2016-04-20T11:27:34</gco:DateTime>' 
>>>

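This relies on that period being the only period in the input string. I am not good with regular expressions. I think they are overused, but I am sure someone will show you how to do this with a regex. Just remember that Python itself has very good string manipulation.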
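For comparison, here is a hedged sketch of the regex route mentioned above. It is applied to the timestamp text only, not to the surrounding XML markup. The pattern and the function name `truncate_timestamp` are assumptions made for illustration: they expect the fractional seconds, followed by an optional `Z` or `+hh:mm`/`-hh:mm` offset, to sit at the end of the value, as in the question's sample.

```python
import re

# A period, one or more digits, then an optional UTC offset, at end of string.
FRACTION_AND_OFFSET = re.compile(r"\.\d+(?:Z|[+-]\d{2}:\d{2})?$")


def truncate_timestamp(text):
    """Drop fractional seconds (and any trailing offset) from a timestamp."""
    return FRACTION_AND_OFFSET.sub("", text.strip())


print(truncate_timestamp("2016-04-20T11:27:34.8677919-06:00"))
# prints 2016-04-20T11:27:34
print(truncate_timestamp("2016-04-20T11:27:34"))
# prints 2016-04-20T11:27:34 (unchanged, nothing to strip)
```

Unlike the split-based version, a value with no period is returned unchanged instead of raising an IndexError.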


2016-06-07 22:46:16

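Thanks joel, I need this to be able to parse multiple unknown dates per file. Each file has about 6 date stamps in this format, and each one is formatted consistently, with only one period. – MapZombie

Then, fine, but heed @holdenweb's comment about XML parsing. Once you have the elements you want to change, my answer will take care of things. Stephen Holden introduced me to Python, he taught –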
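Putting the two comments together, a minimal sketch might look like the following: parse the XML into a tree, find every gco:DateTime element, and apply the split to the element text only. It uses the standard library's ElementTree rather than lxml; the function name `strip_fractional_seconds` and the shortened sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the gco declaration in the question's XML.
GCO = "http://www.isotc211.org/2005/gco"
DATETIME_TAG = "{%s}DateTime" % GCO


def strip_fractional_seconds(xml_text):
    """Parse xml_text and cut each gco:DateTime value at its first period."""
    root = ET.fromstring(xml_text)
    for elem in root.iter(DATETIME_TAG):
        if elem.text:
            # split() returns the whole string when there is no period.
            elem.text = elem.text.strip().split(".")[0]
    return root


sample = """<gmd:date xmlns:gmd="http://www.isotc211.org/2005/gmd"
          xmlns:gco="http://www.isotc211.org/2005/gco">
  <gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>
  <gco:DateTime>2016-05-01T08:00:00.123-06:00</gco:DateTime>
</gmd:date>"""

root = strip_fractional_seconds(sample)
cleaned = [elem.text for elem in root.iter(DATETIME_TAG)]
print(cleaned)
# prints ['2016-04-20T11:27:34', '2016-05-01T08:00:00']
```

When writing the tree back out, check the serialized result: the standard library may not keep details such as the xml-stylesheet processing instruction or the original namespace prefixes, which is one reason the lxml/XSLT answer may suit these files better.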
