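As suggested in the comments, reconsider using regex directly on HTML/XML documents, since these are not regular languages. Instead, apply regex to the parsed text or attribute values, not to the document markup itself.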
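As a minimal sketch of "regex on parsed values", the snippet below parses the XML with the JDK's DOM parser first and only then runs a pattern against each element's text. The class name, the `IPV4_URL` pattern, and the sample XML in `main` are assumptions made for illustration; they are not from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class RegexOnParsedText {
    // Hypothetical pattern: an http URL whose host is a dotted-quad address.
    private static final Pattern IPV4_URL =
            Pattern.compile("http://\\d{1,3}(\\.\\d{1,3}){3}");

    /** Parses the XML first, then applies the regex only to element text. */
    public static List<String> matchingValues(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            List<String> hits = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().trim();
                // The regex sees plain text here, never tags or attributes.
                if (IPV4_URL.matcher(text).matches()) {
                    hits.add(text);
                }
            }
            return hits;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<ct><l>http://192.168.105.213</l><l>not a url</l><o></o></ct>";
        System.out.println(matchingValues(xml, "l"));
    }
}
```

The parser deals with nesting, escaping, and empty elements, so the pattern only has to describe the value format.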
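A great tool for manipulating XML is XSLT, a transformation language and sibling of XPath. Java ships with a built-in XSLT 1.0 processor and can also load external processors (Xalan, Saxon, etc.). Consider the following setup:
XSLT script (save it as the .xsl file used below; the script removes empty nodes)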
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
    <xsl:strip-space elements="*"/>
    <!-- Identity Transform to Copy Document as is -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- Empty Template to Remove Such Nodes -->
    <xsl:template match="*[.='']"/>
</xsl:transform>
Java code
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class XMLTransform {
    public static void main(String[] args) throws IOException, SAXException,
            ParserConfigurationException, TransformerException {
        // Paths to the input XML, the XSLT script, and the output file
        String inputXML = "path/to/Input.xml";
        String xslFile = "path/to/XSLT/Script.xsl";
        String outputXML = "path/to/Output.xml";

        // Load the XSLT script and parse the XML document
        Source xslt = new StreamSource(new File(xslFile));
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        // XSLT expects a namespace-aware tree (matters for the NAMESPACES example)
        docFactory.setNamespaceAware(true);
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        Document doc = docBuilder.parse(new File(inputXML));

        // XSLT transformation with pretty print
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(xslt);
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.STANDALONE, "yes");
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        // Indent width; this property is specific to Xalan-based processors
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

        // Write the transformed document to the output file
        DOMSource source = new DOMSource(doc);
        StreamResult result = new StreamResult(new File(outputXML));
        transformer.transform(source, result);
    }
}
Output
<ct>
    <c>http://192.168.105.213</c>
    <l>http://192.168.105.213</l>
    <l>http://192.168.105.213</l>
    <o>http://192.168.105.213</o>
</ct>
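To try the same idea without touching the file system, here is a rough in-memory sketch. It inlines the identity and empty-node templates as a string and transforms an XML string directly. The class and method names and the sample input in `main` are assumptions for illustration. `processorClassName()` simply reports which `TransformerFactory` implementation the JVM picked up, which is one way to tell whether the built-in processor or an external one on the classpath is in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RemoveEmptyNodesDemo {
    // Identity template plus an empty template for elements with no text.
    private static final String XSLT =
        "<xsl:transform xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
      + "<xsl:strip-space elements=\"*\"/>"
      + "<xsl:template match=\"@*|node()\">"
      + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"*[.='']\"/>"
      + "</xsl:transform>";

    /** Applies the stylesheet to an XML string and returns the result as a string. */
    public static String removeEmptyNodes(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    /** Name of the TransformerFactory implementation found at runtime. */
    public static String processorClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // Assumed sample input: one empty <o> element among populated ones.
        String xml = "<ct><c>http://192.168.105.213</c><o></o>"
                   + "<o>http://192.168.105.213</o></ct>";
        System.out.println(processorClassName());
        System.out.println(removeEmptyNodes(xml));
    }
}
```

Note that `*[.='']` tests the element's string value, so an element that carries attributes but no text would be dropped as well; tighten the predicate if that is not wanted.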
NAMESPACES
When working with namespaces, as in the XML below:
<prefix:ct xmlns:prefix="http://www.example.com">
    <c>http://192.168.105.213</c>
    <l>http://192.168.105.213</l>
    <o></o>
    <l>http://192.168.105.213</l>
    <o>http://192.168.105.213</o>
</prefix:ct>
Use the following XSLT, with the namespace declared in the header and an added template:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:prefix="http://www.example.com">
    <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
    <xsl:strip-space elements="*"/>
    <!-- Identity Transform -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- Retain Namespace Prefix (the match pattern must use the declared prefix) -->
    <xsl:template match="prefix:ct">
        <xsl:element name='prefix:{local-name()}' namespace='http://www.example.com'>
            <xsl:copy-of select="namespace::*"/>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:element>
    </xsl:template>
    <!-- Remove Empty Nodes -->
    <xsl:template match="*[.='']"/>
</xsl:transform>
Output
<prefix:ct xmlns:prefix="http://www.example.com">
    <c>http://192.168.105.213</c>
    <l>http://192.168.105.213</l>
    <l>http://192.168.105.213</l>
    <o>http://192.168.105.213</o>
</prefix:ct>
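One way to confirm that the namespace survived the transformation is to parse the result with a namespace-aware parser and inspect the root element. The sketch below does that in memory. As a simplification it uses only the identity and empty-node templates (`xsl:copy` copies an element together with its namespace), and the class name, method names, and sample input are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class NamespaceCheckDemo {
    // Simplified stylesheet: identity template plus empty-node removal only.
    private static final String XSLT =
        "<xsl:transform xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
      + "<xsl:strip-space elements=\"*\"/>"
      + "<xsl:template match=\"@*|node()\">"
      + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match=\"*[.='']\"/>"
      + "</xsl:transform>";

    /** Runs the stylesheet over an XML string. */
    public static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    /** Parses namespace-aware and returns "{namespaceURI}localName" of the root. */
    public static String rootQName(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, namespace URIs are not reported
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            return "{" + root.getNamespaceURI() + "}" + root.getLocalName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input modeled on the namespaced XML above.
        String xml = "<prefix:ct xmlns:prefix=\"http://www.example.com\">"
                   + "<c>http://192.168.105.213</c><o></o>"
                   + "<o>http://192.168.105.213</o></prefix:ct>";
        String result = transform(xml);
        System.out.println(result);
        System.out.println(rootQName(result));
    }
}
```

Comparing the namespace URI and local name, rather than the literal prefix, is the more robust check, since a prefix is only a shorthand for the URI.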
Please, do not use regex to parse XML. Never. See http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – vanje
@vanje I like this answer better: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags –
@Thomas: Yes, you're right. – vanje