2010-01-28

Problem loading remote HTML source with JDOM

I want to load the source of a remote HTML file (a Blogger profile page) using JDOM. I have this code:

public Document getDoc(URL url) throws JDOMException, IOException {
    SAXBuilder saxBuilder = new SAXBuilder();
    // Disable validation and DTD loading so the parser does not fetch the remote DTD.
    saxBuilder.setFeature("http://xml.org/sax/features/validation", false);
    saxBuilder.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
    saxBuilder.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    saxBuilder.setValidation(false);
    Document doc = saxBuilder.build(url.openStream());
    return doc;
}

When I try to run it like this:

public static void main(String[] args) throws BadLocationException, JDOMException, IOException {
    linkExtractor(new URL("http://www.blogger.com/profile/07059093309718767384"));
}

I get this exception:

run: 
Exception in thread "main" org.jdom.input.JDOMParseException: Error on line 1: The entity name must immediately follow the '&' in the entity reference. 
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468) 
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:770) 
    at tc.Crawler.linkExtractor(Crawler.java:60) 
    at tc.Crawler.main(Crawler.java:44) 
Caused by: org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference. 
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:195) 
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:174) 
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:388) 
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1414) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1838) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3024) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648) 
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510) 
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:807) 
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737) 
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107) 
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205) 
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522) 
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453) 
    ... 3 more 
Caused by: org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference. 
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:195) 
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:174) 
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:388) 
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1414) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1838) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3024) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648) 
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140) 
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:510) 
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:807) 
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737) 
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107) 
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205) 
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522) 
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:453) 
    at org.jdom.input.SAXBuilder.build(SAXBuilder.java:770) 
    at tc.Crawler.linkExtractor(Crawler.java:60) 
    at tc.Crawler.main(Crawler.java:44) 

Note that I had to add these lines:

saxBuilder.setFeature("http://xml.org/sax/features/validation", false);
saxBuilder.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
saxBuilder.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
saxBuilder.setValidation(false);

because at first I got a 503 error when loading the URL http://www.w3.org/TR/html4/strict.dtd.
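
As a side note, another way to keep a parser from downloading the DTD is to install an EntityResolver that answers with an empty source. Below is a minimal sketch using the JDK's built-in DOM parser rather than JDOM. The class name SkipDtdDemo and the sample document are made up for illustration, and whether this suits the JDOM setup above is an assumption (SAXBuilder also has a setEntityResolver method).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original question.
class SkipDtdDemo {

    /** Parses the given XML without ever opening the DTD URL named in its DOCTYPE. */
    static Document parseWithoutDtd(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Answer every external entity request (including the DOCTYPE's DTD)
        // with an empty source instead of fetching the system id over the network.
        builder.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\" "
                + "\"http://www.w3.org/TR/html4/strict.dtd\">"
                + "<html><body><p>hello</p></body></html>";
        Document doc = parseWithoutDtd(xml);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

This only avoids the DTD download. It does not make markup that is not well-formed XML parseable.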

Thanks.

Answer

Parsing HTML with an XML parser is not the best approach. Consider using something like NekoHTML first.
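
To illustrate why an XML parser rejects ordinary HTML, here is a minimal sketch that uses only the JDK's built-in parser (neither JDOM nor NekoHTML). The class name BareAmpersandDemo and the sample markup are made up for illustration. HTML tolerates a bare '&', while XML requires it to be written as &amp;, so the parser should reject the first string and accept the second.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original answer.
class BareAmpersandDemo {

    /** Returns null if the markup is well-formed XML, otherwise the parser's error message. */
    static String parseError(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Acceptable to browsers, but not well-formed XML: '&' is followed by a space.
        System.out.println("bare:    " + parseError("<p>Tom & Jerry</p>"));
        // The same text with the ampersand escaped as XML requires.
        System.out.println("escaped: " + parseError("<p>Tom &amp; Jerry</p>"));
    }
}
```

Real pages contain many such constructs (unescaped ampersands in scripts and URLs, unclosed tags), which is why an HTML-aware parser is the more robust choice.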

Thanks for the suggestion and the link. – miguelrios 2010-01-29 18:37:47