2011-08-26 88 views
2

I am using HtmlUnit 2.9 (the stable release published this month). Any idea why the very simple code below does not work in HtmlUnit?

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;

public class Main {

    public static void main(String[] args) {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
        webClient.setCssEnabled(true);
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setThrowExceptionOnFailingStatusCode(false);
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setRedirectEnabled(false);
        webClient.setAppletEnabled(false);
        webClient.setJavaScriptEnabled(false);
        webClient.setPopupBlockerEnabled(true);
        webClient.setTimeout(60000); // milliseconds
        webClient.setPrintContentOnFailingStatusCode(false);

        System.out.println("This is printed on screen");
        try {
            // This call never returns for the URL below.
            webClient.getPage("http://www.2cash.info/index.php");
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("This is NEVER printed on screen");
    }
}

I have also added the jstack output. Note that I have marked a section that keeps repeating:

2011-08-26 03:15:45 
Full thread dump Java HotSpot(TM) Server VM (20.1-b02 mixed mode): 

"Attach Listener" daemon prio=10 tid=0x09520400 nid=0x5363 waiting on condition [0x00000000] 
    java.lang.Thread.State: RUNNABLE 

"JS executor for [email protected]" daemon prio=10 tid=0x6feb7400 nid=0x5356 waiting on condition [0x6fcfe000] 
    java.lang.Thread.State: TIMED_WAITING (sleeping) 
    at java.lang.Thread.sleep(Native Method) 
    at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutor.run(JavaScriptExecutor.java:166) 
    at java.lang.Thread.run(Thread.java:662) 

"Low Memory Detector" daemon prio=10 tid=0x70204c00 nid=0x5352 runnable [0x00000000] 
    java.lang.Thread.State: RUNNABLE 

"C2 CompilerThread1" daemon prio=10 tid=0x70202800 nid=0x5351 runnable [0x00000000] 
    java.lang.Thread.State: RUNNABLE 

"C2 CompilerThread0" daemon prio=10 tid=0x70200800 nid=0x5350 waiting on condition [0x00000000] 
    java.lang.Thread.State: RUNNABLE 

"Signal Dispatcher" daemon prio=10 tid=0x09514c00 nid=0x534f runnable [0x00000000] 
    java.lang.Thread.State: RUNNABLE 

"Finalizer" daemon prio=10 tid=0x09503400 nid=0x534e in Object.wait() [0x70798000] 
    java.lang.Thread.State: WAITING (on object monitor) 
    at java.lang.Object.wait(Native Method) 
    - waiting on <0x76af2ff0> (a java.lang.ref.ReferenceQueue$Lock) 
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) 
    - locked <0x76af2ff0> (a java.lang.ref.ReferenceQueue$Lock) 
    at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) 
    at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) 

"Reference Handler" daemon prio=10 tid=0x09501c00 nid=0x534d in Object.wait() [0x707e9000] 
    java.lang.Thread.State: WAITING (on object monitor) 
    at java.lang.Object.wait(Native Method) 
    - waiting on <0x7675cc58> (a java.lang.ref.Reference$Lock) 
    at java.lang.Object.wait(Object.java:485) 
    at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) 
    - locked <0x7675cc58> (a java.lang.ref.Reference$Lock) 

"main" prio=10 tid=0x09482400 nid=0x5349 runnable [0xb6c34000] 
    java.lang.Thread.State: RUNNABLE 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.getSlot(ScriptableObject.java:2603) 
    at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.defineProperty(ScriptableObject.java:1699) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.configureConstantsPropertiesAndFunctions(JavaScriptEngine.java:350) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.configureClass(JavaScriptEngine.java:330) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.init(JavaScriptEngine.java:199) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.access$000(JavaScriptEngine.java:79) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$1.run(JavaScriptEngine.java:146) 
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537) 
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538) 
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.initialize(JavaScriptEngine.java:157) 
    at com.gargoylesoftware.htmlunit.WebClient.initialize(WebClient.java:1141) 
    at com.gargoylesoftware.htmlunit.WebWindowImpl.setEnclosedPage(WebWindowImpl.java:109) 
    at com.gargoylesoftware.htmlunit.html.FrameWindow.setEnclosedPage(FrameWindow.java:102) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:200) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179) 
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221) 
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106) 
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433) 
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311) 
    at com.gargoylesoftware.htmlunit.html.BaseFrame.<init>(BaseFrame.java:73) 
    at com.gargoylesoftware.htmlunit.html.HtmlInlineFrame.<init>(HtmlInlineFrame.java:46) 
    at com.gargoylesoftware.htmlunit.html.DefaultElementFactory.createElementNS(DefaultElementFactory.java:288) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.startElement(HTMLParser.java:506) 
    at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source) 
    at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1136) 
    at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:742) 
    at org.cyberneko.html.filters.DefaultFilter.startElement(DefaultFilter.java:136) 
    at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:278) 
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2652) 
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2022) 
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:908) 
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499) 
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452) 
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:789) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:225) 
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179) 
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221) 
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106) 
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433) 
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311) 

    <THIS_SECTION_IS_PRINTED_AS_IF_IT_WERE_IN_A_LOOP> 
    at com.gargoylesoftware.htmlunit.html.BaseFrame.loadInnerPageIfPossible(BaseFrame.java:149) 
    at com.gargoylesoftware.htmlunit.html.BaseFrame.loadInnerPage(BaseFrame.java:99) 
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadFrames(HtmlPage.java:1760) 
    at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:194) 
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:440) 
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311) 
    </THIS_SECTION_IS_PRINTED_AS_IF_IT_WERE_IN_A_LOOP> 

    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311) 
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373) 
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358) 
    at main.Main.<init>(Main.java:42) 
    at main.Main.main(Main.java:23) 

"VM Thread" prio=10 tid=0x094fe000 nid=0x534c runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x09489800 nid=0x534a runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0948ac00 nid=0x534b runnable 

"VM Periodic Task Thread" prio=10 tid=0x70207000 nid=0x5353 waiting on condition 

JNI global references: 1234 

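I think there is some kind of loop related to the automatic loading of frames. If that is the case, is there any way to disable this behaviour and break the loop?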
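To make the suspected loop concrete, here is a minimal simulation in plain Java. It is not HtmlUnit code: the `site` map, the page names and the `load` method are all hypothetical, and it assumes the hang really is caused by frames that (directly or indirectly) embed the page that contains them. It shows how remembering the URLs that are currently being loaded is enough to stop such a recursion.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Hypothetical simulation: a "site" maps each page URL to the iframe URLs it embeds.
final class FrameLoopSketch {

    /**
     * Loads a page and, recursively, its frames.
     * Returns the number of pages loaded; a URL that is already on the
     * in-progress stack is skipped, which breaks any cycle.
     */
    static int load(Map<String, List<String>> site, String url, Deque<String> inProgress) {
        if (inProgress.contains(url)) {
            return 0; // cycle detected: do not descend again
        }
        inProgress.push(url);
        int loaded = 1;
        for (String frame : site.getOrDefault(url, List.of())) {
            loaded += load(site, frame, inProgress);
        }
        inProgress.pop();
        return loaded;
    }

    public static void main(String[] args) {
        // index.php embeds a.php, which embeds index.php again.
        Map<String, List<String>> site = Map.of(
                "index.php", List.of("a.php"),
                "a.php", List.of("index.php"));
        System.out.println(load(site, "index.php", new ArrayDeque<>())); // prints 2
    }
}
```

Without the `inProgress` check, the same input would recurse until the stack overflows, which is the kind of repeated `getPage` / `loadFrames` pattern marked in the thread dump.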

Thanks in advance!

+0

Are you using Java 7? If so, could you try it with Java 6? –

+0

It is Java 6: $ java -version java version "1.6.0_26" Java(TM) SE Runtime Environment (build 1.6.0_26-b03) Java HotSpot(TM) Server VM (build 20.1-b02, mixed mode) –

Answers

2

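Well, although this is a horrible solution (a workaround, really...), I finally decided to disable the automatic loading of frames in HtmlUnit, after consulting the HtmlUnit developers. Here is what I did, in detail: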

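  1. Download the HtmlUnit sources from here
  2. Download Maven
  3. Comment out the contents of the loadFrames() method of the HtmlPage class (the body of the method, not its declaration), located in htmlunit-2.9/src/main/java/com/gargoylesoftware/htmlunit/html
  4. Compile this custom code, skipping the tests, with: mvn -Dmaven.test.skip=true clean compile package
  5. Take the htmlunit-2.9.jar found in htmlunit-2.9/artifacts and replace your current htmlunit-2.9.jar library file with it
  6. This step is probably the most delicate one, because it depends on each application. Still, I will show you the changes I needed to make to mine.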

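You know what my original code looked like (see the question). It downloads every frame and iframe in the page. Here is an example of how I now get a page that has frames while loading only the frame I want: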

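try {
    // With the patched library, the outer page loads without fetching its frames.
    HtmlPage page = webClient.getPage("http://www.w3schools.com/HTML/tryit.asp?filename=tryhtml_noframes");
    // Pick the one iframe we care about.
    HtmlInlineFrame frame = page.getFirstByXPath("//iframe[@name='view']");
    // Fetch that frame's content explicitly, resolving its src against the page URL.
    page = webClient.getPage(page.getFullyQualifiedUrl(frame.getSrcAttribute()));
    System.out.println(page.asXml());
} catch (Exception e) {
    e.printStackTrace();
}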

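After this change to the library, the content of the frames will be empty once the getPage() method completes. Note that it will not be null; it just looks like an empty frame is returned. What we have to do is manually download the content of the frames we are interested in, which is why I call getPage() a second time.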
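The second getPage() call needs an absolute URL, because an iframe's src attribute is often relative. As a rough illustration of that resolution step, here is a plain-JDK sketch using `java.net.URI`. The class name, method name and example URLs are made up for the illustration, and this is only an approximation: HtmlUnit's own `getFullyQualifiedUrl` may take more into account (for example a `<base>` element), which this sketch ignores.

```java
import java.net.URI;

// Hypothetical helper: resolves a frame's src attribute against the URL of the page containing it.
final class FrameUrlSketch {

    static URI resolve(String pageUrl, String frameSrc) {
        // URI.resolve handles relative paths, root-relative paths and absolute URLs.
        return URI.create(pageUrl).resolve(frameSrc);
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/HTML/tryit.asp?filename=x";
        System.out.println(resolve(page, "view.asp?id=1"));          // relative to the page's directory
        System.out.println(resolve(page, "/frames/a.html"));         // relative to the site root
        System.out.println(resolve(page, "http://other.example/x")); // already absolute
    }
}
```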

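Well, that is how I managed to download frames and iframes selectively with HtmlUnit. Any ideas on how to improve this would be appreciated. In any case, I hope some way to disable frame loading will be added to HtmlUnit itself in the future, perhaps through methods such as getPage(URL url, boolean downloadFrames).

Hope this helps someone out there!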

+0

Update: this workaround also seems to work with HtmlUnit 2.10, 2.11 and 2.12 –

2

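When I open this site in my browser, it never finishes loading the page. That could also be why HtmlUnit crashes. Tested with Chrome and FF.

Try loading a simpler site; that should tell you whether the crash depends on the site.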

+0

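I have only tested it on FF 3.6. It hangs my computer while loading; however, since JavaScript is disabled in my HtmlUnit configuration, I disabled it in the browser too, and then the site loads. Given a known web page, I need to be able to follow any of its links without human involvement –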

+0

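I am running NoScript (so JavaScript is not enabled) and the site loads forever... It does not really hang, but the page load never ends; I stopped it after 30 s of loading... –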

+0

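I noticed that the page finishes loading in FF after 6.62 seconds and performs about 700 HTTP requests. HtmlUnit should be able to handle this, but it does not. I only need it to return the XML of the main page, without the IFRAMEs, or even to throw an exception, time out, or whatever. Anything but the current behaviour: hanging the Java process, eating my CPU and melting the hardware :) I think a method like webClient.getPageWithoutFrames(URL) would be the solution –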
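For the "time out or whatever" fallback, one option that needs no change to HtmlUnit is to run the load on a worker thread and stop waiting after a deadline. The following is a minimal sketch using only `java.util.concurrent`; `TimeLimitSketch` and `callWithTimeout` are hypothetical names, and the stuck page load is simulated with `Thread.sleep`. One important caveat: cancelling only interrupts the worker thread, so a CPU-bound loop that never checks for interruption may keep running in the background. The worker is made a daemon thread so that at least it cannot keep the JVM alive.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeLimitSketch {

    /** Runs the task on a daemon worker thread; throws TimeoutException if it has no result in time. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "page-loader");
            t.setDaemon(true); // a stuck load must not prevent the JVM from exiting
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            future.cancel(true);    // interrupts the worker if it is still running
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        String result = callWithTimeout(() -> "loaded", 1000);
        System.out.println(result);
        try {
            // Simulates a load that takes far longer than we are willing to wait.
            callWithTimeout(() -> { Thread.sleep(5000); return "never"; }, 100);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

In the question's program, the task passed in would be the `webClient.getPage(...)` call, so the main thread would reach its final `println` even if the load never finishes.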