Extracting source from an HTML page

I am trying to read the source code of a website using the code below:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class GrabHTML {

    public static void Connect() throws Exception {

        // Set URL
        URL url = new URL("http://www.google.ca/");
        URLConnection spoof = url.openConnection();

        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()))) {
            String strLine;

            // Loop through every line in the source
            while ((strLine = in.readLine()) != null) {
                // Print each line to the console
                System.out.println(strLine);
            }
        }

        System.out.println("End of page.");
    }

    public static void main(String[] args) {
        try {
            // Call the Connect method
            Connect();
        } catch (Exception e) {
            // Report failures instead of silently swallowing them
            e.printStackTrace();
        }
    }
}

But it reads only part of the source code. When I use "View Source" in my browser, Google.com has a lot more data.


2013-02-14 Binish John

http://jsoup.org/ is worth mentioning here. – 2013-02-14 15:19:09

This code works for me. I suspect the User-Agent property does not match your browser's, so Google serves slightly different content in each case. – 808sound 2013-02-14 16:00:27

Remove the following statement:

spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
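
To see whether that header really changes what the server returns, one option is to fetch the same URL twice, once with the spoofed User-Agent and once without it, and compare the sizes. The following is only a minimal sketch under that assumption: the class name `PageFetcher` and the helper `fetch` are hypothetical names introduced for illustration and are not part of the original post.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

public class PageFetcher {

    // Hypothetical helper: returns the response body as text, with '\n' after each line read.
    // If userAgent is null, no User-Agent property is set and the connection uses its default.
    public static String fetch(URL url, String userAgent) throws IOException {
        URLConnection conn = url.openConnection();
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        StringBuilder body = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java PageFetcher <url>");
            return;
        }
        URL url = URI.create(args[0]).toURL();
        String oldIe = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)";
        String withSpoof = fetch(url, oldIe);
        String withoutSpoof = fetch(url, null);
        System.out.println("With the MSIE 5.5 User-Agent: " + withSpoof.length() + " characters");
        System.out.println("Without a custom User-Agent:  " + withoutSpoof.length() + " characters");
    }
}
```

Running it as `java PageFetcher http://www.google.ca/` prints both lengths. If they differ, the server is tailoring its response to the User-Agent, which would explain why the program and the browser's "View Source" show different amounts of data.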


2013-02-14 15:57:20
