Extracting the source from an HTML page
I am trying to read the source code of a website using the code below:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class GrabHTML {
    public static void Connect() throws Exception {
        // Set URL
        URL url = new URL("http://www.google.ca/");
        URLConnection spoof = url.openConnection();
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()))) {
            String strLine;
            // Loop through every line in the source
            while ((strLine = in.readLine()) != null) {
                // Print each line to the console
                System.out.println(strLine);
            }
        }
        System.out.println("End of page.");
    }

    public static void main(String[] args) {
        try {
            // Call the Connect method
            Connect();
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}
But it only reads part of the source code. When I use "View source" in my browser, Google.com returns much more data.
http://jsoup.org/ is worth mentioning here. – 2013-02-14 15:19:09
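This code works for me. I suspect the User-Agent property does not match your browser's, so the Google site serves slightly different content in each case. – 808sound 2013-02-14 16:00:27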
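One way to check the User-Agent theory is to send the same string your browser sends and compare how much text comes back. The following is only a minimal sketch, not a definitive fix: the class name `PageSource`, the helpers `readAll` and `fetch`, and the `BROWSER_USER_AGENT` value are all made up for illustration, and UTF-8 is an assumption about the page encoding. Replace the User-Agent string with the one your own browser reports.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class PageSource {
    // Hypothetical example value; copy the real string from your own browser.
    public static final String BROWSER_USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0";

    // Reads the whole stream into one String, decoding it as UTF-8 (an assumption).
    // Each line read is followed by '\n' in the result.
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Requests the page with the given User-Agent header and returns the body as text.
    public static String fetch(String address, String userAgent) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        try (InputStream in = connection.getInputStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        String html = fetch("http://www.google.ca/", BROWSER_USER_AGENT);
        System.out.println(html);
        // The character count makes it easy to compare against the browser's "View source".
        System.out.println("Characters received: " + html.length());
    }
}
```

If the character count is still far below what the browser shows, the User-Agent is probably not the only difference between the two requests.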