2016-08-04 57 views

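I'm trying to fetch the content of some URLs with my Java code. The code returns the content for some URLs, for example "http://www.nytimes.com/video/world/europe/100000004503705/memorials-for-victims-of-istanbul-attack.html", but it returns nothing for others, for example "http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0". When I open that URL manually I can see the content, and even when I view the page source I don't notice any particular difference in page structure between the two. Still, I get nothing back for this URL. Why can my Java code fetch the content of only some URLs (web pages)?

Is this related to a permission issue, to the structure of the web page, or to my Java code?

Here is my code: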

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class TestJsoup {
    public static void main(String[] args) {
        System.out.println(getUrlParagraphs("http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0"));
    }

    public static String getUrlParagraphs(String url) {
        try {
            URL urlContent = new URL(url);
            BufferedReader in = new BufferedReader(new InputStreamReader(urlContent.openStream()));
            String line;
            StringBuffer html = new StringBuffer();
            while ((line = in.readLine()) != null) {
                html.append(line);
                System.out.println("Test");
            }
            in.close();
            System.out.println(html.toString());
            return html.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}

Answers

0

This is because the second URL responds with a redirect, and you are not trying to follow the redirect.

Try accessing it with curl -v:

$ curl -v 'http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0' 
* Hostname was NOT found in DNS cache 
* Trying 170.149.161.130... 
* Connected to www.nytimes.com (170.149.161.130) port 80 (#0) 
> GET /2016/07/24/travel/mozart-vienna.html?_r=0 HTTP/1.1 
> User-Agent: curl/7.35.0 
> Host: www.nytimes.com 
> Accept: */* 
> 
< HTTP/1.1 303 See Other 
* Server Varnish is not blacklisted 
< Server: Varnish 
< Location: http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2F2016%2F07%2F24%2Ftravel%2Fmozart-vienna.html%3F_r%3D1 
< Accept-Ranges: bytes 
< Date: Thu, 04 Aug 2016 08:45:53 GMT 
< Age: 0 
< X-API-Version: 5-0 
< X-PageType: article 
< Connection: close 
< X-Frame-Options: DENY 
< Set-Cookie: RMID=007f0101714857a300c1000d;Path=/; Domain=.nytimes.com;Expires=Fri, 04 Aug 2017 08:45:53 UTC 
< 
* Closing connection 0 

As you can see, there is no content: the response is a 3XX status code (303 See Other) and carries a Location: header.
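The same check can be done from Java. The following is only a minimal sketch, not the asker's code: the class name RedirectProbe and its helper methods are made up for illustration. It sends a single GET with automatic redirect handling switched off, so the status code and the Location header can be inspected much like in the curl output above. Note that HttpURLConnection normally follows same-protocol redirects on its own, which is why the sketch disables that explicitly.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class, introduced only for this illustration.
class RedirectProbe {

    /** Returns true for any 3xx status code. */
    static boolean isRedirect(int status) {
        return status >= 300 && status < 400;
    }

    /** Sends one GET without following redirects and summarizes the response. */
    static String probe(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // Report the redirect instead of silently following it.
        conn.setInstanceFollowRedirects(false);
        try {
            int status = conn.getResponseCode();
            // getHeaderField returns null when the header is absent.
            String location = conn.getHeaderField("Location");
            return isRedirect(status)
                    ? status + " redirect to " + location
                    : status + " (no redirect)";
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // URL taken from the question; what the site answers today may differ.
        System.out.println(probe("http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0"));
    }
}
```

If the printed status is in the 3xx range, the Location value is the address the server wants the client to request next, which here would be the login page rather than the article.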


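Thanks Andy! You're right! It is a redirected URL, and when I open the redirect target in a browser I have to enter a username and password before I can see the page. I know how to get the redirect URL in my Java code, but I don't know how to get past the "username, password" step and fetch the content. Do you have any idea? Can I simply add my username and password to the redirect link?! – Simone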
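One related point that can be sketched without knowing the site's login flow: the 303 response in the curl output sets a cookie (RMID), and a plain URL.openStream() call does not keep cookies between requests unless a cookie handler is installed. The sketch below is only an illustration under assumptions: the class name CookieAwareFetch is invented, UTF-8 is assumed as the page encoding, and whether keeping cookies is enough to reach the article is not verified here. It does not perform any username/password login.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, introduced only for this illustration.
class CookieAwareFetch {

    /** Installs a JVM-wide in-memory cookie store that HttpURLConnection will use. */
    static void enableCookies() {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    }

    /** Reads the whole response body as text (UTF-8 is an assumption). */
    static String fetch(String url) throws IOException {
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        enableCookies(); // cookies set during redirects are now sent back on later requests
        System.out.println(fetch("http://www.nytimes.com/2016/07/24/travel/mozart-vienna.html?_r=0"));
    }
}
```

If the page still cannot be read with cookies enabled, the site most likely requires an authenticated session, and how to obtain one depends on its login mechanism, which this thread does not describe.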

0

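Hello, the problem is with your URL. I tried your code on my machine and it also returned null, but after reading the Oracle documentation a bit I found that the problem is the host, so if you change the URL (for example, to the link of this post) it works fine. Here is my code: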

package sd.nctr.majid;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class Program {

    public static void main(String[] args) {
        System.out.println(getUrlParagraphs("http://stackoverflow.com/questions/4328711/read-url-to-string-in-few-lines-of-java-code"));
    }

    public static String getUrlParagraphs(String url) {
        try {
            URL urlContent = new URL(url);
            BufferedReader in = new BufferedReader(new InputStreamReader(urlContent.openStream()));
            String line;
            StringBuffer html = new StringBuffer();
            while ((line = in.readLine()) != null) {
                html.append(line);
                System.out.println("Test");
            }
            in.close();
            System.out.println(html.toString());
            return html.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}