Error resolving Wikipedia URLs that contain Unicode characters with Java URL

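I can't fetch Wikipedia URLs that contain Unicode characters!

Given the title of a page, such as: 1992\u201393_UE_Lleida_seasonnow

As a plain URL: http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow

Using URLEncoder (set to UTF-8): http://en.wikipedia.org/wiki/1992%5Cu201393_UE_Lleida_seasonnow

When I try to resolve either URL, I get nothing. If I copy the URL into a browser, I get nothing either; the page loads only when I actually paste in the Unicode character itself.

Does Wikipedia have some odd way of encoding Unicode in URLs? Or am I just doing something stupid?

Here is the code I'm using: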

URL url = new URL("http://en.wikipedia.org/wiki/" + x);
System.out.println("trying " + url);

// Attempt to open the wiki page
InputStream is;
try {
    is = url.openStream();
} catch (Exception e) {
    return null;
}

Source

2011-06-04 nflacco

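The correct URI is http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season.

Many browsers display the literal characters instead of the percent-encoded escape sequences, because that is considered more user-friendly. A correctly encoded URI, however, must percent-encode any character that is not allowed in the path part: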

path          = path-abempty    ; begins with "/" or is empty
              / path-absolute   ; begins with "/" but not "//"
              / path-noscheme   ; begins with a non-colon segment
              / path-rootless   ; begins with a segment
              / path-empty      ; zero characters

path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>

segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded   = "%" HEXDIG HEXDIG
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="
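To make the grammar concrete, here is a minimal sketch of what "percent-encode anything that is not a pchar" could look like for a single path segment. The class and method names (PathSegmentEncoder, encodeSegment) are made up for illustration, and the sketch assumes the input is raw text: an existing "%" is encoded again as %25, and a "/" is treated as data rather than as a separator.

```java
import java.nio.charset.StandardCharsets;

public class PathSegmentEncoder {

    // Characters a segment may contain literally besides ALPHA and DIGIT:
    // the unreserved marks, the sub-delims, ":" and "@" (the pchar rule).
    private static final String PCHAR_EXTRA = "-._~!$&'()*+,;=:@";
    private static final String HEX = "0123456789ABCDEF";

    /** Percent-encodes every UTF-8 byte of the segment that is not a literal pchar. */
    static String encodeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean alnum = (v >= 'a' && v <= 'z')
                    || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9');
            if (alnum || (v < 0x80 && PCHAR_EXTRA.indexOf(v) >= 0)) {
                out.append((char) v);
            } else {
                // Everything else, including each byte of a non-ASCII character.
                out.append('%').append(HEX.charAt(v >> 4)).append(HEX.charAt(v & 0x0F));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u2013 is the en dash; its UTF-8 bytes are E2 80 93.
        System.out.println(encodeSegment("2009\u201310_UE_Lleida_season"));
        // prints 2009%E2%80%9310_UE_Lleida_season
    }
}
```

In practice a library class is usually the better choice; the sketch is only meant to show where the %E2%80%93 in the correct URI comes from.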

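The URI class can help you with these sequences:

Characters in the other category are permitted wherever RFC 2396 permits escaped octets, that is, in the user-information, path, query, and fragment components, as well as in the authority component if the authority is registry-based. This allows URIs to contain Unicode characters beyond those in the US-ASCII character set.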

String literal = "http://en.wikipedia.org/wiki/1992\u201393_UE_Lleida_seasonnow"; 
URI uri = new URI(literal); 
System.out.println(uri.toASCIIString());

You can read more about URI encoding here.
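The %5C in the URLEncoder output from the question is the encoding of a backslash, which suggests the title held the six characters \u2013 as plain text rather than a single en dash. The sketch below contrasts the two cases. The class and method names are invented for illustration, and it assumes the title reaches the program from outside (a file or an API response), where the compiler never gets a chance to translate the escape.

```java
import java.net.URI;

public class EscapeVersusCharacter {

    /** Returns the ASCII-only form of an IRI whose syntax is otherwise legal. */
    static String toAscii(String iri) {
        return URI.create(iri).toASCIIString();
    }

    /** Reports whether java.net.URI accepts the string at all. */
    static boolean isParsable(String candidate) {
        try {
            URI.create(candidate);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String base = "http://en.wikipedia.org/wiki/";
        // One character: the compiler turns \u2013 into an en dash.
        String realDash = "2009\u201310_UE_Lleida_season";
        // Six characters: a backslash followed by u, 2, 0, 1, 3.
        String escapedText = "2009\\u201310_UE_Lleida_season";

        System.out.println(toAscii(base + realDash));
        // A backslash is not legal in a URI, so this string is rejected.
        System.out.println(isParsable(base + escapedText));
    }
}
```

If the input really does contain the escape as text, it has to be decoded into the actual character before any URL encoding is applied.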

Source

2011-06-04 09:04:34 McDowell

Does Wikipedia have some odd way of encoding Unicode in URLs?

It's not odd; it's standard use of IRIs. The IRI:

http://en.wikipedia.org/wiki/2009–10_UE_Lleida_season

which includes a Unicode en dash, is equivalent to the URI:

http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season

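You can include the IRI form in a link and it will work in modern browsers. But many network libraries, including Java's and those in older browsers, require ASCII-only URIs. (Even if you link using the encoded URI version, modern browsers will still display the pretty IRI version in the address bar.)

To convert an IRI to a URI in general, you apply the IDN algorithm to the hostname and URL-encode any other non-ASCII characters as UTF-8 bytes. In your case, it should be: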

String urlencoded= URLEncoder.encode(x, "utf-8").replace("+", "%20"); 
URL url= new URL("http://en.wikipedia.org/wiki/"+urlencoded);

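Note: replacing + with %20 is necessary for values of x that contain spaces. URLEncoder performs application/x-www-form-urlencoded encoding, as used in query strings. In a path segment like this one, the rule that + means a space does not apply; spaces in a path must be encoded with normal URL-encoding, as %20.

Then again, in the specific case of Wikipedia, spaces are replaced with underscores for readability, so you would probably be better off replacing "+" with "_" directly. The %20 version will still work, because Wikipedia redirects from it to the underscore version.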
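As a rough sketch of that last suggestion, the helper below turns a human-readable title into a Wikipedia URL by swapping spaces for underscores before encoding, so no "+" is ever produced for a space. The names WikipediaUrls and pageUrl are hypothetical, and the base URL is simply the one used in the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class WikipediaUrls {

    /** Builds an ASCII-only URL for an English Wikipedia page title. */
    static String pageUrl(String title) {
        try {
            // Spaces become underscores first; URLEncoder leaves "_" untouched
            // and encodes non-ASCII characters as UTF-8 percent escapes.
            String encoded = URLEncoder.encode(title.replace(' ', '_'), "UTF-8");
            return "http://en.wikipedia.org/wiki/" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pageUrl("2009\u201310 UE Lleida season"));
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310_UE_Lleida_season
    }
}
```

Because URLEncoder targets form encoding, it also escapes characters such as ":" and "(" that a path would allow literally; that is harmless here, just more verbose than strictly necessary.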

Source

2011-06-04 11:39:06 bobince

Here is a simple algorithm for encoding URLs that contain Unicode so that they can be retrieved with the HttpURLConnection class:

import static org.junit.Assert.*; 

import java.net.URLEncoder; 

import org.apache.commons.lang.CharUtils; 
import org.junit.Test; 

public class InternationalURLEncoderTest { 

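    static String encodeUrl(String urlToEncode) {
        // Split on "/" while keeping each "/" as its own token.
        String[] pathSegments = urlToEncode.split("((?<=/)|(?=/))");
        StringBuilder encodedUrlBuilder = new StringBuilder();
        for (String pathSegment : pathSegments) {
            boolean needsEncoding = false;
            for (char ch : pathSegment.toCharArray()) {
                if (!CharUtils.isAscii(ch)) {
                    needsEncoding = true;
                    break;
                }
            }
            String encodedSegment = pathSegment;
            if (needsEncoding) {
                try {
                    // Name the charset explicitly: the one-argument overload is
                    // deprecated and uses the platform default encoding.
                    encodedSegment = URLEncoder.encode(pathSegment, "UTF-8");
                } catch (java.io.UnsupportedEncodingException e) {
                    throw new IllegalStateException(e); // UTF-8 is always available
                }
            }
            encodedUrlBuilder.append(encodedSegment);
        }
        return encodedUrlBuilder.toString();
    }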

    @Test 
    public void test() { 
     assertEquals(
       "http://www.chinatimes.com/realtimenews/%E5%8D%97%E6%8A%95%E4%B8%80%E8%90%AC%E5%A4%9A%E6%88%B6%E5%A4%A7%E5%81%9C%E9%9B%BB-%E4%B9%9D%E6%88%90%E4%BB%A5%E4%B8%8A%E6%81%A2%E5%BE%A9%E4%BE%9B%E9%9B%BB-20130603003259-260401", 
       encodeUrl("http://www.chinatimes.com/realtimenews/南投一萬多戶大停電-九成以上恢復供電-20130603003259-260401")); 
     assertEquals("http://www.ttv.com.tw/", 
       encodeUrl("http://www.ttv.com.tw/")); 
     assertEquals("http://www.ttv.com.tw", 
       encodeUrl("http://www.ttv.com.tw")); 
     assertEquals("http://www.rt-drive.com.tw/shopping/?st=16", 
       encodeUrl("http://www.rt-drive.com.tw/shopping/?st=16")); 
    } 

}

The algorithm builds on these answers about string splitting and detecting Unicode characters.

Source

2013-06-03 06:42:57 chi

Here is a simpler way to write the encodeUrl method from chi's answer:

static String encodeUrl(String urlToEncode) throws URISyntaxException { 
    return new URI(urlToEncode).toASCIIString(); 
}

See this answer for clarification.
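One caveat worth keeping in mind: the single-argument URI constructor rejects input that contains characters such as spaces, so the one-liner works only when the string is otherwise legal. The sketch below, with hypothetical names (UriComponentsExample, encode), uses the multi-argument constructor instead, which quotes illegal characters in the path before toASCIIString() handles the non-ASCII ones. It assumes the path is passed in raw form, since this constructor also quotes any "%" it finds.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriComponentsExample {

    /** Builds an ASCII-only URI string from separate, unencoded components. */
    static String encode(String scheme, String host, String path) {
        try {
            // URI(scheme, host, path, fragment) quotes illegal characters such
            // as spaces; toASCIIString() then percent-encodes the non-ASCII
            // characters as UTF-8.
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("http", "en.wikipedia.org", "/wiki/2009\u201310 UE Lleida season"));
        // prints http://en.wikipedia.org/wiki/2009%E2%80%9310%20UE%20Lleida%20season
    }
}
```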

Source

2013-08-21 02:19:18
