Web scraping encoding problem - negative value in a byte

I am using the following code to fetch a web page.

CloseableHttpClient httpclient = HttpClients.createDefault(); 
HttpGet httpget = new HttpGet(url); 
CloseableHttpResponse response = httpclient.execute(httpget); 
HttpEntity entity = response.getEntity(); 
System.out.println(entity.getContentType()); 
//output: Content-Type: text/html; charset=ISO-8859-1

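I found that the character "’" has the byte value -110, which does not map to a valid character in either ISO-8859-1 or UTF-8.

I tried opening the page manually, copying the text and saving it as a text file, and then the byte value was actually 39. I think the OS converted the character when it passed through the clipboard.

All I want is to save the web page to local disk exactly as it is.

I wrote some simple code to save the content to disk. It reads bytes and writes bytes directly. When I open the saved file in a hex editor, I can see that the byte value is 146 (-110).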

InputStream in = entity.getContent();
FileOutputStream fos = new FileOutputStream(new File("D:/test.html"));

byte[] buffer = new byte[1024];
int len;
// read() returns -1 at end of stream; 0 is not a reliable end marker
while ((len = in.read(buffer)) != -1) {
    fos.write(buffer, 0, len);  // the same buffer can be reused on every pass
}
in.close();
fos.close();
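For saving the response unchanged, a shorter variant is possible with java.nio. The following is only a sketch: the class name RawCopy and the helper save are made up for illustration, and a ByteArrayInputStream stands in for entity.getContent() so that the example runs without HttpClient.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class RawCopy {
    // Hypothetical helper: copies the stream to the target byte for byte.
    // No charset is involved, so byte 146 stays byte 146.
    static long save(InputStream in, Path target) {
        try (InputStream src = in) {
            return Files.copy(src, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for entity.getContent(): "it", byte 0x92, "s"
        byte[] body = { 'i', 't', (byte) 0x92, 's' };
        Path target = Files.createTempFile("page", ".html");
        save(new ByteArrayInputStream(body), target);

        byte[] onDisk = Files.readAllBytes(target);
        System.out.println(onDisk[2] & 0xFF); // unsigned view of the third byte
        Files.delete(target);
    }
}
```

Because nothing is decoded, the file on disk has the same bytes the server sent; the question of which character byte 146 represents only arises when the bytes are later turned into text.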

So now the question becomes how to reconstruct the character from byte 146 (-110). I will keep trying and post an update if I find anything.
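One way to reconstruct the character from byte 146 is to decode it with windows-1252 rather than ISO-8859-1. The sketch below assumes the page was really written in windows-1252 despite the ISO-8859-1 header, and that the JVM ships the windows-1252 charset (it is not one of the charsets Java guarantees). The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeByte146 {
    // Decodes a single raw byte using the given charset.
    static String decode(byte raw, Charset charset) {
        return new String(new byte[] { raw }, charset);
    }

    public static void main(String[] args) {
        byte raw = (byte) 146; // same bit pattern as -110 (0x92)
        Charset cp1252 = Charset.forName("windows-1252");

        String asCp1252 = decode(raw, cp1252);
        String asLatin1 = decode(raw, StandardCharsets.ISO_8859_1);

        // windows-1252 maps 0x92 to U+2019 (right single quotation mark);
        // ISO-8859-1 maps it to U+0092, a C1 control character.
        System.out.printf("windows-1252: U+%04X%n", (int) asCp1252.charAt(0));
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) asLatin1.charAt(0));
    }
}
```

Strictly speaking the byte is valid in ISO-8859-1, but it decodes to an invisible control character there, which is why the quote seems to have no mapping.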


2014-09-06 David

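Can you provide the code that shows the problem with "’"? And, if it is different, the code you use to save the web page to disk. [mcve](http://stackoverflow.com/help/mcve) – NiematojakTomasz 2014-09-06 19:10:48

Maybe you could post some code showing how you save the page to disk? Have you checked the value of ’? It looks like the character ’ is 3 bytes long, unless my copy and paste went wrong. Check this out: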

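import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharCheck {
    public static void main(String[] args) {
        char c = '’';
        System.out.println("character: " + c);
        System.out.println("int: " + (int) c);
        String s = "’";
        // Java strings are UTF-16 internally; getBytes() without an argument
        // uses the platform default charset, so name the charset explicitly.
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes: " + Arrays.toString(bytes));
    }
}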

Edit: I found the following suggested way to handle the charset, which may be worth a try:

ContentType contentType = ContentType.getOrDefault(entity);
Charset charset = contentType.getCharset();
Reader reader = new InputStreamReader(entity.getContent(), charset);

Source: https://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
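Note that decoding with the declared ISO-8859-1 charset would turn byte 146 into a control character rather than ’. Browsers treat a declared ISO-8859-1 as windows-1252, and a reader can do the same. The sketch below is only an illustration of that idea: effectiveCharset and readAll are hypothetical helpers, falling back to windows-1252 when no charset is declared is an assumption, and a ByteArrayInputStream stands in for entity.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LenientCharsetReader {
    // Hypothetical helper: treat a declared ISO-8859-1 (or a missing charset)
    // as windows-1252. The null fallback is an assumption of this sketch.
    static Charset effectiveCharset(Charset declared) {
        if (declared == null || StandardCharsets.ISO_8859_1.equals(declared)) {
            return Charset.forName("windows-1252");
        }
        return declared;
    }

    // Reads the whole stream as text using the effective charset.
    static String readAll(InputStream in, Charset declared) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, effectiveCharset(declared))) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the response body: "it", byte 0x92, "s"
        byte[] body = { 'i', 't', (byte) 0x92, 's' };
        String text = readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("it\u2019s"));
    }
}
```

With HttpClient, the value passed as declared would be the result of contentType.getCharset() from the snippet above.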


2014-09-06 17:56:26 MirMasej

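A byte in Java is a signed type with values from -128 to 127. The most significant bit indicates the sign. For example, 0111 1111 == 127 and 1000 0000 == -128.

I looked up your character (’) in the ANSI (windows-1252) table and found that its value is 146, which is of course greater than 127. The binary representation is 1001 0010, so interpreting it as a signed value yields -110.

To reproduce what you are seeing: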

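String s = "’";   // ’ is character 146 in the ANSI (windows-1252) code page
// getBytes() alone uses the platform default charset, so name it explicitly
byte[] bytes = s.getBytes(Charset.forName("windows-1252"));
System.out.println((int) bytes[0]); // prints -110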

To convert the byte value to its unsigned representation:

char c = (char)(bytes[0] & 0xFF); 
System.out.println((int)c);   // prints 146
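As a cross-check on the arithmetic, the signed and unsigned views of the same bit pattern can be printed side by side. This is a small sketch with made-up names (SignedByteDemo, toUnsigned); Byte.toUnsignedInt needs Java 8 or later.

```java
class SignedByteDemo {
    // Masks off the sign extension so the byte reads as 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0x92; // bit pattern 1001 0010

        System.out.println(b);                                // prints -110
        System.out.println(toUnsigned(b));                    // prints 146
        System.out.println(Byte.toUnsignedInt(b));            // prints 146
        System.out.println(Integer.toBinaryString(b & 0xFF)); // prints 10010010
    }
}
```

The two values differ by exactly 256 (146 - 256 = -110), which is what two's complement gives for an 8-bit value with the top bit set.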


2014-09-06 18:52:04 trooper
