尋找關於Jeff的幻燈片

中提出的「組varint編碼/解碼」的更多細節我注意到，在Jeff的幻燈片「構建大規模信息檢索系統的挑戰」中，也可以在這裏下載：http://research.google.com/people/jeff/WSDM09-keynote.pdf，整數方法稱爲「組織varint編碼」的壓縮被提及。據說比每字節整數編碼7位（2倍多）要快得多。我對此非常感興趣，希望能夠實現這一點，或者有更多的細節可以幫助我自己來實現。尋找關於Jeff的幻燈片

我不是一個親和新來的，歡迎任何幫助！

來源

2010-06-06 Mickey Shine

這是指「可變整數編碼」，其中用於存儲整數的位數在串行化時不固定在4個字節。有一個很好的描述varint in the protocol buffer documentation。

它用於編碼Google's protocol buffers，您可以瀏覽protocol buffer source code。

的CodedOutputStream包含精確編碼功能WriteVarint32FallbackToArrayInline：

inline uint8* CodedOutputStream::WriteVarint32FallbackToArrayInline(
    uint32 value, uint8* target) { 
    target[0] = static_cast<uint8>(value | 0x80); 
    if (value >= (1 << 7)) { 
    target[1] = static_cast<uint8>((value >> 7) | 0x80); 
    if (value >= (1 << 14)) { 
     target[2] = static_cast<uint8>((value >> 14) | 0x80); 
     if (value >= (1 << 21)) { 
     target[3] = static_cast<uint8>((value >> 21) | 0x80); 
     if (value >= (1 << 28)) { 
      target[4] = static_cast<uint8>(value >> 28); 
      return target + 5; 
     } else { 
      target[3] &= 0x7F; 
      return target + 4; 
     } 
     } else { 
     target[2] &= 0x7F; 
     return target + 3; 
     } 
    } else { 
     target[1] &= 0x7F; 
     return target + 2; 
    } 
    } else { 
    target[0] &= 0x7F; 
    return target + 1; 
    } 
}

級聯if旨意只添加附加字節到target數組的末尾如果value權證那些額外的字節大小。 0x80掩碼正在寫入的字節，並且value向下移位。從我所知道的，0x7f掩碼使它表示「編碼的最後一個字節」。（當OR' 0x80，最高位總是1，那麼最後一個字節會清除最高位（通過AND'ing 0x7f）。因此，在讀取varints時，您必須先閱讀，直到獲得最高位爲零的字節。

我才意識到你問「集團VarInt編碼」特別對不起，該代碼是關於基本VarInt編碼（仍快於7位），其基本思想看起來是相似的，不幸的是，這是不什麼是用來在協議緩衝區中存儲64位數字我不會感到驚訝，如果代碼是開源的地方

使用從varint和「組varint「，它不應該太難以自己製作:)

這是描述Group VarInt compression的另一個頁面，其中包含解碼代碼。不幸的是，他們暗示公開可用的實現，但他們不提供參考。

void DecodeGroupVarInt(const byte* compressed, int size, uint32_t* uncompressed) { 
    const uint32_t MASK[4] = { 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF }; 
    const byte* limit = compressed + size; 
    uint32_t current_value = 0; 
    while (compressed != limit) { 
    const uint32_t selector = *compressed++; 
    const uint32_t selector1 = (selector & 3); 
    current_value += *((uint32_t*)(compressed)) & MASK[selector1]; 
    *uncompressed++ = current_value; 
    compressed += selector1 + 1; 
    const uint32_t selector2 = ((selector >> 2) & 3); 
    current_value += *((uint32_t*)(compressed)) & MASK[selector2]; 
    *uncompressed++ = current_value; 
    compressed += selector2 + 1; 
    const uint32_t selector3 = ((selector >> 4) & 3); 
    current_value += *((uint32_t*)(compressed)) & MASK[selector3]; 
    *uncompressed++ = current_value; 
    compressed += selector3 + 1; 
    const uint32_t selector4 = (selector >> 6); 
    current_value += *((uint32_t*)(compressed)) & MASK[selector4]; 
    *uncompressed++ = current_value; 
    compressed += selector4 + 1; 
    } 
}

來源

2010-06-06 03:06:13 Stephen

非常感謝，期待您的更新 – 2010-06-06 03:10:32

更新。順便說一句，在演示中找到好的，我會稍後再檢查一下。:) – Stephen 2010-06-06 03:20:10

你很快，我會盡快檢查:) – 2010-06-06 03:33:05

我一直在尋找同樣的東西，發現在Java中，這GitHub的項目： https://github.com/stuhood/gvi/ 看起來有前途的！

來源

2011-11-07 12:23:44 nobre

非常感謝〜 – 2012-01-10 14:54:13

有組+2元素時有點小錯誤，我在問題部分發布了修復。否則，它非常快速和高效！即使在少量位上，我通常也會經歷65-70％的壓縮（我的數據是連續的，並且增量編碼僅存儲每個數字之間的差異，對於組var int完美的輸入） – nobre 2012-01-11 11:39:44

而是與位掩碼進行解碼，在C/C++，你可以使用對應於第一個字節的值預定義的結構..使用這個完整的例子：http://www.oschina.net/code/snippet_12_5083

來源

2012-01-02 11:38:46 Simon

代碼有點長我得看看，謝謝你的回覆 – 2012-01-11 03:13:55

另一個Java實現groupvarint：https://github.com/catenamatteo/groupvarint 但我懷疑非常大的交換機在Java中有一些缺點

來源

2014-09-22 10:56:16 Matteo

尋找關於Jeff的幻燈片

回答

相關問題