Introduction

Parquet does not always store a column’s values directly. When a column repeats many values, dictionary encoding stores the distinct values once and represents the column as a stream of small integer indices into that dictionary. Those indices are bit-packed, and reading the column requires unpacking them back into an int[] before any value can be looked up.

This explainer walks through that representation and the unpacking step. It is the background companion to Unpacking Parquet: Explicit SIMD, Scalar Baselines, and What HotSpot Makes of Them, which benchmarks and analyzes the unpacking step in depth.

Dictionary encoding

A Parquet data page does not always contain the original column values directly.

Consider a column of city names:

Berlin
Rome
Berlin
Paris
Rome
Berlin

With dictionary encoding, Parquet stores the distinct values once and then refers to them by index:

0 → Berlin
1 → Rome
2 → Paris

Instead of repeating the city names, the encoded value stream contains their dictionary indices:

0, 1, 0, 2, 1, 0

The original values may be strings, large integers, or any other supported type. The indices are just compact identifiers that point into the dictionary.

Bit-packing the indices

The indices need only as many bits as the dictionary size requires. Because the dictionary above has three entries, two bits are enough for each index. The same mechanism scales: if a dictionary holds up to 8192 entries, its indices run from 0 to 8191 and need 13 bits each.

2^13 = 8192

Those indices are packed one after another, without aligning each one to a byte boundary:

index 0       index 1       index 2
13 bits       13 bits       13 bits
───────────── ───────────── ─────────────

The first index may start at bit 0, the second at bit 13, the third at bit 26, and so on. This packing is what makes the representation compact, and it is also what makes reading it back non-trivial: a value rarely starts on a byte boundary, and it may straddle two bytes.

The encoded-to-decoded boundary

Reading the page runs the process in reverse:

bit-packed dictionary indices
bit unpacking
int[] of dictionary indices
dictionary lookup
original column values

The first transition is the one this background is about:

bit-packed dictionary indices
bit unpacking
int[] of dictionary indices

Each packed index has to be located, extracted, widened to a Java integer, and written into an output array, before the dictionary lookup can reconstruct the original values. This operation is called bit unpacking.

From packed bits to integers

Reconstructing the int[] of indices reverses the packing: a compact bit stream has to become an array of Java integers again.

To decode one value, a program generally has to:

  1. determine where the value starts in the bit stream;
  2. load enough bytes to contain it;
  3. shift away the bits that precede it;
  4. apply a mask to retain only the relevant bits;
  5. store the result as an integer.

A simplified scalar decoder looks like this:

long bitPosition = 0;
for (int i = 0; i < count; i++) {
    long bytePosition = bitPosition >>> 3;
    int bitOffset = (int) (bitPosition & 7);
    long word = source.get(LONG_LE, bytePosition);
    output[i] = (int) ((word >>> bitOffset) & mask);
    bitPosition += bitWidth;
}

The decoder processes one value at a time. At a width of 13 bits, the bit positions advance by a fixed increment, so the pattern is perfectly regular; but it is not a constant-stride, element-by-element array operation, and that is the distinction a compiler’s auto-vectorizer cares about.

Where this fits

Bit unpacking is a small, hot kernel: a Parquet reader runs it for every dictionary-encoded page before it can look anything up. How fast it runs, and how much an explicit SIMD implementation helps over ordinary scalar Java, depends on the exact shape of that loop and on what the JVM’s compiler makes of it. That is the subject of Unpacking Parquet: Explicit SIMD, Scalar Baselines, and What HotSpot Makes of Them.