Unpacking Parquet: Where Explicit SIMD Actually Matters

Introduction

On the JVM, optimizing a hot kernel is not only about writing faster code: it is about controlling how much the result depends on the compiler recognizing the code’s shape. Using Parquet bit-unpacking as a concrete case, our experiment shows how a SIMD speedup depends on the scalar baseline C2 is handed, when explicit vectorization is actually justified, and why a more specialized scalar routine is not necessarily faster.

The question explored is:

When does an explicitly vectorized bit-unpacking routine provide a real advantage over ordinary Java code optimized by HotSpot?

This is not only a question about Parquet decoding. It is also a question about how SIMD optimizations on the JVM should be evaluated.

A loop written as ordinary scalar Java does not necessarily stay scalar. HotSpot may recognize its structure and emit SIMD instructions, or leave it scalar when the transformation is harder to prove safe.

The scalar baseline is therefore not a neutral reference point. It is part of the result.

Let’s start from the very beginning. If you are already familiar with the material, feel free to skip ahead.

From Parquet dictionary encoding to bit unpacking

Dictionary indices

A Parquet data page does not always contain the original column values directly.

Consider a column of city names:

Berlin
Rome
Berlin
Paris
Rome
Berlin

With dictionary encoding, Parquet stores the distinct values once and then refers to them by index:

0 → Berlin
1 → Rome
2 → Paris

Instead of repeating the city names, the encoded value stream contains their dictionary indices:

0, 1, 0, 2, 1, 0
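The mapping from values to indices can be sketched in a few lines of Java. This is a minimal illustration, not Parquet Java's writer: the class and method names are hypothetical, and assigning indices in first-seen order is an assumption chosen to match the example above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DictionaryEncodingSketch {

    /**
     * Replaces each value with a dictionary index.
     * Distinct values are appended to dictionaryOut in first-seen order
     * (an assumption made for this illustration).
     */
    static int[] encode(List<String> values, List<String> dictionaryOut) {
        Map<String, Integer> indexByValue = new HashMap<>();
        int[] indices = new int[values.size()];
        for (int i = 0; i < indices.length; i++) {
            String value = values.get(i);
            Integer index = indexByValue.get(value);
            if (index == null) {
                index = indexByValue.size();   // next free dictionary slot
                indexByValue.put(value, index);
                dictionaryOut.add(value);
            }
            indices[i] = index;
        }
        return indices;
    }
}
```

Encoding the six city names with this sketch gives the dictionary Berlin, Rome, Paris and the index stream 0, 1, 0, 2, 1, 0.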

Because this dictionary has three entries, two bits are enough for each index. The same mechanism scales: if a dictionary holds up to 8192 entries, its indices run from 0 to 8191 and need 13 bits each.

2^13 = 8192
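The width calculation can be written down directly. The following is a small sketch with a hypothetical helper name; it only restates the arithmetic above: the width is the number of bits needed for the largest index, which is the dictionary size minus one.

```java
final class IndexWidthSketch {

    /** Bits needed to represent every index in [0, dictionarySize - 1]. */
    static int bitsForDictionary(int dictionarySize) {
        if (dictionarySize <= 0) {
            throw new IllegalArgumentException("dictionary must not be empty");
        }
        int maxIndex = dictionarySize - 1;
        // Position of the highest set bit in maxIndex, counted from 1.
        return 32 - Integer.numberOfLeadingZeros(maxIndex);
    }
}
```

For 3 entries the result is 2, for 8192 entries it is 13, and one more entry (8193) pushes the width to 14.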

Those indices are packed one after another, without aligning each one to a byte boundary:

index 0       index 1       index 2
13 bits       13 bits       13 bits
───────────── ───────────── ─────────────

The first index may start at bit 0, the second at bit 13, the third at bit 26, and so on. The original values may be strings, large integers, or any other supported type: the 13-bit values here are not the column’s values, but the compact identifiers that point into the dictionary.
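To make the layout concrete, here is a deliberately slow, bit-by-bit packer. It is a sketch under stated assumptions rather than the encoder used by Parquet Java: the names are hypothetical, and it assumes values are written starting from the least significant bit of each byte, which is the order the little-endian decoder shown later in this article expects.

```java
final class BitPackingSketch {

    /** Packs values LSB-first at a fixed bit width, with no byte alignment between values. */
    static byte[] pack(int[] values, int bitWidth) {
        long totalBits = (long) values.length * bitWidth;
        byte[] packed = new byte[(int) ((totalBits + 7) / 8)];
        long bitPosition = 0;
        for (int value : values) {
            // Copy the low bitWidth bits of the value, one bit at a time.
            for (int bit = 0; bit < bitWidth; bit++, bitPosition++) {
                if (((value >>> bit) & 1) != 0) {
                    int byteIndex = (int) (bitPosition >>> 3);
                    int bitInByte = (int) (bitPosition & 7);
                    packed[byteIndex] |= (byte) (1 << bitInByte);
                }
            }
        }
        return packed;
    }
}
```

Packing the six city indices 0, 1, 0, 2, 1, 0 at two bits each produces two bytes, 0x84 and 0x01. Packing three 13-bit values occupies 39 bits, so five bytes, and the second and third values both start in the middle of a byte.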

The encoded-to-decoded boundary

Reading the page runs the process in reverse:

bit-packed dictionary indices
            ↓
bit unpacking
            ↓
int[] of dictionary indices
            ↓
dictionary lookup
            ↓
original column values

This article isolates the first transition:

bit-packed dictionary indices
            ↓
bit unpacking
            ↓
int[] of dictionary indices

Each packed index has to be located, extracted, widened to a Java integer, and written into an output array, before the dictionary lookup can reconstruct the original values. This operation is called bit unpacking.

From packed bits to integers

Reconstructing the int[] of indices reverses the packing: a compact bit stream has to become an array of Java integers again.

To decode one value, a program generally has to:

determine where the value starts in the bit stream;
load enough bytes to contain it;
shift away the bits that precede it;
apply a mask to retain only the relevant bits;
store the result as an integer.

A simplified scalar decoder looks like this:

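long bitPosition = 0;
for (int i = 0; i < count; i++) {
    // Locate the value: the byte holding its first bit, and that bit's offset in the byte.
    long bytePosition = bitPosition >>> 3;
    int bitOffset = (int) (bitPosition & 7);
    // Load 64 bits starting at that byte (LONG_LE: a little-endian long layout).
    long word = source.get(LONG_LE, bytePosition);
    // Shift away the preceding bits, then keep bitWidth bits (mask = (1L << bitWidth) - 1).
    output[i] = (int) ((word >>> bitOffset) & mask);
    bitPosition += bitWidth;
}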

The decoder processes one value at a time.

At a width of 13 bits, the bit positions advance by a fixed increment, so the pattern is perfectly regular. It is not, however, a constant-stride, element-by-element array operation: the byte position advances by one or two bytes from value to value, and the bit offset changes each time. That is the distinction the compiler cares about.
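For readers who want to run the scalar decoder, the following is a self-contained variant. It is a sketch, not the benchmark code: it reads from a byte[] through a little-endian ByteBuffer instead of a MemorySegment, the class and method names are hypothetical, and it assumes the caller has padded the input so that an 8-byte load at the last value's position stays in bounds.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ScalarUnpackSketch {

    /**
     * Decodes count values of bitWidth bits (1..32) from an LSB-first packed stream.
     * paddedSource must extend at least 8 bytes past the start of the last value.
     */
    static int[] unpack(byte[] paddedSource, int count, int bitWidth) {
        ByteBuffer source = ByteBuffer.wrap(paddedSource).order(ByteOrder.LITTLE_ENDIAN);
        int[] output = new int[count];
        long mask = (1L << bitWidth) - 1;
        long bitPosition = 0;
        for (int i = 0; i < count; i++) {
            int bytePosition = (int) (bitPosition >>> 3);   // where the value starts
            int bitOffset = (int) (bitPosition & 7);        // bits to discard below it
            long word = source.getLong(bytePosition);       // absolute, unaligned read
            output[i] = (int) ((word >>> bitOffset) & mask);
            bitPosition += bitWidth;
        }
        return output;
    }
}
```

Given the two bytes 0x84 and 0x01 followed by zero padding, `unpack(bytes, 6, 2)` returns 0, 1, 0, 2, 1, 0. The loop body is identical for every width; only `bitWidth` and `mask` change.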

How Java reaches SIMD

Scalar execution and SIMD

A scalar loop processes one value after another:

value 0 → shift → mask → store
value 1 → shift → mask → store
value 2 → shift → mask → store
...

Modern processors can also apply one operation to several values at once. This is SIMD: Single Instruction, Multiple Data.

For example, on the machine used for this experiment, a 512-bit vector register can hold sixteen 32-bit integer lanes:

┌────┬────┬────┬────┬────┬────┬────┬────┐
│ v0 │ v1 │ v2 │ v3 │ .. │v13 │v14 │v15 │
└────┴────┴────┴────┴────┴────┴────┴────┘

A vector shift can operate on all those lanes. The same applies to masking and other arithmetic operations.

Java has two different ways to reach such instructions.

Two ways Java reaches vector instructions

The first route is auto-vectorization.

HotSpot contains an optimizing just-in-time compiler called C2. When a method becomes sufficiently hot, C2 compiles its bytecode into optimized machine code.

One of its optimization passes, called SuperWord, looks for isomorphic, independent scalar operations that can be packed into SIMD operations, provided their memory dependencies and access patterns can be proven safe.

Ordinary scalar-looking Java code can therefore become vector machine code.

For example:

for (int i = 0; i < count; i++) {
    output[i] = source[i] & 0xFF;
}

The programmer has written a scalar loop. Each source-level iteration processes one byte.

C2 may nevertheless recognize that the iterations are independent and access memory with a regular stride, and emit machine code that processes several elements at once:

ordinary scalar Java loop
        ↓
C2 and SuperWord analyze the loop
        ↓
SIMD machine instructions, if recognition succeeds

The second route is the Vector API.

With the Vector API, the programmer expresses the vector computation directly. On supported hardware, and for operations the target can map efficiently, HotSpot lowers it to the matching SIMD instructions; otherwise the API stays functional but may fall back to a less efficient form:

Vector API operations
        ↓
HotSpot lowers the explicit vector computation
        ↓
SIMD instructions when supported efficiently

The Vector API did not introduce SIMD to Java. HotSpot had already been capable of generating vector instructions from suitable scalar loops for many years.

The difference is where the vector structure comes from. With auto-vectorization, C2 must infer that structure from scalar code. With the Vector API, the structure is explicit in the program.

This distinction leads to the central question:

When can C2 recover an efficient vector form automatically, and when does that form need to be expressed explicitly through the Vector API?

Why loop shape matters

C2 does not vectorize every loop. It must be able to prove that the transformation is safe and recognize a pattern that it knows how to pack into vector operations.

A byte-aligned conversion is comparatively simple:

for (int i = 0; i < count; i++) {
    output[i] = source[i] & 0xFF;
}

Every iteration reads the next byte, applies the same mask, and writes the next integer:

source[0] → output[0]
source[1] → output[1]
source[2] → output[2]

The memory accesses have a fixed stride. The iterations are independent. The operation is identical for every element.

This is the kind of loop SuperWord can recognize.

Bit unpacking at 13 bits has a different shape.

Successive values begin at changing positions:

value 0 starts at bit 0
value 1 starts at bit 13
value 2 starts at bit 26
value 3 starts at bit 39

The scalar decoder calculates a different byte position and bit offset for every value:

long bytePosition = bitPosition >>> 3;
int bitOffset = (int) (bitPosition & 7);
long word = source.get(LONG_LE, bytePosition);
output[i] = (int) ((word >>> bitOffset) & mask);
bitPosition += bitWidth;

A programmer can see that several values could still be extracted in parallel.

C2, however, sees:

loads from positions calculated at runtime;
shifts whose offsets vary between iterations;
values that may cross byte boundaries;
relationships between neighboring bytes that are not expressed as direct array operations.

On the tested system, SuperWord does not reconstruct a SIMD bit-unpacker from this scalar form.
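The varying positions are easy to tabulate. The sketch below, with hypothetical names, computes the byte position and bit offset the scalar decoder derives for each value, and prints them for the first nine values at width 13.

```java
final class ExtractionOffsetsSketch {

    /** Byte that holds the first bit of value i when values are bitWidth bits wide. */
    static long bytePosition(long i, int bitWidth) {
        return (i * bitWidth) >>> 3;
    }

    /** Offset (0..7) of that first bit inside its byte. */
    static int bitOffset(long i, int bitWidth) {
        return (int) ((i * bitWidth) & 7);
    }

    public static void main(String[] args) {
        int bitWidth = 13;
        for (int i = 0; i < 9; i++) {
            System.out.printf("value %d: byte %d, bit offset %d%n",
                    i, bytePosition(i, bitWidth), bitOffset(i, bitWidth));
        }
    }
}
```

At width 13 the byte positions run 0, 1, 3, 4, 6, 8, 9, 11, 13 and the bit offsets run 0, 5, 2, 7, 4, 1, 6, 3, 0. The offsets repeat every eight values, which is the regularity a programmer can exploit, but neither sequence is the constant stride that the byte-aligned loop presents.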

The explicit Vector API kernel avoids that recognition problem.

The explicit vector kernel

The vector implementation processes a block of packed values in several stages:

load packed bytes
        ↓
rearrange bytes into vector lanes
        ↓
shift each lane by its required offset
        ↓
apply the bit mask
        ↓
store decoded integers

In simplified Vector API form:

ByteVector.fromMemorySegment(
            BYTE_SPECIES,
            source,
            baseByte,
            LITTLE_ENDIAN)
        .rearrange(shuffle)
        .reinterpretAsInts()
        .lanewise(LSHR, shiftVector)
        .lanewise(AND, maskVector)
        .intoArray(output, outputOffset);

Three values depend on the bit width:

the byte shuffle;
the shift applied to each lane;
the mask.

They are computed for the selected width and reused.

The sequence of operations does not otherwise change for widths up to 25 bits:

load → rearrange → shift → mask → store

At 5, 9, 13, 17, or 25 bits, the parameters change, but the kernel still performs the same classes of operation.

Its computational shape is therefore largely independent of the width.

That is the structural reason the vector cost stays nearly flat across widths, which the measurements confirm later.
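One plausible way to derive the three width-dependent parameters is sketched below. It is an assumption-laden illustration, not the benchmark's kernel: the names are hypothetical, it assumes sixteen 32-bit lanes, LSB-first packing, and widths up to 25, and it emulates the rearrange, shift, and mask stages with plain Java loops so that it runs without the incubating Vector API module. The real kernel may lay out its shuffle differently.

```java
final class VectorParametersSketch {
    static final int LANES = 16;          // sixteen 32-bit lanes in a 512-bit vector
    static final int BYTES_PER_LANE = 4;

    /** Byte shuffle: for each destination byte, the source byte it is copied from. */
    static int[] shuffleFor(int bitWidth) {
        int[] shuffle = new int[LANES * BYTES_PER_LANE];
        for (int lane = 0; lane < LANES; lane++) {
            int firstByte = (lane * bitWidth) >>> 3;   // byte where this lane's value starts
            for (int b = 0; b < BYTES_PER_LANE; b++) {
                shuffle[lane * BYTES_PER_LANE + b] = firstByte + b;
            }
        }
        return shuffle;
    }

    /** Per-lane right shift: the value's bit offset inside its first byte. */
    static int[] shiftsFor(int bitWidth) {
        int[] shifts = new int[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            shifts[lane] = (lane * bitWidth) & 7;
        }
        return shifts;
    }

    /** Mask that keeps the low bitWidth bits of a lane. */
    static int maskFor(int bitWidth) {
        return (1 << bitWidth) - 1;
    }

    /** Scalar emulation of load, rearrange, shift, mask, store for one block of 16 values. */
    static void unpackBlock(byte[] source, int baseByte, int bitWidth,
                            int[] output, int outputOffset) {
        int[] shuffle = shuffleFor(bitWidth);
        int[] shifts = shiftsFor(bitWidth);
        int mask = maskFor(bitWidth);
        for (int lane = 0; lane < LANES; lane++) {
            int word = 0;
            // "Rearrange": gather this lane's four bytes, little-endian.
            for (int b = 0; b < BYTES_PER_LANE; b++) {
                int sourceByte = source[baseByte + shuffle[lane * BYTES_PER_LANE + b]] & 0xFF;
                word |= sourceByte << (8 * b);
            }
            output[outputOffset + lane] = (word >>> shifts[lane]) & mask;
        }
    }
}
```

Changing the width changes only the contents of the three parameter arrays; the loop in `unpackBlock` is the same at width 5 and at width 25, which mirrors the width-independent shape described above.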

Why the baseline matters

Because C2 can auto-vectorize some ordinary Java loops, the term “scalar baseline” is ambiguous.

Two loops may both be written as scalar source code while producing very different machine code:

scalar Java source
        ↓
C2 recognizes the shape
        ↓
SIMD machine code

or:

scalar Java source
        ↓
C2 does not recognize the shape
        ↓
scalar machine code

Comparing the Vector API only with the second case makes explicit SIMD appear overwhelmingly superior.

Comparing it with the first case shows how much advantage remains when the compiler already reaches SIMD automatically.

For this reason, the experiment uses several baselines.

Positioning the experiment

Relation to existing work

Vectorized integer unpacking is established work.

Lemire and Boytsov demonstrated that blocks of compressed integers can be decoded at very high throughput using explicit SIMD implementations. Their work helped establish a family of techniques based on fixed-width blocks, width-specific decoding strategies, vector rearrangement, shifts, and masks.

PARQUET-2159 brought this line of optimization directly into Parquet Java. Its benchmarks compared the existing Parquet unpacking implementation with routines written using the Java Vector API. The resulting work became the basis of Parquet Java’s optional vector implementation.

What this experiment measures differently

This article does not introduce a new unpacking technique. It instead examines how the measured advantage of explicit vectorization changes with the scalar baseline against which it is evaluated.

Prior work mainly establishes that explicit SIMD unpacking can outperform scalar implementations. Here, the baseline is treated as part of the result. The comparison includes regular scalar forms for which compiler optimization materially affects the measurements, and irregular forms for which SuperWord provides no measurable benefit. This reveals how much of the apparent SIMD advantage comes from the explicit kernel and how much comes from the scalar loop against which it is compared.

The experiment also examines Parquet Java’s specialized scalar routines separately from its vectorized path. On this host and JDK, specialization does not improve scalar throughput monotonically: at some wide, irregular widths, the specialized routines retire more instructions than the generic decoder.

The contribution is therefore not a faster unpacker.

It is a measurement of the interaction between baseline selection, scalar specialization, HotSpot compiler recognition, and explicit vector execution.

Experimental setup

Benchmark environment

The measurements come from a controlled, single-threaded microbenchmark on one machine:

AMD Ryzen 9 7950X3D
Zen 4
x86-64 Linux, little-endian
AVX-512 with VBMI/VBMI2
OpenJDK 25.0.3
Vector API, tenth incubator
512-bit preferred vector species

This configuration satisfies the platform and CPU-feature checks applied by ParquetReadRouter in the Parquet Java version examined here.

The benchmark measures in-cache decoding kernels.

Scope and limitations

It does not measure the complete cost of reading a Parquet file.

A real Parquet scan may also include:

storage access;
page reads;
decompression;
the dictionary lookup (gather into the dictionary page);
null handling;
filtering;
memory allocation;
scheduling;
downstream execution;
memory-bandwidth contention.

In a complete query, these costs may dominate bit unpacking.

The purpose of the microbenchmark is not to predict total query speed. It is to isolate the encoded-to-decoded boundary and make its computational shape measurable. The benchmark stops once it has produced the int[] of decoded indices; it does not include the subsequent dictionary lookup or the materialization of the original column values.

The results are also specific to one processor and one JDK.

A different machine or compiler version could change:

the absolute throughput;
the widths C2 can auto-vectorize;
the effectiveness of the specialized scalar routines;
the magnitude or nature of the width-26 transition;
the behavior of SuperWord on the widening loop.

The role of MemorySegment

Before comparing baselines, one possible confound must be removed. The experiment reads its input through a MemorySegment.

A MemorySegment represents a spatially bounded region of memory, with a lifetime determined by its scope. It can refer to heap or off-heap memory and is part of the Foreign Function & Memory API finalized in JDK 22. Native segments can be managed through an explicit arena; heap-backed segments stay tied to their underlying Java object.

It would be easy to assume that MemorySegment is responsible for a large part of the measured speedup.

It is not.

In the benchmark, a scalar decoder reading from a heap byte[] and an equivalent decoder reading from an off-heap MemorySegment produced almost identical throughput. Both processed approximately 2,000 values per microsecond.

This result is unsurprising. The two implementations use the same extraction algorithm, perform the same bit arithmetic, and operate on data that fits in cache. Only the source of the load changes.

MemorySegment is therefore not the performance result; its role is architectural. It supplies a bounded memory abstraction from which the Vector API can load directly, without requiring an intermediate Java array.

It is the substrate of the vector implementation, not its explanation.

Input padding and boundary handling

The decoders use fixed-width loads that may cross the logical end of the encoded input when processing its final values. The scalar decoders load one 64-bit word per value, while the main vector kernel loads a 64-byte ByteVector per block; values left over after the last full block are handled by a scalar fallback.

To keep these loads within the allocated memory region without introducing a special end-of-buffer branch into the measured kernels, the benchmark appends 64 zero-filled bytes after the encoded data. The padding is sized for the widest load, the vector block.

The padding is therefore not an optimization available only to one side of the comparison; it keeps bounds handling out of the measured kernels entirely. All decoders operate under the same padded-input contract, so the measured advantage cannot be attributed to a scalar- or vector-specific boundary-handling path.
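The padded-input contract can be illustrated with a short sketch. The helper names are hypothetical and the code uses a heap array for simplicity; it shows only why an unpadded buffer would be overrun by the final fixed-width load and why 64 extra bytes are enough.

```java
import java.util.Arrays;

final class PaddedInputSketch {
    static final int PADDING_BYTES = 64;   // size of the widest load, the 64-byte vector block

    /** Returns a copy of the encoded bytes followed by 64 zero-filled bytes. */
    static byte[] withPadding(byte[] encoded) {
        return Arrays.copyOf(encoded, encoded.length + PADDING_BYTES);
    }

    /** True if a load of loadBytes bytes starting at bytePosition stays inside the buffer. */
    static boolean loadFits(byte[] buffer, long bytePosition, int loadBytes) {
        return bytePosition >= 0 && bytePosition + loadBytes <= buffer.length;
    }
}
```

Six values at 13 bits occupy 78 bits, so 10 bytes. The last value starts at bit 65, in byte 8, and an 8-byte load from there would read bytes 8 through 15. That load does not fit in the 10-byte buffer but does fit in the padded 74-byte one, as does a 64-byte block load starting at any byte of the encoded data.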

Experimental provenance

The reported throughput measurements come from one publication-grade run using:

3 forks
5 × 3-second warmup iterations
10 × 1-second measurement iterations

The processor governor was fixed to performance. The hardware counters (retired instructions per decoded value) were collected separately, on the same host, with shorter perfnorm passes:

1 fork
2 × 1-second warmup iterations
3 × 1-second measurement iterations

These counter passes supply the instruction-count metrics, not the throughput numbers.

The benchmark source, raw results, and verification scripts are available in the accompanying repository. The publication run, which regenerates the report, can be reproduced with:

./run_lab.sh --publication

Interpreting the hardware counters

Throughput is the primary performance result, but it can vary with processor frequency, core placement, and other run-time conditions. Hardware counters help explain whether a throughput difference corresponds to a different machine-code shape rather than to measurement noise alone.

For this reason, the analysis also considers retired instructions per decoded value: retired machine instructions for the measured kernel divided by the number of decoded values, not a count of source-level Vector API operations. This metric is especially useful when its pattern remains stable across runs and bit widths: it shows how much dynamic instruction work each implementation executes for every value decoded.

The vector kernel retires far fewer instructions per decoded value than the generic scalar decoder: approximately 1.57 instead of 15.2.

This does not mean that it should be 9.7 times faster. Different machine instructions have different costs, and a vector instruction may perform a wider and more complex operation than a scalar instruction. The vector kernel, for example, uses byte rearrangement and per-lane shifts.

The instruction counts are therefore consistent with substantially different execution shapes. The throughput measurement shows the actual performance consequence: on this machine, the lower instruction count translates into an improvement of approximately 4.7 times.

Results

The three baselines answer progressively different questions. The first measures the advantage over a generic scalar decoder for which SuperWord provides no measurable benefit. The second gives the compiler a regular, byte-aligned loop. The third compares the vector kernel with Parquet Java’s width-specialized scalar implementation.

Baseline 1: a generic scalar decoder

The first baseline is the generic scalar loop shown earlier.

It loads a word from a variable byte position, shifts it by a variable offset, applies a mask, and stores one decoded value.

The hardware counters show approximately:

15.2 instructions per decoded value

This value remains nearly constant across the tested widths.

Disabling SuperWord does not materially change the measurements. On the tested system, the loop therefore serves as the effectively scalar baseline.

Against this baseline, the explicit vector kernel reaches approximately 4.7 times the throughput.

That result is valid, but it answers a limited question:

Is explicit SIMD faster than a scalar loop for which SuperWord provides no measurable benefit?

The answer is yes, but this comparison establishes only the upper end of the observed advantage.

The comparison does not yet show how much advantage remains when the scalar code has a shape that the compiler can optimize more effectively.

Baseline 2: a regular widening loop

At an aligned width such as 8 bits, decoding can be expressed as a simple widening conversion:

for (int i = 0; i < count; i++) {
    output[i] = source[i] & 0xFF;
}

This is a stronger baseline because its fixed-stride memory accesses and independent iterations give C2 a realistic opportunity to optimize the loop.

This widening loop is not presented as an alternative implementation for arbitrary-width unpacking. It serves as a compiler-friendly reference that shows how much advantage remains when C2 is given an almost ideal scalar loop shape.

With the default compiler configuration, it processed approximately:

5,961 values per microsecond

The explicit vector implementation was approximately 1.65 times faster.

That is a meaningful gain, but it is substantially smaller than the 4.7-fold advantage measured against the generic scalar decoder.

The benchmark also produced an unexpected result. With SuperWord disabled, the same widening loop processed approximately:

7,495 values per microsecond

It was faster without the optimization pass. When the vector implementation is compared with this faster scalar result, its advantage falls to approximately 1.3 times.

This result should be interpreted narrowly. It does not show that auto-vectorization is generally harmful. It establishes only that, on this host and for this loop, enabling SuperWord produced a measurable performance regression.

Bytecode inspection cannot explain the regression because it arises during C2 optimization and machine-code generation, after the bytecode has been parsed. Readable assembly was not available on the benchmark host, so the precise mechanism remains unresolved.

The broader result does not depend on that mechanism: when the scalar baseline has a regular, compiler-friendly shape, explicit vectorization has substantially less room to improve it.

Baseline 3: Parquet Java’s specialized scalar fallback

Parquet Java provides scalar unpacking routines specialized for individual bit widths. These routines are part of the core parquet-encoding module and form the commonly available fallback when the separate optional vector module is absent or not selected.

Instead of calculating the extraction position generically for every value, each routine encodes the required loads, shifts, masks, and combinations directly for one particular bit width.
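The idea can be illustrated with a hand-written routine for a small width. The following sketch is not Parquet Java's generated code; it is a hypothetical example, assuming LSB-first packing, of what fixing every load, shift, and mask for width 3 looks like.

```java
final class SpecializedWidth3Sketch {

    /** Unpacks 8 values of 3 bits each from 3 bytes; every shift and mask is a constant. */
    static void unpack8Values(byte[] in, int inPos, int[] out, int outPos) {
        int b0 = in[inPos] & 0xFF;
        int b1 = in[inPos + 1] & 0xFF;
        int b2 = in[inPos + 2] & 0xFF;
        out[outPos]     =  b0                      & 7;
        out[outPos + 1] = (b0 >>> 3)               & 7;
        out[outPos + 2] = ((b0 >>> 6) | (b1 << 2)) & 7;   // straddles bytes 0 and 1
        out[outPos + 3] = (b1 >>> 1)               & 7;
        out[outPos + 4] = (b1 >>> 4)               & 7;
        out[outPos + 5] = ((b1 >>> 7) | (b2 << 1)) & 7;   // straddles bytes 1 and 2
        out[outPos + 6] = (b2 >>> 2)               & 7;
        out[outPos + 7] = (b2 >>> 5)               & 7;
    }
}
```

At width 3 most values come from a single byte. At width 17 every value spans three source bytes and at width 25 it spans four, so a routine written in this byte-by-byte style needs several loads, shifts, and OR operations per value.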

This removes some runtime calculations, but it can also expand the decoding operation into a longer sequence of scalar instructions. At wider, irregular bit widths, a value may have to be assembled from several parts of the packed input, requiring additional loads, shifts, and bitwise combinations.

Specialization and vectorization are therefore separate optimizations. Specialization fixes the extraction pattern in advance; vectorization applies operations to several values in parallel. If C2 does not derive an effective vector form from the specialized routine, the result may simply be a longer scalar instruction sequence than the generic decoder executes.

That is what the measurements indicate at several wide, irregular widths on the tested system. Parquet’s specialized scalar unpacker retires approximately 8 instructions per value at smaller widths, but up to approximately 23 at wider widths. Disabling SuperWord leaves the count essentially unchanged, providing no measurable evidence that SuperWord improves these routines on this system.

At widths 17 and 25, the specialized routines therefore retire more instructions and achieve lower throughput than the generic scalar decoder:

width 17:
Parquet specialized scalar: approximately 1,277 values/µs
generic scalar:             approximately 1,992 values/µs

and:

width 25:
Parquet specialized scalar: approximately 1,004 values/µs
generic scalar:             approximately 1,992 values/µs

Figure 1: throughput of the generic and specialized Parquet scalar decoders. At widths 17 and 25, the specialized implementation retires more instructions and falls below the generic loop.

The result is narrow. It does not mean that Parquet is generally slow. The comparison concerns only Parquet Java’s scalar fallback. The same library also ships a separate optional vector implementation, selected only on supported platform and processor configurations; that path is not being evaluated in this baseline.

Nor does the result show that per-width specialization is inherently harmful. Its effectiveness depends on the implementation, compiler, processor, and bit width.

The supported conclusion is:

On this host, Parquet’s per-width scalar specialization did not improve performance monotonically. At some wide, irregular widths, it was slower than the generic scalar loop.

The main result

Up to a bit width of 25, the explicit vector kernel retires approximately 1.57 instructions per decoded value, with little variation across the tested widths. The reason is structural: the kernel performs the same load → rearrange → shift → mask → store sequence at every width. Only the shuffle, shift vector, and mask change.

The scalar baselines do not show the same stability. Their cost depends on the shape of the loop and on the machine code C2 derives from it.

Figure 2: instructions per decoded value. The vector kernel stays nearly flat through width 25. The generic decoder remains at approximately 15.2 instructions per value, while Parquet’s specialized scalar cost increases with the width.

The measurements represent three different compiler situations rather than points on a single speedup curve.

At width 8, decoding is a regular widening operation with fixed-stride memory accesses. Depending on whether the default or the fastest measured scalar configuration is used, the explicit vector kernel is approximately 1.3 to 1.65 times faster.

Width 16 remains byte-aligned, but its scalar cost is already substantially higher. Enabling SuperWord changes the instruction count only slightly, from 6.53 to 6.38 instructions per value. Without readable assembly or compiler diagnostics, the experiment cannot determine whether the loop remained scalar, was partially vectorized, or produced a vector form whose additional operations erased the expected gain. What the measurements establish is that byte alignment alone does not produce a scalar baseline as lean as the width-8 widening loop. The explicit vector kernel is approximately 2.4 to 2.65 times faster.

At irregular widths such as 13, 17, and 25, enabling SuperWord provides no measurable benefit for the relevant scalar unpacking loops. Against these baselines, the explicit vector kernel reaches approximately 4.7 times the throughput.

The larger speedup at irregular widths does not come from the vector kernel becoming faster. Its cost remains essentially constant. What changes is the cost and compiler behavior of the scalar baseline.

Figure 3: throughput by compiler situation. The ratios are not points on one continuous curve because the relevant scalar baseline differs between aligned and irregular widths.

The width-26 transition

The stable vector result has a clear boundary.

Up to 25 bits, each value together with its maximum intra-byte offset fits inside a 32-bit working lane. On the benchmark host, a 512-bit vector can therefore process sixteen 32-bit lanes at once:

512 bits / 32 bits = 16 lanes

At width 26, a 32-bit working window is no longer sufficient. The kernel switches to 64-bit lanes:

512 bits / 64 bits = 8 lanes

This halves the number of values processed per vector operation. The decoded values must also be narrowed from 64-bit lanes back to Java int values.
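The boundary follows from simple arithmetic: a value of a given width can start at a bit offset of up to 7, so its working window must hold width + 7 bits. A sketch of that rule, with hypothetical helper names:

```java
final class LaneWidthSketch {
    static final int MAX_BIT_OFFSET = 7;   // a value can start at any bit of a byte

    /** Narrowest lane (32 or 64 bits) that holds a value plus its largest intra-byte offset. */
    static int laneBitsFor(int bitWidth) {
        return bitWidth + MAX_BIT_OFFSET <= 32 ? 32 : 64;
    }

    /** Number of values decoded per vector of the given size. */
    static int valuesPerVector(int vectorBits, int bitWidth) {
        return vectorBits / laneBitsFor(bitWidth);
    }
}
```

Width 25 gives 25 + 7 = 32, which still fits a 32-bit lane and yields 16 values per 512-bit vector. Width 26 gives 33, which forces 64-bit lanes and 8 values per vector.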

The measured instruction cost rises from approximately 1.57 to 2.88 instructions per value, while throughput falls by a factor of about 1.4.

The two ratios are not expected to match. The additional narrowing operation and the wider lane operations have different execution costs and may overlap differently inside the processor. The instruction count identifies the change in computational shape; it does not predict throughput proportionally.

This transition differs from the scalar behavior at irregular widths. The scalar results depend on what C2 can derive from each loop. The width-26 transition follows directly from the design of the vector kernel: it uses fewer lanes and requires an additional narrowing step.

The exact transition is specific to the preferred vector species on this host. Systems with narrower vectors may use a different implementation path above 25 bits, potentially including a scalar fallback.

What the speedups mean

It would be tempting to summarize the experiment as:

The Vector API makes bit unpacking five times faster.

That statement is valid only against the generic scalar decoder, for which SuperWord provides no measurable benefit on the tested system.

Against the regular width-8 widening loop, the advantage is approximately 1.3 to 1.65 times. Against the irregular scalar unpacking loops, it rises to approximately 4.7 times.

These results are not contradictory. They compare the same vector kernel with scalar baselines that exhibit substantially different instruction counts and throughput characteristics.

A SIMD speedup is therefore meaningful only when the scalar baseline gives the JIT a realistic opportunity to optimize the loop. Otherwise, a large ratio may reflect the limitations of the comparison code as much as the strength of the vector implementation.

Conclusion

The experiment shows that explicit vectorization is most valuable when parallel work exists but is hidden behind access patterns and operations that the compiler cannot recover reliably.

Three practical consequences follow.

First, explicit SIMD should be benchmarked against the strongest scalar implementation the JIT can actually produce, not merely against the most obvious scalar loop.

Second, specialization and vectorization must be evaluated separately. Encoding a width-specific extraction pattern removes some runtime calculations, but it does not guarantee parallel machine code. On the tested host, some of Parquet Java’s specialized scalar routines retire more instructions and achieve lower throughput than the generic decoder at wide, irregular widths.

Third, the Vector API should be used selectively. For regular loops that the compiler already handles effectively, the additional complexity may buy only a modest gain. For irregular hot kernels, expressing the vector structure directly can provide both higher throughput and more predictable behavior across input shapes.

Parquet bit unpacking is the concrete case examined here, but the same reasoning applies to JVM kernels used in parsing, compression, encoding, hashing, and numerical processing.

References

Daniel Lemire and Leonid Boytsov. “Decoding billions of integers per second through vectorization.” Software: Practice and Experience, 45(1), pp. 1–29, 2015. DOI: 10.1002/spe.2203.
Apache Parquet. “PARQUET-2159: Parquet bit-packing de/encode optimization.” Apache Software Foundation issue tracker, https://issues.apache.org/jira/browse/PARQUET-2159. The optional vector path ships as a separate parquet-encoding-vector module; the version examined here is parquet-java 1.16.0.
OpenJDK. “JEP 338: Vector API (Incubator).” Initial Vector API incubation in JDK 16. https://openjdk.org/jeps/338
OpenJDK. “JEP 508: Vector API (Tenth Incubator).” Vector API incubation used in JDK 25. https://openjdk.org/jeps/508
OpenJDK. “JEP 454: Foreign Function & Memory API.” Finalized in JDK 22. https://openjdk.org/jeps/454
Daniel Lemire, Leonid Boytsov, and Nathan Kurz. “SIMD Compression and the Intersection of Sorted Integers.” Software: Practice and Experience, 46(6), pp. 723–749, 2016. DOI: 10.1002/spe.2326.
Christian Del Monte. decode-shape: benchmark, measurement, and verification harness for the experiments in this article. https://github.com/cdelmonte-zg/decode-shape
