Unpacking Parquet: Explicit SIMD, Scalar Baselines, and What HotSpot Makes of Them

Introduction

On the JVM, optimizing a hot kernel is not only about writing faster code. It also requires understanding the machine code HotSpot derives from the scalar implementation.

Using Parquet bit-unpacking as a concrete case, this experiment asks:

When does an explicitly vectorized routine provide a real advantage over ordinary Java code optimized by HotSpot?

A scalar Java loop is not a neutral baseline. HotSpot may recognize its structure and emit SIMD instructions, or leave it scalar when the access pattern is harder to prove. The measured speedup therefore depends not only on the vector implementation, but also on the scalar loop against which it is compared.

The results show when explicit vectorization is justified, how strongly its advantage depends on loop shape, and why a more specialized scalar routine is not necessarily faster.

From bit-packed indices to integers

Parquet dictionary encoding stores repeated values once and represents them in the data page through small integer indices. Those indices are bit-packed using only the number of bits the dictionary size requires: a dictionary of up to 8192 entries needs 13 bits per index. Reading the page therefore requires reconstructing an int[] of indices before the dictionary lookup can recover the original values.
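The width follows directly from the dictionary size. A minimal sketch, assuming the largest index is dictionarySize - 1; the class and method names are illustrative and are not taken from Parquet Java:

```java
final class IndexBitWidth {
    /** Bits needed to represent every index in [0, dictionarySize - 1]. */
    static int forDictionarySize(int dictionarySize) {
        if (dictionarySize < 1) {
            throw new IllegalArgumentException("dictionary must have at least one entry");
        }
        int maxIndex = dictionarySize - 1;
        // 32 minus the leading zeros gives the position of the highest set bit, counted from 1.
        return 32 - Integer.numberOfLeadingZeros(maxIndex);
    }

    public static void main(String[] args) {
        System.out.println(forDictionarySize(8192)); // 13
        System.out.println(forDictionarySize(8193)); // 14
    }
}
```

One entry more than 8192 pushes the width to 14 bits, which is why the bit width changes in steps as a dictionary grows.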

This article isolates that reconstruction step:

bit-packed dictionary indices
            ↓
bit unpacking
            ↓
int[] of dictionary indices

To decode one value, a scalar decoder locates where it starts in the bit stream, loads the bytes around it, shifts away the preceding bits, applies a mask, and stores the resulting integer:

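long mask = (1L << bitWidth) - 1;                  // keeps the low bitWidth bits
long bitPosition = 0;                              // start of the next value, in bits
for (int i = 0; i < count; i++) {
    long bytePosition = bitPosition >>> 3;         // byte that holds the value's first bit
    int bitOffset = (int) (bitPosition & 7);       // position of that bit inside the byte
    long word = source.get(LONG_LE, bytePosition); // unaligned little-endian 64-bit load
    output[i] = (int) ((word >>> bitOffset) & mask);
    bitPosition += bitWidth;
}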

The decoder processes one value at a time. At a width of 13 bits, the bit position advances by a fixed increment, so the pattern is regular at the bit level. The byte position and shift distance derived from it are not: the load address advances by one or two bytes per value, and the bit offset cycles through all eight possible values. The loop is therefore not a constant-stride, element-by-element array operation, and that is the distinction the compiler cares about. A more detailed introduction to Parquet dictionary encoding and bit unpacking is available separately, in From Dictionary Indices to Integers.
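To try the scalar loop in isolation, the following sketch packs a few values and decodes them again. It is an illustration under stated assumptions rather than the benchmark code: it reads through a little-endian ByteBuffer instead of a MemorySegment, the pack helper and the 8-byte padding constant are hypothetical, and values are assumed to be packed least-significant bit first.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class ScalarUnpackSketch {
    // Assumption: 8 spare bytes so the 64-bit load at the last value stays in bounds.
    static final int PADDING = 8;

    /** Hypothetical test helper: packs values LSB-first at the given width. */
    static byte[] pack(int[] values, int bitWidth) {
        int payloadBytes = (int) (((long) values.length * bitWidth + 7) / 8);
        byte[] packed = new byte[payloadBytes + PADDING];
        long bitPosition = 0;
        for (int value : values) {
            for (int bit = 0; bit < bitWidth; bit++, bitPosition++) {
                if (((value >>> bit) & 1) != 0) {
                    int bitInByte = (int) (bitPosition & 7);
                    packed[(int) (bitPosition >>> 3)] |= (byte) (1 << bitInByte);
                }
            }
        }
        return packed;
    }

    /** The scalar loop from the text, reading through a little-endian ByteBuffer. */
    static int[] unpack(byte[] packed, int count, int bitWidth) {
        ByteBuffer source = ByteBuffer.wrap(packed).order(ByteOrder.LITTLE_ENDIAN);
        long mask = (1L << bitWidth) - 1;
        int[] output = new int[count];
        long bitPosition = 0;
        for (int i = 0; i < count; i++) {
            int bytePosition = (int) (bitPosition >>> 3);
            int bitOffset = (int) (bitPosition & 7);
            long word = source.getLong(bytePosition); // absolute read, no alignment needed
            output[i] = (int) ((word >>> bitOffset) & mask);
            bitPosition += bitWidth;
        }
        return output;
    }

    public static void main(String[] args) {
        int[] values = {0, 1, 8191, 4660, 42};
        int[] decoded = unpack(pack(values, 13), values.length, 13);
        System.out.println(Arrays.equals(values, decoded)); // true
    }
}
```

The 64-bit load is wide enough because a value of at most 32 bits plus an offset of at most 7 bits always fits inside one word.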

How Java reaches SIMD

Scalar execution and SIMD

A scalar loop processes one value after another:

value 0 → shift → mask → store
value 1 → shift → mask → store
value 2 → shift → mask → store
...

Modern processors can also apply one operation to several values at once. This is SIMD: Single Instruction, Multiple Data.

For example, on the machine used for this experiment, a 512-bit vector register can hold sixteen 32-bit integer lanes:

┌────┬────┬────┬────┬────┬────┬────┬────┐
│ v0 │ v1 │ v2 │ v3 │ .. │v13 │v14 │v15 │
└────┴────┴────┴────┴────┴────┴────┴────┘

A vector shift can operate on all those lanes. The same applies to masking and other arithmetic operations.

Java has two different ways to reach such instructions.

Two ways Java reaches vector instructions

The first route is auto-vectorization.

HotSpot contains an optimizing just-in-time compiler called C2. When a method becomes sufficiently hot, C2 compiles its bytecode into optimized machine code.

One of its optimization passes, called SuperWord, looks for isomorphic, independent scalar operations that can be packed into SIMD operations, provided their memory dependencies and access patterns can be proven safe.

Ordinary scalar-looking Java code can therefore become vector machine code.

For example:

for (int i = 0; i < count; i++) {
    output[i] = a[i] + b[i];
}

The programmer has written a scalar loop. Each source-level iteration adds a single pair of elements.

C2 may nevertheless recognize that the iterations are independent and access memory in a regular stride, and emit machine code that processes several elements at once:

ordinary scalar Java loop
        ↓
C2 and SuperWord analyze the loop
        ↓
SIMD machine instructions, if recognition succeeds

The second route is the Vector API.

With the Vector API, the programmer expresses the vector computation directly. On supported hardware, and for operations the target can map efficiently, HotSpot lowers it to the matching SIMD instructions; otherwise the API stays functional but may fall back to a less efficient form:

Vector API operations
        ↓
HotSpot lowers the explicit vector computation
        ↓
SIMD instructions when supported efficiently

The Vector API did not introduce SIMD to Java. HotSpot had already been capable of generating vector instructions from suitable scalar loops for many years.

The difference is where the vector structure comes from. With auto-vectorization, C2 must infer that structure from scalar code. With the Vector API, the structure is explicit in the program.

This distinction leads to the central question:

When can C2 derive efficient machine code from ordinary scalar Java, and when does the vector structure need to be expressed explicitly through the Vector API?

Why loop shape matters

C2 does not vectorize every loop. It must be able to prove that the transformation is safe and recognize a pattern that it knows how to pack into vector operations.

A byte-aligned conversion is comparatively simple:

for (int i = 0; i < count; i++) {
    output[i] = source[i] & 0xFF;
}

Every iteration reads the next byte, applies the same mask, and writes the next integer:

source[0] → output[0]
source[1] → output[1]
source[2] → output[2]

The memory accesses have a fixed stride. The iterations are independent. The operation is identical for every element.

This regular shape allows SuperWord to recognize independent parallel work and construct load and store packs. Whether those packs can be connected into a vectorized computation depends on the operations between them. Here that operation is an implicit unsigned widening, and on the examined JDK it lies outside the conversions SuperWord can currently vectorize, so the loop stays scalar.

Bit unpacking at 13 bits has a different shape.

Successive values begin at changing positions:

value 0 starts at bit 0
value 1 starts at bit 13
value 2 starts at bit 26
value 3 starts at bit 39

The scalar decoder calculates a byte position and bit offset for every value:

long bytePosition = bitPosition >>> 3;
int bitOffset = (int) (bitPosition & 7);
long word = source.get(LONG_LE, bytePosition);
output[i] = (int) ((word >>> bitOffset) & mask);
bitPosition += bitWidth;

From the compiler’s perspective, the loop contains:

input positions derived from a changing bit position;
shift distances that vary between iterations;
values that may cross byte boundaries;
relationships between neighboring bytes that are not expressed as direct element-wise operations.
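The first two points can be made concrete by tabulating where each value starts. A small sketch, with illustrative names, that prints the start bit, byte position, and bit offset for the first values at width 13:

```java
final class BitPositions {
    /** Returns {startBit, bytePosition, bitOffset} for value i at the given width. */
    static long[] locate(int i, int bitWidth) {
        long startBit = (long) i * bitWidth;
        return new long[] {startBit, startBit >>> 3, startBit & 7};
    }

    public static void main(String[] args) {
        for (int i = 0; i <= 8; i++) {
            long[] p = locate(i, 13);
            System.out.printf("value %d: bit %d, byte %d, offset %d%n", i, p[0], p[1], p[2]);
        }
    }
}
```

At width 13 the byte position advances by one or two bytes per value, and the bit offset runs through 0, 5, 2, 7, 4, 1, 6, 3 before the pattern repeats at value 8, which starts at byte 13 with offset 0.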

A programmer can see that several values could still be extracted in parallel. The compiler, however, would have to reconstruct that higher-level structure from the scalar operations.

The explicit Vector API kernel avoids that recognition problem by expressing the parallel structure directly.

The explicit vector kernel

The vector implementation processes a block of packed values in several stages:

load packed bytes
        ↓
rearrange bytes into vector lanes
        ↓
shift each lane by its required offset
        ↓
apply the bit mask
        ↓
store decoded integers

In simplified Vector API form:

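ByteVector.fromMemorySegment(              // load one block of packed bytes
            BYTE_SPECIES,
            source,
            baseByte,
            LITTLE_ENDIAN)
        .rearrange(shuffle)                // move each value's bytes into its lane
        .reinterpretAsInts()               // view the bytes as 32-bit lanes
        .lanewise(LSHR, shiftVector)       // drop the bits that precede each value
        .lanewise(AND, maskVector)         // keep the low bitWidth bits
        .intoArray(output, outputOffset);  // store the decoded integers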

Three values depend on the bit width:

the byte shuffle;
the shift applied to each lane;
the mask.

They are computed for the selected width and reused.
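The article does not spell out how these three parameters are derived. One plausible derivation, offered as an assumption rather than as the kernel's actual code, takes each lane's first source byte and shift distance from the lane's starting bit. The sketch below emulates the rearrange, reinterpret, shift, and mask steps in plain Java for one block of sixteen 32-bit lanes; it does not use the Vector API, and every name in it is hypothetical.

```java
import java.util.Arrays;

final class LaneParameterSketch {
    static final int LANES = 16;       // sixteen 32-bit lanes in a 512-bit vector
    static final int BLOCK_BYTES = 64; // one 512-bit load

    /** First source byte of each lane's 4-byte window; a byte shuffle would gather from here. */
    static int[] byteOffsetsFor(int bitWidth) {
        int[] offsets = new int[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            offsets[lane] = (lane * bitWidth) >>> 3;
        }
        return offsets;
    }

    /** Right-shift distance per lane: the value's bit offset inside its first byte. */
    static int[] shiftsFor(int bitWidth) {
        int[] shifts = new int[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            shifts[lane] = (lane * bitWidth) & 7;
        }
        return shifts;
    }

    /** Scalar emulation of rearrange -> reinterpretAsInts -> LSHR -> AND for one block. */
    static int[] decodeBlock(byte[] block, int bitWidth) {
        if (bitWidth < 1 || bitWidth > 25) {
            throw new IllegalArgumentException("32-bit lanes cover widths 1 to 25 only");
        }
        int[] offsets = byteOffsetsFor(bitWidth);
        int[] shifts = shiftsFor(bitWidth);
        int mask = (1 << bitWidth) - 1;
        int[] output = new int[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            int b = offsets[lane];
            // Four consecutive bytes form the lane's little-endian 32-bit window.
            int window = (block[b] & 0xFF)
                    | (block[b + 1] & 0xFF) << 8
                    | (block[b + 2] & 0xFF) << 16
                    | (block[b + 3] & 0xFF) << 24;
            output[lane] = (window >>> shifts[lane]) & mask;
        }
        return output;
    }

    /** Hypothetical test helper: packs sixteen values LSB-first into a 64-byte block. */
    static byte[] pack(int[] values, int bitWidth) {
        byte[] block = new byte[BLOCK_BYTES];
        int bitPosition = 0;
        for (int value : values) {
            for (int bit = 0; bit < bitWidth; bit++, bitPosition++) {
                if (((value >>> bit) & 1) != 0) {
                    block[bitPosition >>> 3] |= (byte) (1 << (bitPosition & 7));
                }
            }
        }
        return block;
    }

    public static void main(String[] args) {
        int[] values = new int[LANES];
        for (int i = 0; i < LANES; i++) {
            values[i] = (i * 523 + 7) & 0x1FFF;
        }
        int[] decoded = decodeBlock(pack(values, 13), 13);
        System.out.println(Arrays.equals(values, decoded)); // true
    }
}
```

Under this derivation the per-lane loop body is identical for every width from 1 to 25; only the contents of the offset, shift, and mask tables differ, which mirrors the width-independent shape described in the text.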

For widths up to 25 bits, the sequence of operations does not otherwise change:

load → rearrange → shift → mask → store

Only the parameters vary with the width, so the kernel's computational shape is largely independent of it. That is the structural reason the vector cost stays nearly flat across these widths, which the measurements confirm later.

Why the baseline matters

The term “scalar baseline” describes source code, not necessarily the quality or even the execution shape of the machine code produced from it.

Two loops written as ordinary scalar Java may lead HotSpot to generate substantially different code. A regular loop may compile into a compact and efficient scalar form, or in some cases into SIMD instructions. A more irregular loop may retain address calculations, variable shifts, and other per-value work that the compiler cannot reorganize effectively.

Comparing the Vector API only with the latter can make explicit SIMD appear overwhelmingly superior. Comparing it with the strongest machine code HotSpot can derive from a relevant scalar form shows how much advantage remains against a genuinely competitive baseline.

The experiment therefore uses several scalar baselines. It does not assume that a compiler-friendly source loop is successfully auto-vectorized; the generated machine code is itself part of the evidence.

Positioning the experiment

Relation to existing work

Vectorized integer unpacking is established work.

Lemire and Boytsov demonstrated that blocks of compressed integers can be decoded at very high throughput using explicit SIMD implementations. Their work helped establish a family of techniques based on fixed-width blocks, width-specific decoding strategies, vector rearrangement, shifts, and masks.

PARQUET-2159 brought this line of optimization directly into Parquet Java. Its benchmarks compared the existing Parquet unpacking implementation with routines written using the Java Vector API. The resulting work became the basis of Parquet Java’s optional vector implementation.

What this experiment examines differently

This article does not introduce a new unpacking technique. Instead, it examines how the measured advantage of explicit vectorization changes with the scalar baseline against which it is evaluated.

Previous work has established that explicit SIMD unpacking can outperform scalar implementations. In this experiment, however, the baseline itself is treated as part of the result. The comparison includes regular scalar loops whose performance is materially affected by the machine code HotSpot can derive from them, as well as irregular loops for which SuperWord provides no measurable benefit. This makes it possible to distinguish how much of the observed SIMD advantage comes from the explicitly vectorized kernel and how much comes from the particular scalar loop used as its baseline.

Machine-code and compiler-trace inspection add a further result: a regular scalar baseline can remain highly competitive without being auto-vectorized. For the byte-to-int widening loop, SuperWord constructs implementable load and store packs but rejects their def-use relationship because their vector element sizes differ and the implicit widening is not one of its recognized conversion idioms.

The experiment also evaluates Parquet Java’s specialized scalar routines separately from its vectorized path. The results show that scalar specialization does not necessarily produce a stronger baseline, particularly for wider and more irregular bit widths.

Experimental setup

Benchmark environment

The measurements come from a controlled, single-threaded microbenchmark on one machine:

AMD Ryzen 9 7950X3D
Zen 4
x86-64 Linux, little-endian
AVX-512 with VBMI/VBMI2
OpenJDK 25.0.3
Vector API, tenth incubator
512-bit preferred vector species

This configuration satisfies the platform and CPU-feature checks applied by ParquetReadRouter in the Parquet Java version examined here.

Scope and limitations

The benchmark measures in-cache decoding kernels rather than the complete cost of reading a Parquet file.

A real Parquet scan may also include:

storage access;
page reads;
decompression;
the dictionary lookup (gather into the dictionary page);
null handling;
filtering;
memory allocation;
scheduling;
downstream execution;
memory-bandwidth contention.

In a complete query, these costs may dominate bit unpacking.

The purpose of the microbenchmark is not to predict total query speed. It is to isolate the encoded-to-decoded boundary and make its computational shape measurable. The benchmark stops once it has produced the int[] of decoded indices; it does not include the subsequent dictionary lookup or the materialization of the original column values.

The results are also specific to one processor and one JDK.

A different machine or compiler version could change:

the absolute throughput;
the machine code C2 produces for the scalar baselines, including whether it auto-vectorizes them;
the effectiveness of the specialized scalar routines;
the magnitude or nature of the width-26 transition;
the behavior of SuperWord on the widening loop;
the cost of the vector kernel’s own operations, such as byte rearrangement and other cross-lane steps, whose lowering depends on the available instruction set.

The role of MemorySegment

Before comparing baselines, one possible confound must be removed. The experiment reads its input through a MemorySegment.

A MemorySegment is a bounded region of memory, on the heap or off-heap, introduced by the Foreign Function & Memory API finalized in JDK 22.

It would be easy to assume that MemorySegment is responsible for a large part of the measured speedup.

It is not.

In the benchmark, a scalar decoder reading from a heap byte[] and an equivalent decoder reading from an off-heap MemorySegment produced almost identical throughput.

This result is unsurprising. The two implementations use the same extraction algorithm, perform the same bit arithmetic, and operate on data that fits in cache. Only the source of the load changes.

MemorySegment is therefore not the performance result; its role is architectural. It supplies a bounded memory abstraction from which the Vector API can load directly, without requiring an intermediate Java array.

It is the substrate of the vector implementation, not its explanation.

Input padding and boundary handling

The decoders use fixed-width loads that may read beyond the last byte logically required for the values being decoded. The scalar decoders load one 64-bit word per value, while the main vector kernel loads a 64-byte ByteVector per block; values outside complete vector blocks are handled by a scalar fallback.

To keep these loads within the allocated memory region without adding an end-of-buffer branch to the measured kernels, the benchmark appends 64 zero-filled bytes to the encoded data. The padding is sized for the widest load: the vector block.

All decoders operate under the same padded-input contract. The measured advantage therefore cannot be attributed to a scalar- or vector-specific end-of-buffer path.
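The padded-input contract can be sketched in a few lines. The benchmark's own allocation code is not shown in this article, so the names below are hypothetical and a heap byte[] stands in for the real input buffer:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class PaddedInput {
    static final int PADDING_BYTES = 64; // sized for one full 512-bit vector load

    /** Copies the encoded bytes and appends zero-filled padding. */
    static byte[] withPadding(byte[] encoded) {
        return Arrays.copyOf(encoded, encoded.length + PADDING_BYTES);
    }

    /** 64-bit load that starts at the last encoded byte; in bounds only because of the padding. */
    static long loadAtLastByte(byte[] padded, int encodedLength) {
        return ByteBuffer.wrap(padded)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getLong(encodedLength - 1);
    }

    public static void main(String[] args) {
        byte[] encoded = {1, 2, 3, (byte) 0xAB};
        byte[] padded = withPadding(encoded);
        System.out.println(padded.length);                                  // 68
        System.out.println(Long.toHexString(loadAtLastByte(padded, 4)));    // ab
    }
}
```

Because the padding is zero-filled, the bytes read past the logical end contribute only zero bits, and the mask removes them from the decoded value in any case.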

Measurement strategy

Throughput is the primary performance result, but it can vary with processor frequency, core placement, and other run-time conditions. Hardware counters provide complementary evidence by showing whether a throughput difference is accompanied by a substantial difference in the dynamic work performed by the implementations.

For this reason, the analysis also considers retired instructions per decoded value: the number of retired machine instructions attributed to the measured kernel, divided by the number of values decoded. It is not a count of source-level Vector API operations.

This metric is especially useful when its pattern remains stable across runs and bit widths. It indicates how much dynamic instruction work an implementation executes for every decoded value.

Instruction counts must not be interpreted as a direct prediction of throughput. Machine instructions differ in latency, throughput, and execution-resource requirements, and one vector instruction may perform a wider or more complex operation than one scalar instruction. Operations such as byte rearrangement and per-lane shifting may therefore retire relatively few instructions while still carrying a nontrivial execution cost.

The counters are consequently used as explanatory evidence rather than as a substitute for throughput measurements. Machine-code inspection identifies the instructions emitted by C2, while perfasm indicates where sampled cycles accumulate within the resulting hot loops.

Experimental provenance

The throughput measurements, hardware-counter data, and machine-code evidence described above were collected in separate runs.

The reported throughput measurements come from one primary benchmark run using:

3 forks
5 × 3-second warmup iterations
10 × 1-second measurement iterations

The processor governor was fixed to performance. The benchmark did not pin its forks to specific cores: thread placement was left to the operating-system scheduler, so on this multi-CCD processor individual measurements can include core-placement effects. A separate verification run with all forks pinned to a single CCD reproduced the reported ratios within normal run-to-run variability, so the comparisons are not an artifact of core placement.

The hardware counters, including retired instructions per decoded value, were collected separately on the same host using shorter perfnorm passes:

1 fork
2 × 1-second warmup iterations
3 × 1-second measurement iterations

The generated C2 machine code was inspected for the width-8 and width-16 widening loops with SuperWord enabled and disabled. Perfasm profiling was then used to attribute sampled cycles to instructions within the hot loops. This makes it possible to distinguish static code-shape differences from their dynamic cost. Cycle attribution remains approximate on out-of-order hardware.

The benchmark source, raw results, verification scripts, and extracted machine code are available in the accompanying repository. For all four widening configurations, the repository includes the decoder method across all compilation tiers, with the C2 hot loop identified. It also provides a script for building a JDK 25-compatible hsdis plugin and the exact command used to regenerate the complete disassembly.

The disassembly, SuperWord trace, and perfasm attribution come from separate follow-up runs on the same host and JDK. They explain the primary throughput measurements but are not part of the primary benchmark run.

The publication profile and verification report can be reproduced with:

./run_lab.sh --publication

Results

The experiment uses three baselines to answer distinct but related questions. The first measures the advantage over a generic scalar decoder for which SuperWord provides no measurable benefit. The second gives the compiler a regular, byte-aligned loop and then inspects what HotSpot actually produces from it. The third compares the vector kernel with Parquet Java’s width-specialized scalar implementation.

Baseline 1: a generic scalar decoder

The first baseline is the generic scalar loop shown earlier.

It loads a word from a variable byte position, shifts it by a variable offset, applies a mask, and stores one decoded value.

The hardware counters show approximately:

15.2 instructions per decoded value

This value remains nearly constant across the tested widths.

Disabling SuperWord does not materially change the measurements. On the tested system, the loop therefore serves as a baseline for which SuperWord provides no measurable benefit.

Across the reported irregular widths, the explicit vector kernel reaches approximately 4.7 times the throughput of this generic decoder.

That result is valid, but it answers a limited question:

Is explicit SIMD faster than a scalar loop for which SuperWord provides no measurable benefit?

The answer is yes, but the comparison captures only the case in which the scalar decoder retains substantial per-value work.

The comparison does not yet show how much advantage remains when the scalar code has a shape that the compiler can optimize more effectively.

Baseline 2: a regular widening loop

At a byte-aligned width of 8 bits, decoding can be expressed as a simple widening conversion:

for (int i = 0; i < count; i++) {
    output[i] = source[i] & 0xFF;
}

Its fixed-stride memory accesses and independent iterations make it a highly regular and competitive scalar baseline. It is not, however, a positive example of SuperWord auto-vectorization. The loop contains an implicit unsigned widening from byte-sized loads to integer-sized stores, a relationship that the SuperWord implementation in the examined JDK does not recognize as a supported conversion idiom.

The purpose of this baseline is therefore to measure how much advantage explicit vectorization retains against efficient scalar machine code derived from a regular decoding loop. A separate element-width-preserving addition loop is used later as a positive control to verify that SuperWord can successfully vectorize suitable code on the same host and JDK.

With the default compiler configuration, the loop processed approximately 5,961 values per microsecond. The explicit vector implementation was approximately 1.65 times faster, a meaningful gain, but substantially smaller than the 4.7-fold advantage measured against the generic scalar decoder.

Disabling SuperWord produced an unexpected result: throughput increased to approximately 7,495 values per microsecond. Against this faster scalar result, the advantage of the explicit vector implementation fell to approximately 1.3 times.

Why enabling SuperWord makes the scalar loop slower

Machine-code inspection explains the regression. Neither configuration emits SIMD code for the widening operation. With SuperWord disabled, C2 produces a compact loop unrolled eight ways, keeping each decoded value in a general-purpose register until it is stored. With SuperWord enabled, C2 produces a larger loop unrolled sixteen ways, with more simultaneously live scalar values. Several of them are routed through XMM registers as temporary storage and then moved back to general-purpose registers before being stored.

SuperWord OFF (loop unrolled 8×):
    source[i] → GPR → output[i]

SuperWord ON (loop unrolled 16×):
    source[i] → GPR → XMM → GPR → output[i]
                \_____________/
                scalar round trip
                through XMM

These transfers perform no SIMD computation. They add dynamic instruction work and are consistent with the lower throughput. At width 8, approximately 31% of the sampled cycles reported by perfasm within the SuperWord-enabled hot loop are associated with these scalar GPR–XMM round trips, which are absent from the disabled configuration. They are therefore a substantial source of additional dynamic work in the slower loop.

Why the widening loop is not vectorized

A SuperWord diagnostic trace shows why the widening operation remains scalar. C2 recognizes that the loop iterations are independent and constructs a 16-lane LoadUB pack and a 16-lane StoreI pack. Both packs are individually implementable on the target.

The problem is the connection between them:

LoadUB: 16 × 1-byte elements
        ↓ widening
StoreI: 16 × 4-byte elements

SuperWord normally requires a packed definition and its use to have compatible vector element sizes. It supports size-changing relationships only for a limited set of recognized widening idioms. Here, the widening is implicit in LoadUB rather than represented by one of those idioms, so the load and store packs cannot be connected into an accepted vector def-use chain.

This limitation is recognized upstream. The OpenJDK enhancement JDK-8375502, C2 SuperWord: implement unsigned casts, describes the same situation: SuperWord finds LoadUB and StoreI packs that are directly connected without an explicit cast, so the widening is implicit, and proposes inserting the missing cast during auto-vectorization. The enhancement remains open. On the JDK 25 examined here, the widening loop therefore remains scalar.

By contrast, an element-width-preserving loop such as:

for (int i = 0; i < count; i++) {
    output[i] = a[i] + b[i];
}

has a homogeneous def-use chain:

LoadI → AddI → StoreI
4 B     4 B    4 B

SuperWord can connect these packs and vectorize the loop.
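For readers who want to inspect both shapes themselves, the two loops can be isolated as standalone methods. This is a minimal sketch with illustrative names, not the benchmark harness; the comments restate the node chains reported in the text.

```java
final class LoopShapes {
    /** Width-preserving: 4-byte loads, 4-byte add, 4-byte stores (LoadI -> AddI -> StoreI). */
    static void add(int[] a, int[] b, int[] output, int count) {
        for (int i = 0; i < count; i++) {
            output[i] = a[i] + b[i];
        }
    }

    /** Widening: 1-byte unsigned loads feed 4-byte stores (LoadUB -> StoreI). */
    static void widen(byte[] source, int[] output, int count) {
        for (int i = 0; i < count; i++) {
            output[i] = source[i] & 0xFF; // the mask turns the signed byte into 0..255
        }
    }

    public static void main(String[] args) {
        int[] widened = new int[3];
        widen(new byte[] {(byte) 0xFF, 0x7F, (byte) 0x80}, widened, 3);
        System.out.println(java.util.Arrays.toString(widened)); // [255, 127, 128]
    }
}
```

Running the same methods under HotSpot's -XX:-UseSuperWord option is one way to obtain a SuperWord-disabled compilation for comparison; the exact command line used for the article's runs is not given here.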

A positive control rules out a general limitation of the host or the SuperWord pipeline: on the same machine and JDK, the addition loop above vectorizes while the byte-to-int widening loop does not:

SuperWord stage                     homogeneous a[i] + b[i]    widening source[i] & 0xFF
recognizes parallel work            yes                        yes
builds load and store packs         yes                        yes
packs implementable on target       yes                        yes
def-use element sizes compatible    yes (4 B = 4 B)            no (1 B vs 4 B)
outcome                             vectorized                 scalar

This does not imply that SuperWord is generally harmful. For this loop, host, and JDK, however, enabling it produces the less efficient of two scalar forms.

The supported conclusion is:

On this host and JDK, neither of the two inspected scalar configurations emits SIMD for the widening loop. The default configuration therefore represents the normal HotSpot baseline, while the SuperWord-disabled configuration provides the fastest scalar compilation observed in this experiment and thus the more demanding comparison for the explicit vector kernel.

Baseline 3: Parquet Java’s specialized scalar fallback

Parquet Java provides scalar unpacking routines specialized for individual bit widths. These routines are part of the core parquet-encoding module and form the commonly available fallback when the separate optional vector module is absent or not selected.

Instead of calculating the extraction position generically for every value, each routine encodes the required loads, shifts, masks, and combinations directly for one particular bit width.

This removes some runtime calculations, but it can also expand the decoding operation into a longer sequence of scalar instructions. At wider, irregular bit widths, a value may have to be assembled from several parts of the packed input, requiring additional loads, shifts, and bitwise combinations.

Specialization and vectorization are therefore separate optimizations. Specialization fixes the extraction pattern in advance; vectorization applies operations to several values in parallel. If C2 cannot reduce the expanded extraction sequence to sufficiently efficient machine code, the result may simply be a longer scalar instruction sequence than the generic decoder executes.

That is what the measurements indicate at several wide, irregular widths on the tested system. Parquet’s specialized scalar unpacker retires approximately 8 instructions per value at smaller widths, but up to approximately 23 at wider widths. Disabling SuperWord leaves the count essentially unchanged, providing no measurable evidence that SuperWord improves these routines on this system. These counts establish that the specialized routines perform substantially more dynamic work at the wider widths. Unlike the width-8 loop, this experiment does not attribute that cost at the disassembly level. The source-level structure is nevertheless consistent with the result: at these widths, the specialized extraction sequences contain additional loads, shifts, and combinations relative to the generic word-based decoder.

At widths 17 and 25, the specialized routines therefore retire more instructions and achieve lower throughput than the generic scalar decoder:

width 17:
Parquet specialized scalar: approximately 1,277 values/µs
generic scalar:             approximately 1,992 values/µs

and:

width 25:
Parquet specialized scalar: approximately 1,004 values/µs
generic scalar:             approximately 1,992 values/µs

Figure 1: throughput of the generic and specialized Parquet scalar decoders. At widths 17 and 25, the specialized implementation retires more instructions and falls below the generic loop.

The result is narrow. It does not mean that Parquet is generally slow, nor that per-width specialization is inherently harmful. The comparison concerns only Parquet Java’s scalar fallback, not its separate optional vector implementation, which is selected only on supported platform and processor configurations. The effectiveness of specialization also depends on the implementation, compiler, processor, and bit width.

The supported conclusion is:

On this host, Parquet’s per-width scalar specialization did not improve performance monotonically. At some wide, irregular widths, it was slower than the generic scalar loop.

The main performance result

Across the tested widths up to 25 bits, including the irregular widths, the computational shape of the explicit vector kernel remains materially unchanged. It performs the same load → rearrange → shift → mask → store sequence, while only the shuffle, shift vector, and mask vary.

The measurements reflect this stability. In the counter measurements, the vector kernel retires between 1.566 and 1.570 instructions per decoded value across these widths. The scalar results are less uniform: the generic decoder remains nearly flat at a much higher cost, while Parquet’s specialized scalar cost increases with the width.

Figure 2: instructions per decoded value. The vector kernel stays nearly flat through width 25. The generic decoder remains at approximately 15.2 instructions per value, while Parquet’s specialized scalar cost increases with the width.

The resulting speedups describe distinct compiler situations rather than a single continuous trend.

At width 8, the regular widening loop gives HotSpot a highly favorable scalar shape. Depending on whether the default or fastest observed scalar configuration is used, the explicit vector kernel is approximately 1.3 to 1.65 times faster.

At width 16, scalar reconstruction requires more work even though the input remains byte-aligned. The explicit vector kernel is approximately 2.4 to 2.65 times faster.
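The article does not show its width-16 loop. As a hypothetical illustration of why a byte-aligned width can still require more scalar work, a width-16 widening loop over a byte source might combine two bytes per value:

```java
final class Width16Sketch {
    /** Hypothetical width-16 form: two little-endian bytes are combined into each value. */
    static void unpack16(byte[] source, int[] output, int count) {
        for (int i = 0; i < count; i++) {
            int low = source[2 * i] & 0xFF;       // first byte: bits 0..7
            int high = source[2 * i + 1] & 0xFF;  // second byte: bits 8..15
            output[i] = low | (high << 8);
        }
    }

    public static void main(String[] args) {
        int[] output = new int[2];
        unpack16(new byte[] {0x34, 0x12, (byte) 0xFF, (byte) 0xFF}, output, 2);
        System.out.println(Integer.toHexString(output[0])); // 1234
        System.out.println(Integer.toHexString(output[1])); // ffff
    }
}
```

Compared with the width-8 loop, each value in this form needs two loads, a shift, and an OR before the store, even though no value crosses an irregular bit boundary.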

At irregular widths such as 13, 17, and 25, SuperWord provides no measurable benefit to the generic scalar decoder, and the explicit vector kernel reaches approximately 4.7 times its throughput.

The larger speedups at the irregular widths do not result from the vector kernel becoming faster. Its execution cost remains nearly constant through width 25; what changes is the amount and shape of the work that remains in the scalar implementations.

Figure 3: throughput by compiler situation. The vector-to-scalar ratios differ because each comparison uses a scalar implementation with a different operation and machine-code shape.

The width-26 transition

The stable vector result has a clear boundary.

To extract a value, the kernel needs a working window large enough to contain both the value and its possible intra-byte offset. In the worst case, this requires bitWidth + 7 bits. Up to width 25, the required window fits within a 32-bit lane:

25 + 7 = 32 bits

On the benchmark host, a 512-bit vector can therefore hold sixteen such lanes:

512 bits / 32 bits = 16 lanes

At width 26, the working window grows to 33 bits and no longer fits:

26 + 7 = 33 bits

The kernel consequently switches to 64-bit lanes. A 512-bit vector then holds only eight lanes:

512 bits / 64 bits = 8 lanes

This halves the number of values processed per vector block and requires the decoded values to be narrowed back to Java int values before storage.
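The lane arithmetic above can be written as a small function. A minimal sketch, assuming a 512-bit vector and a choice between 32-bit and 64-bit lanes only; the names are illustrative:

```java
final class LaneWidth {
    static final int VECTOR_BITS = 512;

    /** Smallest lane (32 or 64 bits) that holds bitWidth plus a worst-case 7-bit offset. */
    static int laneBits(int bitWidth) {
        if (bitWidth < 1 || bitWidth > 32) {
            throw new IllegalArgumentException("bit width must be between 1 and 32");
        }
        int window = bitWidth + 7;
        return window <= 32 ? 32 : 64;
    }

    /** Number of values decoded per vector block at the given width. */
    static int lanesPerVector(int bitWidth) {
        return VECTOR_BITS / laneBits(bitWidth);
    }

    public static void main(String[] args) {
        System.out.println(lanesPerVector(25)); // 16
        System.out.println(lanesPerVector(26)); // 8
    }
}
```

On a host with a narrower preferred species the boundary stays at width 26, but the lane counts on either side of it shrink accordingly.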

The measured instruction cost rises from approximately 1.57 to 2.88 instructions per value, a factor of about 1.8, while throughput falls by a factor of approximately 1.4.

The two ratios are not expected to match. Wider lanes reduce lane-level parallelism and introduce a narrowing step, but these operations have different execution costs and may overlap within the processor. Retired-instruction count reveals the change in computational shape; it does not predict throughput proportionally.

Unlike the width-dependent behavior of the scalar implementations, this transition follows directly from the vector kernel’s lane-width requirement. Its exact performance impact remains platform-dependent, because it depends on the preferred vector species and the available lowering path.

Conclusion

The value of explicit vectorization cannot be separated from the scalar implementation used to evaluate it.

The reported speedups illustrate this directly. At the reported irregular widths, the explicit vector kernel reaches approximately 4.7 times the throughput of the generic scalar decoder. Against the regular width-8 widening loop, the advantage falls to approximately 1.3 to 1.65 times. These ratios answer different questions because they compare the same vector kernel with scalar implementations that have substantially different execution costs. The larger ratio remains a valid measure of improvement over the generic implementation, but it is not an intrinsic measure of the advantage of explicit SIMD.

Two methodological consequences follow. First, a vector implementation should be compared with the strongest relevant scalar baselines, including compact, regular formulations that the JIT may compile to efficient machine code. A regular scalar baseline can remain highly competitive even when its specific widening idiom is not supported by SuperWord. Second, scalar specialization and vectorization should be evaluated independently. Removing runtime calculations through width-specific code does not by itself guarantee fewer instructions, higher throughput, or parallel machine code.

The Vector API is most compelling when the scalar formulation leaves substantial work on the hot path and prevents HotSpot from deriving sufficiently efficient machine code, not merely when auto-vectorization fails. The explicit kernel examined here retains a stable computational shape across widths up to 25, while the cost of the scalar implementations depends strongly on their extraction logic and generated machine code.

The widening experiment also shows that enabling SuperWord is not evidence that SIMD instructions were emitted. Compiler configuration, generated machine code, retired-instruction counts, and sampled-cycle profiles describe different aspects of execution and must be interpreted together.

Parquet bit unpacking is the concrete case examined here, but the same methodology applies to JVM kernels used in parsing, compression, encoding, hashing, and numerical processing.

References

Daniel Lemire and Leonid Boytsov. “Decoding billions of integers per second through vectorization.” Software: Practice & Experience, 45(1), pp. 1–29, 2015. DOI: 10.1002/spe.2203.
Apache Parquet. “PARQUET-2159: Parquet bit-packing de/encode optimization.” Apache Software Foundation issue tracker, https://issues.apache.org/jira/browse/PARQUET-2159. The optional vector path ships as a separate parquet-encoding-vector module; the version examined here is parquet-java 1.16.0.
OpenJDK. “JEP 338: Vector API (Incubator).” Initial Vector API incubation in JDK 16. https://openjdk.org/jeps/338
OpenJDK. “JEP 508: Vector API (Tenth Incubator).” Vector API incubation used in JDK 25. https://openjdk.org/jeps/508
OpenJDK. “JEP 454: Foreign Function & Memory API.” Finalized in JDK 22. https://openjdk.org/jeps/454
Daniel Lemire, Leonid Boytsov, and Nathan Kurz. “SIMD Compression and the Intersection of Sorted Integers.” Software: Practice & Experience, 46(6), pp. 723–749, 2016. DOI: 10.1002/spe.2326.
Christian Del Monte. decode-shape: benchmark, measurement, and verification harness for the experiments in this article. https://github.com/cdelmonte-zg/decode-shape
OpenJDK. “JDK-8375502: C2 SuperWord: implement unsigned casts.” HotSpot/C2 enhancement, open and unresolved at the time of writing (reported 2026-01-16). https://bugs.openjdk.org/browse/JDK-8375502
