Introduction
On the JVM, optimizing a hot kernel is not only about writing faster code: it is also about understanding how much the result depends on the machine code HotSpot derives from the scalar loop. Using Parquet bit-unpacking as a concrete case, the experiment shows that a SIMD speedup depends on which scalar baseline C2 is handed, when explicit vectorization is actually justified, and why a more specialized scalar routine is not necessarily faster.
The question explored is:
When does an explicitly vectorized bit-unpacking routine provide a real advantage over ordinary Java code optimized by HotSpot?
This is not only a question about Parquet decoding. It is also a question about how SIMD optimizations on the JVM should be evaluated.
A loop written as ordinary scalar Java does not necessarily stay scalar. HotSpot may recognize its structure and emit SIMD instructions, or leave it scalar when the access pattern is harder to prove.
The scalar baseline is therefore not a neutral reference point. It is part of the result.
From bit-packed indices to integers
Parquet dictionary encoding stores repeated values once and represents them in the data page through small integer indices. Those indices are bit-packed using only the number of bits the dictionary size requires: a dictionary of up to 8192 entries needs 13 bits per index. Reading the page therefore requires reconstructing an int[] of indices before the dictionary lookup can recover the original values.
This article isolates that reconstruction step:
bit-packed dictionary indices
↓
bit unpacking
↓
int[] of dictionary indices
To decode one value, a scalar decoder locates where it starts in the bit stream, loads the bytes around it, shifts away the preceding bits, applies a mask, and stores the resulting integer:
long bitPosition = 0;
for (int i = 0; i < count; i++) {
long bytePosition = bitPosition >>> 3;
int bitOffset = (int) (bitPosition & 7);
long word = source.get(LONG_LE, bytePosition);
output[i] = (int) ((word >>> bitOffset) & mask);
bitPosition += bitWidth;
}
The decoder processes one value at a time. At a width of 13 bits, the bit positions advance by a fixed increment, so the pattern is perfectly regular; but it is not a constant-stride, element-by-element array operation, and that is the distinction the compiler cares about. A more detailed introduction to Parquet dictionary encoding and bit unpacking is available separately, in From Dictionary Indices to Integers.
How Java reaches SIMD
Scalar execution and SIMD
A scalar loop processes one value after another:
value 0 → shift → mask → store
value 1 → shift → mask → store
value 2 → shift → mask → store
...
Modern processors can also apply one operation to several values at once. This is SIMD: Single Instruction, Multiple Data.
For example, on the machine used for this experiment, a 512-bit vector register can hold sixteen 32-bit integer lanes:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ v0 │ v1 │ v2 │ v3 │ .. │v13 │v14 │v15 │
└────┴────┴────┴────┴────┴────┴────┴────┘
A vector shift can operate on all those lanes. The same applies to masking and other arithmetic operations.
Java has two different ways to reach such instructions.
Two ways Java reaches vector instructions
The first route is auto-vectorization.
HotSpot contains an optimizing just-in-time compiler called C2. When a method becomes sufficiently hot, C2 compiles its bytecode into optimized machine code.
One of its optimization passes, called SuperWord, looks for isomorphic, independent scalar operations that can be packed into SIMD operations, provided their memory dependencies and access patterns can be proven safe.
Ordinary scalar-looking Java code can therefore become vector machine code.
For example:
for (int i = 0; i < count; i++) {
output[i] = source[i] & 0xFF;
}
The programmer has written a scalar loop. Each source-level iteration processes one byte.
C2 may nevertheless recognize that the iterations are independent and access memory in a regular stride, and emit machine code that processes several elements at once:
ordinary scalar Java loop
↓
C2 and SuperWord analyze the loop
↓
SIMD machine instructions, if recognition succeeds
The second route is the Vector API.
With the Vector API, the programmer expresses the vector computation directly. On supported hardware, and for operations the target can map efficiently, HotSpot lowers it to the matching SIMD instructions; otherwise the API stays functional but may fall back to a less efficient form:
Vector API operations
↓
HotSpot lowers the explicit vector computation
↓
SIMD instructions when supported efficiently
The Vector API did not introduce SIMD to Java. HotSpot had already been capable of generating vector instructions from suitable scalar loops for many years.
The difference is where the vector structure comes from. With auto-vectorization, C2 must infer that structure from scalar code. With the Vector API, the structure is explicit in the program.
This distinction leads to the central question:
When can C2 derive efficient machine code from ordinary scalar Java, and when does the vector structure need to be expressed explicitly through the Vector API?
Why loop shape matters
C2 does not vectorize every loop. It must be able to prove that the transformation is safe and recognize a pattern that it knows how to combine.
A byte-aligned conversion is comparatively simple:
for (int i = 0; i < count; i++) {
output[i] = source[i] & 0xFF;
}
Every iteration reads the next byte, applies the same mask, and writes the next integer:
source[0] → output[0]
source[1] → output[1]
source[2] → output[2]
The memory accesses have a fixed stride. The iterations are independent. The operation is identical for every element.
This is the kind of regular loop shape that gives SuperWord a plausible opportunity to form vector operations.
Bit unpacking at 13 bits has a different shape.
Successive values begin at changing positions:
value 0 starts at bit 0
value 1 starts at bit 13
value 2 starts at bit 26
value 3 starts at bit 39
The scalar decoder calculates a byte position and bit offset for every value:
long bytePosition = bitPosition >>> 3;
int bitOffset = (int) (bitPosition & 7);
long word = source.get(LONG_LE, bytePosition);
output[i] = (int) ((word >>> bitOffset) & mask);
bitPosition += bitWidth;
From the compiler’s perspective, the loop contains:
- input positions derived from a changing bit position;
- shift distances that vary between iterations;
- values that may cross byte boundaries;
- relationships between neighboring bytes that are not expressed as direct element-wise operations.
A programmer can see that several values could still be extracted in parallel. The compiler, however, would have to reconstruct that higher-level structure from the scalar operations.
The explicit Vector API kernel avoids that recognition problem by expressing the parallel structure directly.
The explicit vector kernel
The vector implementation processes a block of packed values in several stages:
load packed bytes
↓
rearrange bytes into vector lanes
↓
shift each lane by its required offset
↓
apply the bit mask
↓
store decoded integers
In simplified Vector API form:
ByteVector.fromMemorySegment(
BYTE_SPECIES,
source,
baseByte,
LITTLE_ENDIAN)
.rearrange(shuffle)
.reinterpretAsInts()
.lanewise(LSHR, shiftVector)
.lanewise(AND, maskVector)
.intoArray(output, outputOffset);
Three values depend on the bit width:
- the byte shuffle;
- the shift applied to each lane;
- the mask.
They are computed for the selected width and reused.
The sequence of operations does not otherwise change for widths up to 25 bits:
load → rearrange → shift → mask → store
At 5, 9, 13, 17, or 25 bits, the parameters change, but the kernel still performs the same classes of operation.
Its computational shape is therefore largely independent of the width.
That is the structural reason the vector cost stays nearly flat across widths, which the measurements confirm later.
Why the baseline matters
The term “scalar baseline” describes source code, not necessarily the quality or even the execution shape of the machine code produced from it.
Two loops written as ordinary scalar Java may lead HotSpot to generate substantially different code. A regular loop may compile into a compact and efficient scalar form, or in some cases into SIMD instructions. A more irregular loop may retain address calculations, variable shifts, and other per-value work that the compiler cannot reorganize effectively.
Comparing the Vector API only with the latter can make explicit SIMD appear overwhelmingly superior. Comparing it with the strongest machine code HotSpot can derive from a relevant scalar form shows how much advantage remains against a genuinely competitive baseline.
The experiment therefore uses several scalar baselines. It does not assume that a compiler-friendly source loop is successfully auto-vectorized; the generated machine code is itself part of the evidence.
Positioning the experiment
Relation to existing work
Vectorized integer unpacking is established work.
Lemire and Boytsov demonstrated that blocks of compressed integers can be decoded at very high throughput using explicit SIMD implementations. Their work helped establish a family of techniques based on fixed-width blocks, width-specific decoding strategies, vector rearrangement, shifts, and masks.
PARQUET-2159 brought this line of optimization directly into Parquet Java. Its benchmarks compared the existing Parquet unpacking implementation with routines written using the Java Vector API. The resulting work became the basis of Parquet Java’s optional vector implementation.
What this experiment measures differently
This article does not introduce a new unpacking technique. Instead, it examines how the measured advantage of explicit vectorization changes with the scalar baseline against which it is evaluated.
Previous work has established that explicit SIMD unpacking can outperform scalar implementations. In this experiment, however, the baseline itself is treated as part of the result. The comparison includes regular scalar loops whose performance is materially affected by the machine code HotSpot can derive from them, as well as irregular loops for which SuperWord provides no measurable benefit. This makes it possible to distinguish how much of the observed SIMD advantage comes from the explicitly vectorized kernel and how much comes from the particular scalar loop used as its baseline.
Machine-code and compiler-trace inspection add a further result: a regular scalar baseline can remain highly competitive without being auto-vectorized. For the byte-to-int widening loop, SuperWord constructs implementable load and store packs but rejects their def-use relationship because their vector element sizes differ and the implicit widening is not one of its recognized conversion idioms.
The experiment also evaluates Parquet Java’s specialized scalar routines separately from its vectorized path. The results show that scalar specialization does not necessarily produce a stronger baseline, particularly for wider and more irregular bit widths.
The central result is therefore not a faster unpacker. It is a measurement of the interaction between baseline selection, scalar specialization, the machine code HotSpot derives from different loop shapes, and explicit vectorization.
Experimental setup
Benchmark environment
The measurements come from a controlled, single-threaded microbenchmark on one machine:
AMD Ryzen 9 7950X3D
Zen 4
x86-64 Linux, little-endian
AVX-512 with VBMI/VBMI2
OpenJDK 25.0.3
Vector API, tenth incubator
512-bit preferred vector species
This configuration satisfies the platform and CPU-feature checks applied by ParquetReadRouter in the Parquet Java version examined here.
The benchmark measures in-cache decoding kernels.
Scope and limitations
It does not measure the complete cost of reading a Parquet file.
A real Parquet scan may also include:
- storage access;
- page reads;
- decompression;
- the dictionary lookup (gather into the dictionary page);
- null handling;
- filtering;
- memory allocation;
- scheduling;
- downstream execution;
- memory-bandwidth contention.
In a complete query, these costs may dominate bit unpacking.
The purpose of the microbenchmark is not to predict total query speed. It is to isolate the encoded-to-decoded boundary and make its computational shape measurable. The benchmark stops once it has produced the int[] of decoded indices; it does not include the subsequent dictionary lookup or the materialization of the original column values.
The results are also specific to one processor and one JDK.
A different machine or compiler version could change:
- the absolute throughput;
- the machine code C2 produces for the scalar baselines, including whether it auto-vectorizes them;
- the effectiveness of the specialized scalar routines;
- the magnitude or nature of the width-26 transition;
- the behavior of SuperWord on the widening loop.
The role of MemorySegment
Before comparing baselines, one possible confound must be removed. The experiment reads its input through a MemorySegment.
A MemorySegment is a bounded region of memory, on the heap or off-heap, introduced by the Foreign Function & Memory API finalized in JDK 22.
It would be easy to assume that MemorySegment is responsible for a large part of the measured speedup.
It is not.
In the benchmark, a scalar decoder reading from a heap byte[] and an equivalent decoder reading from an off-heap MemorySegment produced almost identical throughput.
This result is unsurprising. The two implementations use the same extraction algorithm, perform the same bit arithmetic, and operate on data that fits in cache. Only the source of the load changes.
MemorySegment is therefore not the performance result; its role is architectural. It supplies a bounded memory abstraction from which the Vector API can load directly, without requiring an intermediate Java array.
It is the substrate of the vector implementation, not its explanation.
Input padding and boundary handling
The decoders use fixed-width loads that may cross the logical end of the encoded input when processing its final values. The scalar decoders load one 64-bit word per value, while the main vector kernel loads a 64-byte ByteVector per block; its remaining values are handled by a scalar fallback.
To keep these loads within the allocated memory region without introducing a special end-of-buffer branch into the measured kernels, the benchmark appends 64 zero-filled bytes after the encoded data. The padding is sized for the widest load, the vector block.
The padding is therefore not an optimization available only to one side of the comparison; it keeps bounds handling out of the measured kernels entirely. All decoders operate under the same padded-input contract, so the measured advantage cannot be attributed to a scalar- or vector-specific boundary-handling path.
Experimental provenance
The reported throughput measurements come from one publication-grade run using:
3 forks
5 × 3-second warmup iterations
10 × 1-second measurement iterations
The processor governor was fixed to performance.
The hardware counters, including retired instructions per decoded value, were collected separately on the same host using shorter perfnorm passes:
1 fork
2 × 1-second warmup iterations
3 × 1-second measurement iterations
These counter passes supply the instruction-count metrics, not the throughput numbers.
The generated C2 machine code was inspected for the width-8 and width-16 widening loops with SuperWord enabled and disabled. Perfasm profiling was then used to attribute sampled cycles to instructions within the hot loops. This makes it possible to distinguish static code-shape differences from their dynamic cost. Cycle attribution remains approximate on out-of-order hardware.
The benchmark source, raw results, verification scripts, and extracted machine code are available in the accompanying repository. For all four widening configurations, the repository includes the decoder method across all compilation tiers, with the C2 hot loop identified. It also provides a script for building a JDK 25-compatible hsdis plugin and the exact command used to regenerate the complete disassembly.
The disassembly, SuperWord trace, and perfasm attribution were collected in separate follow-up runs on the same host and JDK. They explain the publication throughput measurements but are not part of the original measurement-of-record.
The publication profile and verification report can be reproduced with:
./run_lab.sh --publication
Interpreting the hardware counters
Throughput is the primary performance result, but it can vary with processor frequency, core placement, and other run-time conditions. Hardware counters provide complementary evidence by showing whether a throughput difference is accompanied by a substantial difference in the dynamic work performed by the implementations.
For this reason, the analysis also considers retired instructions per decoded value: the number of retired machine instructions attributed to the measured kernel, divided by the number of values decoded. It is not a count of source-level Vector API operations.
This metric is especially useful when its pattern remains stable across runs and bit widths. It indicates how much dynamic instruction work an implementation executes for every decoded value.
Instruction counts must not be interpreted as a direct prediction of throughput. Machine instructions differ in latency, throughput, and execution-resource requirements, and one vector instruction may perform a wider or more complex operation than one scalar instruction. Operations such as byte rearrangement and per-lane shifting may therefore retire relatively few instructions while still carrying a nontrivial execution cost.
The counters are consequently used as explanatory evidence rather than as a substitute for throughput measurements. Machine-code inspection identifies the instructions emitted by C2, while perfasm indicates where sampled cycles accumulate within the resulting hot loops.
Results
The three baselines answer progressively different questions. The first measures the advantage over a generic scalar decoder for which SuperWord provides no measurable benefit. The second gives the compiler a regular, byte-aligned loop and then inspects what HotSpot actually produces from it. The third compares the vector kernel with Parquet Java’s width-specialized scalar implementation.
Baseline 1: a generic scalar decoder
The first baseline is the generic scalar loop shown earlier.
It loads a word from a variable byte position, shifts it by a variable offset, applies a mask, and stores one decoded value.
The hardware counters show approximately:
15.2 instructions per decoded value
This value remains nearly constant across the tested widths.
Disabling SuperWord does not materially change the measurements. On the tested system, the loop therefore serves as the effectively scalar baseline.
Against this baseline, the explicit vector kernel reaches approximately 4.7 times the throughput.
That result is valid, but it answers a limited question:
Is explicit SIMD faster than a scalar loop for which SuperWord provides no measurable benefit?
The answer is yes, but this comparison establishes only the upper end of the observed advantage.
The comparison does not yet show how much advantage remains when the scalar code has a shape that the compiler can optimize more effectively.
Baseline 2: a regular widening loop
At an aligned width such as 8 bits, decoding can be expressed as a simple widening conversion:
for (int i = 0; i < count; i++) {
output[i] = source[i] & 0xFF;
}
This is a stronger scalar baseline than the generic unpacking loop. Its fixed-stride memory accesses and independent iterations give C2 a regular loop shape from which it can generate efficient machine code.
The widening loop is not presented as an alternative implementation for arbitrary-width unpacking. It serves as a compiler-friendly reference: it shows how much advantage explicit vectorization retains when the scalar operation has an almost ideal structure.
With the default compiler configuration, the loop processed approximately:
5,961 values per microsecond
The explicit vector implementation was approximately 1.65 times faster. This is a meaningful gain, but substantially smaller than the 4.7-fold advantage measured against the generic scalar decoder.
Disabling SuperWord produced an unexpected result. The same widening loop then processed approximately:
7,495 values per microsecond
The scalar loop was therefore faster with SuperWord disabled. Against this faster result, the advantage of the explicit vector implementation fell to approximately 1.3 times.
Machine-code inspection reveals the code-shape difference behind this regression. Neither configuration emits SIMD widening; there is no vector load-and-widen instruction in either form. With SuperWord disabled, C2 produces a compact loop unrolled eight ways, keeping each decoded value in a general-purpose register before storing it. With SuperWord enabled, C2 produces a loop unrolled sixteen ways, with more simultaneously live scalar values and several values routed through XMM registers as temporary storage and moved back before being stored. These transfers perform no SIMD work. The additional transfers and the larger scalar loop are consistent with both the increased retired-instruction count and the lower throughput. At width 8, perfasm attributes approximately 31% of the sampled cycles in the SuperWord-enabled hot loop to these scalar GPR–XMM round trips, which are absent from the disabled configuration. The transfers are therefore not merely correlated with the regression; they account for a substantial part of its dynamic cost.
A SuperWord diagnostic trace shows that the absence of SIMD is not caused by a failure to recognize the parallel structure or by missing target support. C2 constructs a 16-lane LoadUB pack and a 16-lane StoreI pack, and both pass the implementation filter.
The packs are then removed by a structural compatibility check, not by a numerical comparison of scalar and vector costs. SuperWord requires a packed definition and its use to have compatible vector element sizes, except for a small set of recognized widening idioms.
Here, LoadUB has a one-byte vector element type, while StoreI has a four-byte element type. Because the widening is implicit in the load rather than represented by one of the recognized idioms, SuperWord rejects the load-to-store relationship.
SuperWord therefore recognizes the parallel work and constructs individually implementable packs, but cannot connect them into an accepted vector def-use chain. The result is a missed vectorization opportunity rather than a target-instruction limitation or a numerical cost-model decision.
The diagnostic trace identifies the def-use compatibility rule as the deciding factor. A positive control rules out a general limitation of the machine or the SuperWord pipeline: an element-width-preserving int addition loop is vectorized on the same host and JDK, while the byte-to-int widening loop is not. In both cases, the individual packs are implementable; only the widening def-use edge is rejected.
| SuperWord stage | homogeneous a[i] + b[i] |
widening src[i] & 0xFF |
|---|---|---|
| recognizes parallel work | yes | yes |
| builds load and store packs | yes | yes |
| packs implementable on target | yes | yes |
| def-use element sizes compatible | yes (4B = 4B) | no (1B vs 4B) |
| outcome | vectorized | scalar |
None of this implies that auto-vectorization is generally harmful. It shows only that, for this loop, host, and JDK, the SuperWord-enabled compilation remains scalar and produces the less efficient of the two observed scalar forms.
The relevant comparison is therefore not between explicit SIMD and an auto-vectorized scalar loop. It is between explicit SIMD and the best machine code HotSpot produces from a regular scalar loop. Using the fastest observed scalar configuration, explicit vectorization retains an advantage of approximately 1.3 times.
Baseline 3: Parquet Java’s specialized scalar fallback
Parquet Java provides scalar unpacking routines specialized for individual bit widths. These routines are part of the core parquet-encoding module and form the commonly available fallback when the separate optional vector module is absent or not selected.
Instead of calculating the extraction position generically for every value, each routine encodes the required loads, shifts, masks, and combinations directly for one particular bit width.
This removes some runtime calculations, but it can also expand the decoding operation into a longer sequence of scalar instructions. At wider, irregular bit widths, a value may have to be assembled from several parts of the packed input, requiring additional loads, shifts, and bitwise combinations.
Specialization and vectorization are therefore separate optimizations. Specialization fixes the extraction pattern in advance; vectorization applies operations to several values in parallel. If C2 cannot reduce the expanded extraction sequence to sufficiently efficient machine code, the result may simply be a longer scalar instruction sequence than the generic decoder executes.
That is what the measurements indicate at several wide, irregular widths on the tested system. Parquet’s specialized scalar unpacker retires approximately 8 instructions per value at smaller widths, but up to approximately 23 at wider widths. Disabling SuperWord leaves the count essentially unchanged, providing no measurable evidence that SuperWord improves these routines on this system.
At widths 17 and 25, the specialized routines therefore retire more instructions and achieve lower throughput than the generic scalar decoder:
width 17:
Parquet specialized scalar: approximately 1,277 values/µs
generic scalar: approximately 1,992 values/µs
and:
width 25:
Parquet specialized scalar: approximately 1,004 values/µs
generic scalar: approximately 1,992 values/µs
Figure 1: throughput of the generic and specialized Parquet scalar decoders. At widths 17 and 25, the specialized implementation retires more instructions and falls below the generic loop.
The result is narrow. It does not mean that Parquet is generally slow. The comparison concerns only Parquet Java’s scalar fallback. The same library also ships a separate optional vector implementation, selected only on supported platform and processor configurations; that path is not being evaluated in this baseline.
Nor does the result show that per-width specialization is inherently harmful. Its effectiveness depends on the implementation, compiler, processor, and bit width.
The supported conclusion is:
On this host, Parquet’s per-width scalar specialization did not improve performance monotonically. At some wide, irregular widths, it was slower than the generic scalar loop.
The main result
Up to a bit width of 25, the explicit vector kernel retires approximately 1.57 instructions per decoded value, with little variation across the tested widths. The reason is structural: the kernel performs the same load → rearrange → shift → mask → store sequence at every width. Only the shuffle, shift vector, and mask change.
The scalar baselines do not show the same stability. Their cost depends on the shape of the loop and on the machine code C2 derives from it.
Figure 2: instructions per decoded value. The vector kernel stays nearly flat through width 25. The generic decoder remains at approximately 15.2 instructions per value, while Parquet’s specialized scalar cost increases with the width.
The measurements represent three different compiler situations rather than points on a single speedup curve. The inspected widening configurations are scalar; for the other scalar baselines, enabling SuperWord produced no measurable benefit.
At width 8, decoding is a regular widening operation with fixed-stride memory accesses. Depending on whether the default or the fastest measured scalar configuration is used, the explicit vector kernel is approximately 1.3 to 1.65 times faster. As Baseline 2 showed, neither configuration emits SIMD widening; the difference between them is the shape of the scalar code, not the presence of vector code.
Width 16 remains byte-aligned, but its scalar cost is already substantially higher. Machine-code inspection shows that both configurations remain scalar: neither emits a SIMD widening sequence. With SuperWord enabled, C2 produces a larger unrolled body with substantially more scalar loads and GPR–XMM transfers than the disabled configuration, which uses a smaller unrolled body and only limited XMM temporary storage.
The retired-instruction count changes only slightly, from approximately 6.53 instructions per value with SuperWord disabled to 6.38 with it enabled, yet throughput is lower with SuperWord enabled. The static instruction shape identifies additional register movement and a larger loop body, but perfasm shows that this is not the dominant cost: only a small fraction of sampled cycles is attributed to GPR–XMM transfers.
Most cycles fall on the scalar reconstruction itself: the or, zero-extending load, and shift operations that assemble each value dominate. The lower throughput with SuperWord enabled is therefore not explained by the XMM round trips. The difference is small and comparatively noisy, while the dominant cost remains the scalar load–shift–combine sequence required to reconstruct each 16-bit value. Byte alignment alone does not produce a scalar baseline as lean as the width-8 widening loop, and the explicit vector kernel is approximately 2.4 to 2.65 times faster.
At irregular widths such as 13, 17, and 25, enabling SuperWord provides no measurable benefit for the relevant scalar unpacking loops. At these widths the relevant baseline is the generic scalar decoder of Baseline 1, and against it the explicit vector kernel reaches approximately 4.7 times the throughput: the same upper-end figure reported there.
The larger speedup at irregular widths does not come from the vector kernel becoming faster. Its cost remains essentially constant. What changes is the cost and compiler behavior of the scalar baseline.
Figure 3: throughput by compiler situation. The ratios are not points on one continuous curve because the relevant scalar baseline differs between aligned and irregular widths.
The inspected widening configurations are scalar; none emits SIMD widening. What changes is the shape of the scalar code:
| Width | SuperWord | SIMD widening | Observed scalar code shape | Throughput | Instructions per value |
|---|---|---|---|---|---|
| 8 | disabled | none | 8-way unrolled hot loop; decoded values kept in GPRs | 7,495 | 2.64 |
| 8 | enabled | none | 16-way unrolled hot loop; several values routed through XMM | 5,961 | 3.01 |
| 16 | disabled | none | smaller scalar unrolled body; limited GPR–XMM transfers | 4,212 | 6.53 |
| 16 | enabled | none | larger scalar unrolled body; substantially more GPR–XMM transfers | 3,800 | 6.38 |
Throughput is in values per microsecond. At width 16 the default configuration retires slightly fewer instructions per value yet delivers lower throughput, a reminder that instruction count does not predict throughput proportionally.
Profiling separates that static code shape from its dynamic cost:
| Width | SuperWord | Dominant sampled-cycle cost |
|---|---|---|
| 8 | enabled | GPR–XMM round trips: ~31% of cycles |
| 16 | enabled | scalar reconstruction (or + movzbl + shl) dominates; GPR–XMM round trips minor |
At width 8, the additional XMM traffic is a major runtime cost; at width 16, it is secondary to the scalar reconstruction, which dominates. The same static anomaly is therefore performance-dominant at one width and secondary at another, which is why static disassembly, hardware counters, and cycle profiling were used together.
The width-26 transition
The stable vector result has a clear boundary.
Up to 25 bits, each value together with its maximum intra-byte offset fits inside a 32-bit working lane. On the benchmark host, a 512-bit vector can therefore process sixteen 32-bit lanes at once:
512 bits / 32 bits = 16 lanes
At width 26, a 32-bit working window is no longer sufficient. The kernel switches to 64-bit lanes:
512 bits / 64 bits = 8 lanes
This halves the number of values processed per vector operation. The decoded values must also be narrowed from 64-bit lanes back to Java int values.
The measured instruction cost rises from approximately 1.57 to 2.88 instructions per value, while throughput falls by about 1.4 times.
The two ratios are not expected to match. The additional narrowing operation and the use of 64-bit rather than 32-bit lanes have different execution costs and may overlap differently within the processor. Instruction count identifies the change in computational shape, but it does not predict throughput proportionally.
This transition differs from the scalar behavior at irregular widths. Scalar performance depends on the machine code C2 can derive from each loop shape. The transition at width 26, by contrast, follows directly from the design of the vector kernel: the 32-bit working window is no longer sufficient, so the implementation switches to 64-bit lanes, processes fewer values per vector, and adds a narrowing step before storing the results.
The performance impact of this transition depends on the preferred vector species and the implementation path available on a given platform. Systems with narrower vectors may exhibit a different cost above 25 bits and may use another vector strategy or fall back to scalar decoding.
What the speedups mean
It would be tempting to summarize the experiment as:
The Vector API makes bit unpacking five times faster.
That statement is valid only against the generic scalar decoder, for which SuperWord provides no measurable benefit on the tested system.
Against the regular width-8 widening loop, the advantage is approximately 1.3 to 1.65 times. Against the irregular scalar unpacking loops, it rises to approximately 4.7 times.
These results are not contradictory. They compare the same vector kernel with scalar baselines that exhibit substantially different instruction counts and throughput characteristics.
A SIMD speedup is therefore meaningful only when the scalar baseline gives the JIT a realistic opportunity to optimize the loop. Otherwise, a large ratio may reflect the limitations of the comparison code as much as the strength of the vector implementation.
Optimization here does not necessarily mean auto-vectorization. In the width-8 baseline, the strongest scalar result remains entirely scalar: the default configuration does not auto-vectorize the widening loop on this system, and its lower throughput comes from a less efficient scalar code shape, not from the cost of successful SIMD generation.
Conclusion
The experiment shows that the value of explicit vectorization cannot be separated from the scalar code used to evaluate it.
Two methodological consequences follow. First, a vector implementation should be compared with the strongest relevant scalar baseline, including forms deliberately structured to expose optimization opportunities to the JIT. The strongest scalar baseline need not be auto-vectorized; a compact scalar loop may already be highly competitive. Second, scalar specialization and vectorization should be evaluated independently. Removing runtime calculations through width-specific code does not by itself guarantee fewer instructions, greater throughput, or parallel machine code.
The measurements also identify where the Vector API is most useful. It is most compelling when the scalar representation prevents HotSpot from producing sufficiently efficient machine code, not merely when auto-vectorization fails. A regular scalar loop may remain scalar and still approach the explicit vector implementation, while an irregular extraction kernel may leave much more work on the hot path. In those cases, explicit vectorization can make both the intended execution shape and its performance more stable across input forms. The widening experiment also demonstrates that enabling an auto-vectorization pass is not evidence that SIMD was emitted: the generated machine code must remain part of the evaluation.
Parquet bit unpacking is the concrete case examined here, but the same methodology applies to JVM kernels used in parsing, compression, encoding, hashing, and numerical processing.
References
- Daniel Lemire and Leonid Boytsov. “Decoding billions of integers per second through vectorization.” Software: Practice & Experience, 45(1), pp. 1–29, 2015. DOI: 10.1002/spe.2203.
- Apache Parquet. “PARQUET-2159: Parquet bit-packing de/encode optimization.” Apache Software Foundation issue tracker, https://issues.apache.org/jira/browse/PARQUET-2159. The optional vector path ships as a separate
parquet-encoding-vectormodule; the version examined here is parquet-java 1.16.0. - OpenJDK. “JEP 338: Vector API (Incubator).” Initial Vector API incubation in JDK 16. https://openjdk.org/jeps/338
- OpenJDK. “JEP 508: Vector API (Tenth Incubator).” Vector API incubation used in JDK 25. https://openjdk.org/jeps/508
- OpenJDK. “JEP 454: Foreign Function & Memory API.” Finalized in JDK 22. https://openjdk.org/jeps/454
- Daniel Lemire, Leonid Boytsov, and Nathan Kurz. “SIMD Compression and the Intersection of Sorted Integers.” Software: Practice & Experience, 46(6), pp. 723–749, 2016. DOI: 10.1002/spe.2326.
- Christian Del Monte. decode-shape: benchmark, measurement, and verification harness for the experiments in this article. https://github.com/cdelmonte-zg/decode-shape