Performance

On the JVM, optimizing a hot kernel is not only about writing faster code; it is also about understanding how much the outcome depends on the machine code that HotSpot's C2 compiler derives from the scalar loop. Using Parquet bit-unpacking as a concrete case, the piece shows how a SIMD speedup depends on which scalar baseline C2 is handed, when explicit vectorization is actually justified, and why a more specialized scalar routine is not necessarily faster.
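
For orientation, here is a minimal sketch of the kind of generic scalar loop that could serve as a baseline in such a comparison. It is an illustration only, not the code measured in the piece: the class name `BitUnpackSketch` and the method `unpack` are hypothetical, and the sketch assumes the LSB-first packing order used by Parquet's RLE/bit-packing hybrid encoding.

```java
import java.util.Arrays;

public final class BitUnpackSketch {

    // Hypothetical helper (not from the original piece): unpacks `count` values
    // of `width` bits each (1..32) from `packed`, assuming LSB-first packing.
    public static void unpack(byte[] packed, int width, int[] out, int count) {
        long buffer = 0;        // holds bits that have been read but not yet consumed
        int bitsInBuffer = 0;   // number of valid low-order bits in `buffer`
        int pos = 0;            // next byte to read from `packed`
        int mask = (width == 32) ? -1 : (1 << width) - 1;

        for (int i = 0; i < count; i++) {
            // Refill one byte at a time until a full value is available.
            while (bitsInBuffer < width) {
                buffer |= (packed[pos++] & 0xFFL) << bitsInBuffer;
                bitsInBuffer += 8;
            }
            out[i] = (int) buffer & mask;  // take the low `width` bits
            buffer >>>= width;             // drop the consumed bits
            bitsInBuffer -= width;
        }
    }

    public static void main(String[] args) {
        // The values 0..7 at bit width 3 occupy three bytes when packed LSB-first.
        byte[] packed = {(byte) 0x88, (byte) 0xC6, (byte) 0xFA};
        int[] out = new int[8];
        unpack(packed, 3, out, 8);
        System.out.println(Arrays.toString(out)); // prints [0, 1, 2, 3, 4, 5, 6, 7]
    }
}
```

The loop is deliberately generic over the bit width. A routine specialized per width, or an explicitly vectorized one, would be measured against whatever machine code C2 produces for a loop like this, which is why the choice of baseline matters.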