> Any media processing can generally gain significant performance with SIMD instructions
Up to a limit. JPEG (and the many video codecs derived from it) works on 8x8 blocks (macroblocks, loosely speaking), which means the "easiest" SIMD parallelism is 64-way. And AVX-512 with 8-bit lanes is, in fact, 64-way SIMD (512 / 8 = 64).
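To make the lane arithmetic concrete, here is a small Python sketch. The helper name `simd_lanes` is made up for illustration, and it assumes one sample per lane with no padding:

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Return how many elements of `element_bits` fit in one SIMD register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


BLOCK_SIDE = 8                             # JPEG block is 8x8 samples
block_elements = BLOCK_SIDE * BLOCK_SIDE   # samples per block

avx512_u8 = simd_lanes(512, 8)     # 8-bit lanes in a 512-bit register
avx512_f32 = simd_lanes(512, 32)   # 32-bit float lanes in a 512-bit register

# One 8x8 block of 8-bit samples fills exactly one 512-bit register.
registers_per_block_u8 = block_elements // avx512_u8
# With 32-bit floats the same block needs several registers.
registers_per_block_f32 = block_elements // avx512_f32
```

So the "64-way" match only holds at 8 bits per sample; with wider intermediate types the same block is spread over more registers.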
To get further parallel processing after that, you'll probably have to change the format. GPUs go up to 1024-way NVIDIA thread blocks (or AMD thread groups), which are basically SIMD units ganged together so that thread-barrier instructions can keep them in sync. 1024 work-items correspond to a 32x32 pixel working area.
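As a sketch of the 1024-to-32x32 correspondence, assuming one work-item per pixel and a row-major layout (the function name `work_item_to_pixel` is hypothetical, not any GPU API):

```python
TILE_SIDE = 32                           # 32x32 pixel working area
WORK_ITEMS = TILE_SIDE * TILE_SIDE       # 1024 work-items per block


def work_item_to_pixel(index: int, tile_side: int = TILE_SIDE) -> tuple[int, int]:
    """Map a linear work-item index to (x, y) inside the tile, row-major."""
    if not 0 <= index < tile_side * tile_side:
        raise ValueError("work-item index outside the tile")
    y, x = divmod(index, tile_side)
    return x, y


# Every work-item gets a distinct pixel, and every pixel is covered.
covered = {work_item_to_pixel(i) for i in range(WORK_ITEMS)}
```

Real kernels usually get the 2D index straight from the runtime, so this only shows that the counts line up.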
But that's no longer the format of JPEG. It'd have to be some future codec. Maybe modern codecs are seeing the writing on the wall and are increasing macroblock size for better parallel processing 10 years into the future (they are a surprisingly forward-looking group in general).
> Maybe modern codecs are seeing the writing on the wall and are increasing macroblock size for better parallel processing 10 years into the future
We did indeed do this for JPEG XL - the future is now :) 256x256 pixel groups are independently decodable (multi-core), each with >= 64-item (float) SIMD.
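A rough illustration of how an image splits into 256x256 groups that could each be handed to a separate core. This is only a sketch: it assumes a plain row-major grid with partial groups at the right and bottom edges, and `group_rects` is a made-up helper, not part of any JPEG XL library:

```python
GROUP_SIDE = 256  # group size in pixels, per the comment above


def group_rects(width: int, height: int, side: int = GROUP_SIDE):
    """Return (x, y, w, h) rectangles tiling a width x height image."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    rects = []
    for y in range(0, height, side):
        for x in range(0, width, side):
            # Edge groups are clipped to the image bounds.
            rects.append((x, y, min(side, width - x), min(side, height - y)))
    return rects


# Hypothetical 1000x600 image: 4 columns x 3 rows of groups.
rects = group_rects(1000, 600)
total_pixels = sum(w * h for _, _, w, h in rects)
```

Because the rectangles do not overlap, each one could be decoded by its own worker, with SIMD applied inside each group.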