
> Any media processing can generally gain significant performance with SIMD instructions

To a limit. JPEG (and many video codecs based on JPEG) has 8x8 macroblocks, which means the "easiest" SIMD parallelism is 64-way. And AVX-512, taken 8 bits at a time, is in fact 64-way SIMD.
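
The lane arithmetic is easy to check. A minimal sketch in Python; the helper name `simd_lanes` is made up for illustration, and the 16-bit and 32-bit cases are included only to show how the count shrinks with wider elements:

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one SIMD register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide register width")
    return register_bits // element_bits

BLOCK_SIDE = 8                           # 8x8 block, as in the comment
block_elems = BLOCK_SIDE * BLOCK_SIDE    # elements in one block

lanes_u8 = simd_lanes(512, 8)     # AVX-512 register, 8-bit elements
lanes_i16 = simd_lanes(512, 16)   # same register, 16-bit elements
lanes_f32 = simd_lanes(512, 32)   # same register, 32-bit floats

# One 8x8 block fills the register exactly only in the 8-bit case.
print(block_elems, lanes_u8, lanes_i16, lanes_f32)
```

With wider elements the same block needs two or four registers, so the "64-way" match is specific to 8-bit data.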

To get further parallel processing after that, you'll probably have to change the format. GPUs go up to 1024-way NVidia blocks (or AMD thread groups), which are basically SIMD units ganged together so that thread-barrier instructions can keep them in sync better. 1024 work items correspond to a 32x32 pixel working area.
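
To make the 1024 = 32x32 mapping concrete, here is a rough sketch in Python. It assumes one work item per pixel and a row-major layout; both are assumptions for illustration, and the function names are hypothetical:

```python
TILE = 32  # 32 * 32 = 1024 work items, one per pixel (assumed)

def work_item_to_pixel(item, tile=TILE):
    """Map a linear work-item index to (x, y) inside a square tile."""
    if not 0 <= item < tile * tile:
        raise IndexError("work item outside the tile")
    return item % tile, item // tile   # row-major: x varies fastest

def blocks_per_tile(tile=TILE, block=8):
    """How many block x block squares fit in one tile."""
    return (tile // block) ** 2

print(TILE * TILE)             # work items per tile
print(work_item_to_pixel(33))  # second row, second column
print(blocks_per_tile())       # 8x8 blocks covered by one 32x32 tile
```

So a single 1024-wide group would span sixteen 8x8 blocks, which is why the format itself would need to expose that much independent work.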

But that's no longer the format of JPEG. It'd have to be some future codec. Maybe modern codecs are seeing the writing on the wall and are increasing macroblock size for better parallel processing 10 years into the future (they are a surprisingly forward-looking group in general).



> Maybe modern codecs are seeing the writing on the wall and are increasing macroblock size for better parallel processing 10 years into the future

We did indeed do this for JPEG XL - the future is now :) 256x256 pixel groups are independently decodable (multi-core), each with >= 64-item (float) SIMD.
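
A rough sketch of what "independently decodable groups" buys you, in Python. This is not the JPEG XL API: `decode_group` is a placeholder that only counts pixels, and the handling of smaller edge groups is an assumption. It only illustrates splitting an image into 256x256 groups and handing each one to a worker:

```python
from concurrent.futures import ThreadPoolExecutor

GROUP = 256  # group side length, from the comment above

def group_rects(width, height, group=GROUP):
    """Yield (x0, y0, w, h) per group; edge groups may be smaller (assumed)."""
    for y0 in range(0, height, group):
        for x0 in range(0, width, group):
            yield x0, y0, min(group, width - x0), min(group, height - y0)

def decode_group(rect):
    """Placeholder for real per-group decoding; returns the pixel count."""
    _, _, w, h = rect
    return w * h

def decode_image(width, height, workers=4):
    """Process every group independently; no group needs another's result."""
    rects = list(group_rects(width, height))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(decode_group, rects))

counts = decode_image(1920, 1080)
print(len(counts), sum(counts) == 1920 * 1080)
```

Python threads will not give real multi-core speedup for CPU-bound work; the point is only that the groups share no state, so a native decoder can spread them across cores and then use SIMD inside each one.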


AVX-512 lines up with 64-byte cache lines, so it seems like it would be a huge change to go bigger.


NVidia GPUs use 32-wide warps, and AMD CDNA is 64-wide. With 32-bit lanes, that's 1024 bits and 2048 bits respectively.

Cache lines are probably 64 bytes wide to match a burst length of 8 (a 64-bit bus at burst length 8 transfers 64 bytes / 512 bits).
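
The numbers in this comment can be checked with a few lines of Python. The 32-bit lane width is an assumption (it is what makes 32 and 64 lanes come out to 1024 and 2048 bits); the bus width and burst length are taken from the comment, and the helper names are made up:

```python
def vector_bits(lanes, lane_bits=32):
    """Total width of a SIMD group, assuming 32-bit lanes by default."""
    return lanes * lane_bits

def burst_bytes(bus_bits=64, burst_length=8):
    """Bytes moved by one memory burst on a bus of the given width."""
    return bus_bits * burst_length // 8

warp_bits = vector_bits(32)        # 32-wide warp
wavefront_bits = vector_bits(64)   # 64-wide wavefront
line_bytes = burst_bytes()         # 64-bit bus, burst length 8

print(warp_bits, wavefront_bits, line_bytes, line_bytes * 8)
```

One burst therefore fills exactly one 64-byte line, the same 512 bits as an AVX-512 register.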



