Huh, it looks like that only works on 1-byte values? That’s an interesting choic...

m0zg · on Sept 9, 2019

Worse, it's a fertile ground for "interesting" bugs, because VADDV (which sum-reduces the result) reduces into an 8 bit uint. So if you e.g. accumulate two or more quadword VCNTs into a uint8x16_t and then VADDV it, you could end up with something other than the actual overall bit count (because 2 quadwords can have _256_ bits set). Same with accumulating 8 or more VADDVs, except now individual bytes could wrap around if you don't widen in between.