Before I can be bullish on RISC-V, I really need to see a set of basic bitwise instructions -- minimally, popcount and count-leading-zeros (i.e. integer log-base-2).
When I last looked at the (tabled) bitwise extensions, they had been bloated out of all reason with really wild bit-shuffling apparatus.
I understand those last are very useful to some people, but if they interfere with getting a fast popcount in the core ISA, we can't afford them.
I would like at least popcount (or, alternatively, the "sideways add" in MMIX, which does popcount(x&~y)) and "multiple or" (another instruction in MMIX).
I think that the generalized reverse instruction could have fit just fine, because the circuit that computes it is also capable of the complete set of shifts and rotations. Not sure about that generalized zip, though.
clz/ctz are pretty useful as priority decoders that show up in a variety of applications.
Carryless multiplication can subsume all of the CRC's.
EDIT: What might not have been obvious is that popcount combines with other operations well. E.g. popcount(A ^ B) gives the number of bits that differ. popcount(A & B) gives the number of bits that are the same. popcount(A) gives the number of options set in a flag field. etc., etc.
I’ve used popcounts for full vectorization for b-bit minwise hashing, comparing bloom filters efficiently, and bitvector traversal. It’s a standard primitive in a lot of low-level code.
The whole point of Risc is to avoid instruction complexity for rarely used instructions. Popcount and family are most likely microcoded to expand to a loop anyway, so why add this complexity to silicon?
I checked the latency and throughput of the SSE4 POPCNT and LZCNT instructions on a bunch of microarchitectures [1], and can't think of any way to get that kind of performance with a combination of simpler instructions. These are occasionally very useful operations, and I really do want them to be fast.
I don't know why you're being downvoted, because you are correct. It's an explicit goal of RISC-V to support complex operations by using a mixture of macro-op fusion and extensions, rather than trying to encode everything in the base architecture.
Source for macro-op fusion claim: Celio, Christopher, et al. "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V." arXiv preprint arXiv:1607.02318 (2016).
> The whole point of Risc is to avoid instruction complexity for rarely used instructions
Yeah, that was true 20-30 years ago.
> Popcount and family are most likely microcoded to expand to a loop anyway, so why add this complexity to silicon?
No, you can just make a network of small adders. You could easily build a CPU that has throughput of 1 popcount per cycle (or even better, if SIMD and/or multiple execution ports supported this ALU operation). Silicon area is not a problem anyways anymore. Power and heat on the other hand...
>Silicon area is not a problem anyways anymore. Power and heat on the other hand...
Yes, so then why are ARM chips in mobile devices approaching x86 performance with an order of magnitude less power? C̶o̶u̶l̶d̶ ̶i̶t̶ ̶b̶e̶ ̶t̶h̶a̶t̶ ̶h̶a̶v̶i̶n̶g̶ ̶a̶ ̶l̶a̶r̶g̶e̶ ̶c̶h̶u̶n̶k̶ ̶o̶f̶ ̶r̶a̶r̶e̶l̶y̶-̶u̶s̶e̶d̶ ̶s̶i̶l̶i̶c̶o̶n̶ ̶e̶n̶d̶s̶ ̶u̶p̶ ̶c̶o̶n̶s̶u̶m̶i̶n̶g̶ ̶s̶o̶m̶e̶ ̶s̶i̶g̶n̶i̶f̶i̶c̶a̶n̶t̶ ̶a̶m̶o̶u̶n̶t̶ ̶o̶f̶ ̶p̶o̶w̶e̶r̶,̶ ̶b̶e̶c̶o̶m̶i̶n̶g̶ ̶d̶o̶m̶i̶n̶a̶n̶t̶ ̶a̶t̶ ̶e̶x̶c̶e̶s̶s̶i̶v̶e̶l̶y̶ ̶l̶a̶r̶g̶e̶ ̶s̶c̶a̶l̶e̶s̶?̶
That’s not the reason, especially not in this day and age of clock gating. It’s because those ARM designs are optimized for low power, at the expense of not having headroom to take advantage of higher power envelopes if they are available.
It cannot scale logarithmically because you need at least a linear number of gates to represent the inputs.
It doesn't have to scale superlinearly either, because the natural circuit using a linear number of adders is already of a logarithmic depth, which is best possible asymptotically.
The gate depth scales logarithmically, and that determines how many cycles it takes, or (worse) how fast your cycles can be, if you want the answer in one cycle.
When I last looked at the (tabled) bitwise extensions, they had been bloated out of all reason with really wild bit-shuffling apparatus.
I understand those last are very useful to some people, but if they interfere with getting a fast popcount in the core ISA, we can't afford them.