
Before I can be bullish on RISC-V, I really need to see a set of basic bitwise instructions -- minimally, popcount and count-leading-zeros (i.e. integer log-base-2).

When I last looked at the (tabled) bitwise extensions, they had been bloated out of all reason with really wild bit-shuffling apparatus.

I understand those are very useful to some people, but if they interfere with getting a fast popcount into the core ISA, we can't afford them.



The B extension isn't tabled, it's active again:

https://github.com/riscv/riscv-bitmanip


That is really an admirable document. I learned a lot, about far more than just RISC-V.


I would like at least popcount (or, alternatively, the "sideways add" in MMIX, which does popcount(x&~y)) and "multiple or" (another instruction in MMIX).


I think that the generalized reverse instruction could have fit just fine, because the circuit that computes it is also capable of the complete set of shifts and rotations. Not sure about that generalized zip, though.

clz/ctz are pretty useful as priority decoders that show up in a variety of applications.
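A minimal sketch of the priority-decoder use: assuming a 32-bit mask of pending request lines with bit 31 as the highest priority, clz picks out the winner in one operation. The function name is hypothetical; `__builtin_clz` is the GCC/Clang intrinsic (undefined for a zero input, hence the guard).

```c
#include <stdint.h>

/* Priority decoder via count-leading-zeros: returns the index of the
   highest set bit in 'pending', or -1 if no requests are pending.
   __builtin_clz is undefined for 0, so check first. */
int highest_priority(uint32_t pending) {
    if (pending == 0)
        return -1;                        /* no requests pending */
    return 31 - __builtin_clz(pending);   /* index of highest set bit */
}
```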

Carryless multiplication can subsume all of the CRCs.

What are the uses for popcount?


> What are the uses for popcount?

https://en.wikipedia.org/wiki/Hamming_distance#History_and_a...

EDIT: What might not have been obvious is that popcount combines well with other operations. E.g. popcount(A ^ B) gives the number of bits that differ, popcount(A & B) gives the number of bits set in both, popcount(A) gives the number of options set in a flag field, etc., etc.
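The identities above can be sketched in a few lines of C, using the GCC/Clang intrinsic `__builtin_popcount` (the function names here are just for illustration):

```c
#include <stdint.h>

/* Hamming distance: count the bit positions where a and b differ. */
uint32_t differing_bits(uint32_t a, uint32_t b) {
    return __builtin_popcount(a ^ b);
}

/* Count the bit positions set in both a and b. */
uint32_t common_set_bits(uint32_t a, uint32_t b) {
    return __builtin_popcount(a & b);
}

/* Count how many option flags are enabled in a flag field. */
uint32_t flags_set(uint32_t flags) {
    return __builtin_popcount(flags);
}
```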


popcnt is all over the place once you start doing any interesting bitwise stuff at all.


Can you be more specific?


I've used it for bitwise Hamming distance (xor + popcnt); having the hardware instruction on x86 made the entire pipeline over 6x faster.


I’ve used popcounts for full vectorization for b-bit minwise hashing, comparing bloom filters efficiently, and bitvector traversal. It’s a standard primitive in a lot of low-level code.
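The Bloom-filter case can be sketched like this: to gauge how much two equally sized filters overlap, AND their bit arrays word by word and popcount the result. The function name and layout are hypothetical; `__builtin_popcountll` is the GCC/Clang intrinsic for 64-bit words.

```c
#include <stdint.h>
#include <stddef.h>

/* Count bits set in both of two same-sized Bloom filters: one AND and
   one popcount per 64-bit word. The shared-bit count is a rough proxy
   for how much the underlying sets overlap. */
uint64_t shared_bits(const uint64_t *a, const uint64_t *b, size_t words) {
    uint64_t count = 0;
    for (size_t i = 0; i < words; i++)
        count += __builtin_popcountll(a[i] & b[i]);
    return count;
}
```

With a hardware popcount this is one instruction per word in the inner loop; without it, each word costs a dozen-odd ALU operations.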


What's the rush? The instructions can be emulated in the meantime at a passable speed.


If they could be emulated at passable speed, I would not need them.


Could you be more specific? What are you doing on a RISC-V board where emulated popcnt loses more than 10-20% of your performance?
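For reference, the usual software fallback is the classic SWAR bit-twiddling routine: pairwise field sums, then a multiply to gather the byte totals. It costs roughly a dozen ALU operations per word, versus a single cycle for a hardware POPCNT, which is where the emulation overhead comes from.

```c
#include <stdint.h>

/* Classic SWAR popcount for a 32-bit word. */
uint32_t popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);                  /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                    /* total in top byte */
}
```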


The whole point of RISC is to avoid instruction complexity for rarely used instructions. Popcount and family are most likely microcoded to expand to a loop anyway, so why add this complexity to silicon?


I checked the latency and throughput of the x86 POPCNT and LZCNT instructions on a bunch of microarchitectures [1], and can't think of any way to get that kind of performance with a combination of simpler instructions. These are occasionally very useful operations, and I really do want them to be fast.

[1] https://www.agner.org/optimize/instruction_tables.pdf


I don't know why you're being downvoted, because you are correct. It's an explicit goal of RISC-V to support complex operations by using a mixture of macro-op fusion and extensions, rather than trying to encode everything in the base architecture.

Source for macro-op fusion claim: Celio, Christopher, et al. "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V." arXiv preprint arXiv:1607.02318 (2016).

Source for extension claim: https://en.wikipedia.org/wiki/RISC-V#ISA_base_and_extensions (Note "B" extension for bit ops)


> The whole point of RISC is to avoid instruction complexity for rarely used instructions

Yeah, that was true 20-30 years ago.

> Popcount and family are most likely microcoded to expand to a loop anyway, so why add this complexity to silicon?

No, you can just make a network of small adders. You could easily build a CPU with a throughput of 1 popcount per cycle (or even better, if SIMD and/or multiple execution ports supported this ALU operation). Silicon area is not a problem anymore. Power and heat, on the other hand...
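A software model of that adder network, as a sketch: for a 64-bit input there are six levels of field-wise adds (log2(64) = 6), each level summing adjacent fields of twice the width. This is the same structure that gives the hardware circuit its logarithmic depth.

```c
#include <stdint.h>

/* Popcount as an explicit log-depth adder tree: level k adds adjacent
   (1 << k)-bit fields, so 6 levels suffice for 64 bits. */
uint64_t popcount64(uint64_t x) {
    static const uint64_t mask[6] = {
        0x5555555555555555u, 0x3333333333333333u, 0x0F0F0F0F0F0F0F0Fu,
        0x00FF00FF00FF00FFu, 0x0000FFFF0000FFFFu, 0x00000000FFFFFFFFu,
    };
    for (int level = 0; level < 6; level++)
        x = (x & mask[level]) + ((x >> (1u << level)) & mask[level]);
    return x;
}
```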


> Silicon area is not a problem anymore. Power and heat, on the other hand...

Yes, so then why are ARM chips in mobile devices approaching x86 performance with an order of magnitude less power? C̶o̶u̶l̶d̶ ̶i̶t̶ ̶b̶e̶ ̶t̶h̶a̶t̶ ̶h̶a̶v̶i̶n̶g̶ ̶a̶ ̶l̶a̶r̶g̶e̶ ̶c̶h̶u̶n̶k̶ ̶o̶f̶ ̶r̶a̶r̶e̶l̶y̶-̶u̶s̶e̶d̶ ̶s̶i̶l̶i̶c̶o̶n̶ ̶e̶n̶d̶s̶ ̶u̶p̶ ̶c̶o̶n̶s̶u̶m̶i̶n̶g̶ ̶s̶o̶m̶e̶ ̶s̶i̶g̶n̶i̶f̶i̶c̶a̶n̶t̶ ̶a̶m̶o̶u̶n̶t̶ ̶o̶f̶ ̶p̶o̶w̶e̶r̶,̶ ̶b̶e̶c̶o̶m̶i̶n̶g̶ ̶d̶o̶m̶i̶n̶a̶n̶t̶ ̶a̶t̶ ̶e̶x̶c̶e̶s̶s̶i̶v̶e̶l̶y̶ ̶l̶a̶r̶g̶e̶ ̶s̶c̶a̶l̶e̶s̶?̶


That’s not the reason, especially not in this day and age of clock gating. It’s because those ARM designs are optimized for low power, at the expense of not having headroom to take advantage of higher power envelopes if they are available.


Die area does impact latency, and by including popcount, you would be increasing gate count by a fraction of a full multiplier.


The number of gates required for popcount scales linearly with the input size. It's much, much smaller than a multiplier.


No, it would scale logarithmically if you want to minimize latency.


It cannot scale logarithmically because you need at least a linear number of gates to represent the inputs.

It doesn't have to scale superlinearly either, because the natural circuit using a linear number of adders is already of a logarithmic depth, which is best possible asymptotically.


The gate depth scales logarithmically, and that determines how many cycles it takes, or (worse) how fast your cycles can be, if you want the answer in one cycle.


A lot of complexity could be eschewed if processors weren't faster than 100 MHz.



