This is the sort of thing that shows why the whole "pure RISC philosophy" of implementing only the simplest instructions is ultimately a dead end in processor design. CISC-style dedicated instructions, and hardware which utilises them, will always be more efficient than a pure software implementation using just "simple RISC instructions": hardware can directly express specialised computations like the extreme parallelism of algorithms such as SHA, while software implementations essentially rely on the instruction-scheduling mechanisms to arrange potentially hundreds or even thousands of instructions in an optimal order.
In addition, when there is dedicated hardware, it is also significantly easier to add a SHA instruction than attempt to recognise sequences of many regular instructions that could be "fused together" into a single operation using that hardware. When there isn't, adding such an instruction that gets internally expanded into multiple uops for the equivalent using the existing functional units is still beneficial, since it leaves open room for an immediate performance gain on all existing software when a future revision does add that dedicated hardware, or optimises its microarchitecture to allow those (possibly different) uops to execute faster.
Despite the name, ARM is definitely not very RISC anymore, and that's what has kept it competitive.
> CISC-style dedicated instructions and hardware which utilises them will always be more efficient than a pure software implementation using just "simple RISC instructions"
For a counterexample, see the x86 string instructions. "Hardware" implementations like `repne scasb` are routinely outperformed by software implementations using SSE2.
Another problem is that these instructions don't die. SHA will some day be replaced, but the instruction will live on. The x86 BCD instructions illustrate this.
From what I remember, RISC is less about "simple" instructions, and more about regular instructions which can execute with predictable throughput, ideally one per clock cycle. "Do this particular arithmetic" is in the RISC philosophy, while looping instructions (rep, lswi, etc.) are not.
> For a counterexample, see the x86 string instructions. "Hardware" implementations like `repne scasb` are routinely outperformed by software implementations using SSE2.
That's only because SCAS (and CMPS) has not (yet) received quite the same amount of attention as MOVS and STOS. For a counterexample to your counterexample, look up "enhanced REP MOVSB". REP MOVS/STOS has been able to operate on cacheline-sized blocks since at least the P6, where it was introduced as the "fast strings" feature, and its performance has been steadily improved over the processor generations.
> Another problem is that these instructions don't die. SHA will some day be replaced, but the instruction will live on. The x86 BCD instructions illustrate this.
Replaced for secure crypto, yes, but there are plenty of other applications like (nonmalicious) data corruption detection where a reasonably fast yet far more collision-resistant algorithm than regular CRC is very useful.
I think RISC-V is showing that RISC ISAs can still compete in this space: https://arxiv.org/pdf/1607.02318v1.pdf. RISC-V has extensibility of the ISA as a core feature, allowing for special-purpose accelerators and coprocessors as needed.
Or there is the middle ground of retaining the general-purpose nature by making the "dedicated" hardware an FPGA. Potentially the best of both worlds. Xilinx has the Zynq (ARM+FPGA), and Intel has bought Altera and is now doing CPUs with FPGAs in the package [2].
This can work for special-purpose projects, but might become problematic in general: reprogramming the FPGA can be slow (on the order of 10s of milliseconds) and you don't necessarily want to include FPGA state as part of task state in the OS. You also need to be careful about thermal limits and power draw (especially when reconfiguring). I doubt current-generation FPGAs will land in anything smaller than a server for a long time. There, you can have semi-permanent accelerators according to the server's role, and a larger thermal budget.
This article is referring specifically to ARM64 assembly which is actually quite a bit more RISC-y than previous versions of the ARM ISA. In favor of simplicity, many of the more complex ARMv7 instructions (such as the block loads/stores) were removed in ARMv8. I'd still agree that ARM is not very RISC anymore, but the ongoing move to ARM64 takes it quite a bit back in that direction.
Also in sync with the micro-coded CPUs of Burroughs, Xerox PARC, ETHZ, Genera and so on, even if they eventually lost to more general purpose CISC instructions.
"Interestingly enough, there are actually comparable Intel SHA extensions to the ARM equivalents. Linux 4.4 has added support for this but so far we have not been able to identify any CPUs that will actually run this code."
Intel Goldmont is the only microarch that implements the instructions. It was "released" in April: http://www.extremetech.com/computing/226800-intels-new-low-c...
But it is one of these annoying soft launches where the processors are not for sale anywhere, the specs are incomplete, and even ark.intel.com doesn't know about them. sigh
The original one, as far as I'm aware, was VIA's C3 and similar processors with Padlock Encryption. They came in the little Artigo computers I used in designs, which gave me, for $300, a 1GHz processor at 25W, hypervisor support, crypto acceleration (incl RSA), and a TRNG. Bad little systems years ago when I was into them.
Huh, you seem to be right. Only the multi-socket Haswells seem to have them (E7-xxxxv3 and E5-26xxv3).
And now Intel re-introduces the SHA instructions in some low-power, low-end CPUs, but not in any desktop or single-socket server CPU? What a bizarre case of feature fragmentation. Typical Intel.
I think even Intel is confused about their own extensions at this point. The E7 and E5 Xeons are Haswell systems that slightly speed up SHA-256 and 512 with the introduced BMI2 RORX instruction.
As far as I know there are no SHA extensions before Goldmont, as mrb said, with Cannonlake expected to be the first to bring them to non-low-power chips.
No, there are no algorithm changes (otherwise the results would be different), it is just taking advantage of the ARM SHA accelerations when they are available.
A change in algorithm does not necessarily lead to a change in results. Two algorithms that produce the same results may be disparate, but I suppose it's up to your definition of disparate.
> there are no algorithm changes (otherwise the results would be different)
Insertion sort, merge sort, and Timsort are three wildly different algorithms with different speeds, but on every possible input they produce the exact same result.
Strictly speaking, it's possible for different sorting algorithms to produce different results if you sort using a weak order rather than a total order.
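This weak-order caveat is easy to demonstrate in Go, where stable and unstable sorts are separate functions. Sorting only by age is a weak order (users with equal ages compare as equivalent), so only the stable variant pins down the tie order. A sketch:

```go
package main

import (
	"fmt"
	"sort"
)

type user struct {
	name string
	age  int
}

// byAge sorts a copy of in by age only -- a weak order, since users
// with equal ages compare as equivalent. stable selects whether ties
// keep their original relative order.
func byAge(in []user, stable bool) []user {
	out := append([]user(nil), in...)
	less := func(i, j int) bool { return out[i].age < out[j].age }
	if stable {
		sort.SliceStable(out, less)
	} else {
		sort.Slice(out, less)
	}
	return out
}

func main() {
	users := []user{{"bob", 30}, {"ann", 25}, {"cat", 30}, {"dan", 25}}
	// Both results are correctly sorted by age, but only the stable
	// sort guarantees ann before dan and bob before cat.
	fmt.Println(byAge(users, true))
	fmt.Println(byAge(users, false))
}
```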
Did you compare performance to SHA512? Despite being a theoretically more secure/"harder" algorithm, on 64-bit platforms it can sometimes be faster than SHA256. If you don't want to use 512 bits, a 256-bit-output variant of SHA512 is standardized as SHA512/256 and is considered valid/secure.
(I'm unclear if this performance oddity remains true with the crypto hardware extensions being used here.)
No we did not compare to SHA512 (since we don't need it).
I would expect SHA512 to be 2x faster compared to the SHA256 software version, but that is still way slower than the ARM SHA extensions accelerated version.
The oddity is that I believe sha512 uses 64-bit words while sha256 uses 32-bit words. So on 64-bit hardware sha512 will be faster (presumably because using the 32-bit registers is slower somehow, I don't know). This is why ZFS is adding support for sha512/256 as its Merkle tree hash function.
SHA512 eats data 1024 bits at a time, while SHA256 eats it 512 bits at a time. Both internally use 8 "registers", which are either 64 or 32 bits wide. Assuming you have the hardware registers to match, this would make SHA512 about twice as fast. But internally, it mixes data using 80 rounds, vs 64 for SHA256, so the speedup isn't quite 2x.
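Putting numbers on that (SHA-256: 512-bit blocks, 64 rounds; SHA-512: 1024-bit blocks, 80 rounds, per FIPS 180-4), and assuming each round costs about the same on 64-bit hardware, the back-of-the-envelope speedup works out to:

```go
package main

import "fmt"

func main() {
	// Per compression-function invocation:
	//   SHA-256: 512-bit block, 64 rounds on 32-bit words
	//   SHA-512: 1024-bit block, 80 rounds on 64-bit words
	// If each round costs roughly the same on 64-bit hardware, the
	// expected speedup is (block-size ratio) x (round-count ratio).
	speedup := (1024.0 / 512.0) * (64.0 / 80.0)
	fmt.Println(speedup) // 1.6 -- twice the data for only 25% more rounds
}
```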
That fits with my understanding. What I don't know is how this interacts with the acceleration functionality the chip has. For instance, do the instructions end up being "please do sha256 on this data", or are they closer to AES-NI where it's "please perform one round of an AES encryption flow".
Judging from the source code[1] in the author's repository, it is the latter. (what I see is: perform 96 instructions, then subtract 64 from the message length, and repeat until the message length is 0)
Intel seems to be the same way [2](2013) - i.e. looping over multiple instructions per 64 byte block.
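That outer structure (per-block round sequence inside a loop that consumes 64 bytes at a time) can be sketched in Go; `compressBlock` here is a placeholder for the accelerated round sequence, not the actual code from the repository:

```go
package main

import "fmt"

// compressBlock stands in for the accelerated round sequence seen in
// the assembly (~96 instructions per block); a real implementation
// would mix the 64-byte block into the 8-word hash state.
func compressBlock(state *[8]uint32, block []byte) {
	_ = block // placeholder for the actual SHA-256 rounds
}

// processBlocks is the outer loop described above: consume the message
// 64 bytes at a time until it is exhausted, returning the number of
// blocks compressed. (Padding of the final partial block is omitted.)
func processBlocks(state *[8]uint32, msg []byte) int {
	blocks := 0
	for len(msg) >= 64 {
		compressBlock(state, msg[:64])
		msg = msg[64:]
		blocks++
	}
	return blocks
}

func main() {
	var state [8]uint32
	fmt.Println(processBlocks(&state, make([]byte, 256))) // 4
}
```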
SHA-512 should be much faster than SHA-256 on AVX2-capable processors. Go's implementation is subpar; try OpenSSL instead. On Skylake, SHA-512 is about as fast as MD5 now -- https://bench.cr.yp.to/results-hash.html#amd64-skylake
Most of it is written to take advantage of special hardware instructions in CPUs. If you take OpenSSL (a C library), it's got many assembly code paths as well.
I'm not sure why this should be a problem, it just shows attention to performance in my opinion.
Only if the compiler is aware of and can translate code into e.g. ARM-specific sha instructions. If not, then the developer is still smarter than the compiler and should use assembly to use those specific instructions.
Is this a general thing or a Go-specific one? This generic statement seems to be false given how many optimizations that actually need assembly exist, in Go and elsewhere.
From a bird's eye view, adjust the compiler until it outputs the assembly you would have written. Hopefully this is done in such a way that such adjustments can be cross-beneficial across platforms, and can be easily created.
That's wishful thinking. GCC isn't able to do that with C: a 20-year-old compiler with a 40-year-old language. How would a compiler know that you can use a SHA2 instruction? Yes, there are intrinsics, but then what about specialized vectorized instructions which don't match any primitive types or operations in the language (like "add with carry", which you can't express in C or Go)? And how can you explain to a compiler that there is no data dependency in memory access between different buffers? Yes, it might be able to prove it by itself in some cases, but history has shown that things like autovectorization are too fragile and can't be relied upon.
Optimized C libraries like compressors or security have assembly implementations as well. I'm not sure why using assembler should be considered a signal of a defect rather than a proof of optimization. There's really nothing wrong in using assembly to write a bytes.Index version optimized for AVX2.
"That's wishful thinking. GCC isn't able to do that with C language, a 20 years old compiler with a 40 years language. How would a compiler know that you can use a SHA2 instruction?"
Well, actually, they could, it's just not worth pattern matching because it occurs so infrequently.
"but history has shown that things like autovectorization are too fragile and can't be relied upon."
Errr, I'd say the opposite. History has shown that good autovectorizing compilers can come pretty damn close to whatever you want.
Usually the only issue is that they may insert too many runtime checks, and that's easily solvable.
Can you show me an opensource library implementing performance-sensitive computations in any field (graphics, numerics, crypto, etc.) which relies on autovectorization to obtain the performance characteristics that clients rely upon?
The large majority of performance-sensitive libraries I know of rely on carefully written native code (C, C++), with a mixture of assembly and/or intrinsics. This, to me, is a failure to concretize autovectorization from textbooks into an industry-accepted solution.
> How would a compiler know that you can use a SHA2 instruction?
Intrinsics.
Many standard libraries prefer intrinsics, because there's basically zero benefit to writing this stuff in assembly and many drawbacks from an optimization perspective.
Because the Go assembler doesn't have those instructions built in most likely.
That sort of thing is quite common in Go assembler. The assemblers are a little primitive and they force the assembly language for each processor into a common pattern, so when you are writing ARM assembler for instance, everything is backwards.
I imagine that having somewhat rationalised the assemblers between processors makes life easier for the core developers, though, who have to lightly touch lots of different processors' assembly.
The Go assembler does not support all instructions, so we've created a tool (see https://github.com/minio/asm2plan9s) to assist in translating instructions into their opcode equivalents.
We're not supposed to write comments like this on "Show HN" posts. This is Apache-licensed open source code. You can criticize it on its own terms, but tendentious criticisms of the motivations of its authors are off-topic for Show HN. The rule can/should be boiled down to: "First, respect new work."
This is the plan, and it will happen eventually. Right now we are keeping it out there for community comments and any improvements in quality that might be needed before it is submitted upstream.
SHA2 just got even faster, so it illustrates rather well that you wouldn't want to use it for password storage -- of course that was always true, but the improvement drives that point home.
Even if the ARM processor on a cycle-by-cycle basis keeps up with already-existing GPU SHA2 crackers (which I do not know), it will not keep up with the parallelism in the GPUs, which is easy for anybody to get through hashcat: https://hashcat.net/wiki/
SHA2 has not gotten faster. It was already blisteringly, unbelievably fast.
If you're using any hash directly for password storage you're doing it wrong, regardless of the specifics.
PBKDF2-SHA2 with a reasonably high number of rounds (>100000) is still okay-ish, although you should have transitioned to bcrypt years ago. And hopefully we'll see argon2 stabilizing and getting adoption soon™.
It's fine to use a SHA2 variant HMAC in PBKDF2 with an obscene, ever-increasing number of rounds, but the scrypt paper [1] provides a detailed justification why you may want to switch to a dedicated password hashing function regardless. To quote their website [2]:
"We estimate that on [circa-2009] hardware, if 5 seconds are spent computing a derived key, the cost of a hardware brute-force attack against scrypt is roughly 4000 times greater than the cost of a similar attack against bcrypt (to find the same password), and 20000 times greater than a similar attack against PBKDF2."
Specifically, PBKDF2 is a tool to produce mostly-okay password storage hash from crypto primitives that are totally unfit for it, while bcrypt, scrypt, Argon2 are purpose-designed to make certain kinds of attacks difficult, like time-memory tradeoffs, parallelized custom hardware attacks, etc.
Almost right. bcrypt is not designed to be secure against ASICs; it requires a fixed circuit size. In fact, given that CPU acceleration is available for SHA2, I suspect that PBKDF2-SHA256 is now stronger than bcrypt based on the "if my server spends X seconds hashing this password, how much money will someone need to spend to crack it" metric.
I waffle about with this metric. If it's cheap to buy a server with some feature, it's obviously cheap for an attacker to buy the same. Widespread SHA2 hardware means that SHA2 cracking hardware is also available to even the poorest of crackers. I think there's a sliding scale, where the ratio changes for some attackers but not for all.
Right, sophisticated attackers have had custom SHA256 circuits for a long time; less sophisticated attackers are only gaining them now that they are present in CPUs. But if you're defending against such less sophisticated attackers, you're still not losing anything; worst case, SHA256 instructions in CPUs give them the same speedup as you're getting so their cost remains fixed.
The problem with using iterated SHA2 for passwords is that GPU/FPGA/ASIC are orders of magnitude faster than the CPU-based "normal" usage. So while increasing the number of rounds makes it more difficult for an attacker, they already have an edge by being able to use faster technologies.
Ideally you'd use something like Argon2 or any of the other finalists from the Password Hashing Competition[1].
I don't know what you mean by this. I understand that SHA2 is poor for some things, but I'm not sure how this illustrates that.
You want password hashes to be very slow, (and to use a lot of RAM) to make brute force attacks difficult. SHA2 is fast, and having hardware support makes it even faster. This makes it a bad choice for a password hash.