This is the sort of thing that shows why the whole "pure RISC philosophy" of implementing only the simplest instructions is ultimately a dead end in processor design. CISC-style dedicated instructions, and hardware which utilises them, will always be more efficient than a pure software implementation using just "simple RISC instructions": hardware can directly express specialised computations like the extreme parallelism of algorithms such as SHA, while software implementations essentially rely on the instruction-scheduling mechanisms to arrange potentially hundreds or even thousands of instructions in an optimal order.
In addition, when there is dedicated hardware, it is also significantly easier to add a SHA instruction than attempt to recognise sequences of many regular instructions that could be "fused together" into a single operation using that hardware. When there isn't, adding such an instruction that gets internally expanded into multiple uops for the equivalent using the existing functional units is still beneficial, since it leaves open room for an immediate performance gain on all existing software when a future revision does add that dedicated hardware, or optimises its microarchitecture to allow those (possibly different) uops to execute faster.
Despite the name, ARM is definitely not very RISC anymore, and that's what has kept it competitive.
> CISC-style dedicated instructions and hardware which utilises them will always be more efficient than a pure software implementation using just "simple RISC instructions"
For a counterexample, see the x86 string instructions. "Hardware" implementations like `repne scasb` are routinely outperformed by software implementations using SSE2.
Another problem is that these instructions don't die. SHA will some day be replaced, but the instruction will live on. The x86 BCD instructions illustrate this.
From what I remember, RISC is less about "simple" instructions, and more about regular instructions which can execute with predictable throughput, ideally one per clock cycle. "Do this particular arithmetic" is in the RISC philosophy, while looping instructions (rep, lswi, etc.) are not.
> For a counterexample, see the x86 string instructions. "Hardware" implementations like `repne scasb` are routinely outperformed by software implementations using SSE2.
That's only because SCAS (and CMPS) has not (yet) received quite the same amount of attention as MOVS and STOS. For a counterexample to your counterexample, look up "enhanced REP MOVSB". REP MOVS/STOS has been able to operate on cacheline-sized blocks since at least the P6, where it was introduced as the "fast strings" feature, and its performance has been steadily improved over the processor generations.
> Another problem is that these instructions don't die. SHA will some day be replaced, but the instruction will live on. The x86 BCD instructions illustrate this.
Replaced for secure crypto, yes, but there are plenty of other applications like (nonmalicious) data corruption detection where a reasonably fast yet far more collision-resistant algorithm than regular CRC is very useful.
I think RISC-V is showing that RISC ISAs can still compete in this space: https://arxiv.org/pdf/1607.02318v1.pdf. RISC-V has extensibility of the ISA as a core feature, allowing for special-purpose accelerators and coprocessors as needed.
Or there is the middle ground of retaining the general-purpose nature by making the "dedicated" hardware an FPGA. Potentially the best of both worlds. Xilinx has the Zynq (ARM+FPGA), and Intel has bought Altera and is now doing CPUs with FPGAs in the package [2].
This can work for special-purpose projects, but might become problematic in general: reprogramming the FPGA can be slow (on the order of 10s of milliseconds) and you don't necessarily want to include FPGA state as part of task state in the OS. You also need to be careful about thermal limits and power draw (especially when reconfiguring). I doubt current-generation FPGAs will land in anything smaller than a server for a long time. There, you can have semi-permanent accelerators according to the server's role, and a larger thermal budget.
This article is referring specifically to ARM64 assembly which is actually quite a bit more RISC-y than previous versions of the ARM ISA. In favor of simplicity, many of the more complex ARMv7 instructions (such as the block loads/stores) were removed in ARMv8. I'd still agree that ARM is not very RISC anymore, but the ongoing move to ARM64 takes it quite a bit back in that direction.
Also in sync with the micro-coded CPUs of Burroughs, Xerox PARC, ETHZ, Genera and so on, even if they eventually lost to more general purpose CISC instructions.
"Interestingly enough, there are actually comparable Intel SHA extensions to the ARM equivalents. Linux 4.4 has added support for this but so far we have not been able to identify any CPUs that will actually run this code."
Intel Goldmont is the only microarch that implements the instructions. It was "released" in April: http://www.extremetech.com/computing/226800-intels-new-low-c...
But it is one of these annoying soft launches where the processors are not for sale anywhere, the specs are incomplete, and even ark.intel.com doesn't know about them. sigh
The original one, as far as I'm aware, was VIA's C3 and similar processors with Padlock Encryption. They came in the little Artigo computers I used in designs, which gave me, for $300, a 1GHz processor at 25W, hypervisor support, crypto acceleration (incl RSA), and a TRNG. Bad little systems years ago when I was into them.
Huh, you seem to be right. Only the multi-socket Haswells seem to have them (E7-xxxxv3 and E5-26xxv3).
And now Intel re-introduces the SHA instructions in some low-power, low-end CPUs, but not in any desktop or single-socket server CPU? What a bizarre case of feature fragmentation. Typical Intel.
I think even Intel is confused about their own extensions at this point. The E7 and E5 Xeons are Haswell systems that slightly speed up SHA-256 and 512 with the introduced BMI2 RORX instruction.
As far as I know there are no SHA extensions before Goldmont, as mrb said, with Cannonlake expected to be the first to bring them to non-low-power chips.
No, there are no algorithm changes (otherwise the results would be different), it is just taking advantage of the ARM SHA accelerations when they are available.
A change in algorithm does not necessarily lead to a change in results. Two algorithms that produce the same results may be disparate, but I suppose it's up to your definition of disparate.
> there are no algorithm changes (otherwise the results would be different)
Insertion sort, merge sort, and Timsort are three wildly different algorithms with different speeds, but on every possible input they produce the exact same result.
Strictly speaking, it's possible for different sorting algorithms to produce different results if you sort using a weak order rather than a total order.
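This weak-order caveat is easy to demonstrate in Go, where stable and unstable sorts are separate functions. Sorting only by age is a weak order (users with equal ages compare as equivalent), so only the stable variant pins down the tie order. A sketch:

```go
package main

import (
	"fmt"
	"sort"
)

type user struct {
	name string
	age  int
}

// byAge sorts a copy of in by age only -- a weak order, since users
// with equal ages compare as equivalent. stable selects whether ties
// keep their original relative order.
func byAge(in []user, stable bool) []user {
	out := append([]user(nil), in...)
	less := func(i, j int) bool { return out[i].age < out[j].age }
	if stable {
		sort.SliceStable(out, less)
	} else {
		sort.Slice(out, less)
	}
	return out
}

func main() {
	users := []user{{"bob", 30}, {"ann", 25}, {"cat", 30}, {"dan", 25}}
	// Both results are correctly sorted by age, but only the stable
	// sort guarantees ann before dan and bob before cat.
	fmt.Println(byAge(users, true))
	fmt.Println(byAge(users, false))
}
```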
Did you compare performance to SHA512? Despite being a theoretically more secure/"harder" algorithm, on 64-bit platforms it can sometimes be faster than SHA256. If you don't want to use 512 bits, a 256-bit-output variant of SHA512 is standardized as SHA512/256 and is considered valid/secure.
(I'm unclear if this performance oddity remains true with the crypto hardware extensions being used here.)
No we did not compare to SHA512 (since we don't need it).
I would expect SHA512 to be 2x faster compared to the SHA256 software version, but that is still way slower than the ARM SHA extensions accelerated version.
The oddity is that I believe sha512 uses 64-bit words while sha256 uses 32-bit words. So on 64-bit hardware sha512 will be faster (presumably because using the 32-bit registers is slower somehow, I don't know). This is why ZFS is adding support for sha512/256 as its Merkle tree hash function.
SHA512 eats data 1024 bits at a time, while SHA256 eats it 512 bits at a time. Both internally use 8 "registers", which are either 64 or 32 bits wide. Assuming you have the hardware registers to match, this would make SHA512 about twice as fast. But internally, it mixes data using 80 rounds, vs 64 for SHA256, so the speedup isn't quite 2x.
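Putting numbers on that (SHA-256: 512-bit blocks, 64 rounds; SHA-512: 1024-bit blocks, 80 rounds, per FIPS 180-4), and assuming each round costs about the same on 64-bit hardware, the back-of-the-envelope speedup works out to:

```go
package main

import "fmt"

func main() {
	// Per compression-function invocation:
	//   SHA-256: 512-bit block, 64 rounds on 32-bit words
	//   SHA-512: 1024-bit block, 80 rounds on 64-bit words
	// If each round costs roughly the same on 64-bit hardware, the
	// expected speedup is (block-size ratio) x (round-count ratio).
	speedup := (1024.0 / 512.0) * (64.0 / 80.0)
	fmt.Println(speedup) // 1.6 -- twice the data for only 25% more rounds
}
```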
That fits with my understanding. What I don't know is how this interacts with the acceleration functionality the chip has. For instance, do the instructions end up being "please do sha256 on this data", or are they closer to AES-NI where it's "please perform one round of an AES encryption flow".
Judging from the source code[1] in the author's repository, it is the latter. (what I see is: perform 96 instructions, then subtract 64 from the message length, and repeat until the message length is 0)
Intel seems to be the same way [2](2013) - i.e. looping over multiple instructions per 64 byte block.
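That outer structure (per-block round sequence inside a loop that consumes 64 bytes at a time) can be sketched in Go; `compressBlock` here is a placeholder for the accelerated round sequence, not the actual code from the repository:

```go
package main

import "fmt"

// compressBlock stands in for the accelerated round sequence seen in
// the assembly (~96 instructions per block); a real implementation
// would mix the 64-byte block into the 8-word hash state.
func compressBlock(state *[8]uint32, block []byte) {
	_ = block // placeholder for the actual SHA-256 rounds
}

// processBlocks is the outer loop described above: consume the message
// 64 bytes at a time until it is exhausted, returning the number of
// blocks compressed. (Padding of the final partial block is omitted.)
func processBlocks(state *[8]uint32, msg []byte) int {
	blocks := 0
	for len(msg) >= 64 {
		compressBlock(state, msg[:64])
		msg = msg[64:]
		blocks++
	}
	return blocks
}

func main() {
	var state [8]uint32
	fmt.Println(processBlocks(&state, make([]byte, 256))) // 4
}
```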
SHA-512 should be much faster than SHA-256 on AVX2-capable processors. Go's implementation is subpar; try OpenSSL instead. On Skylake, SHA-512 is about as fast as MD5 now -- https://bench.cr.yp.to/results-hash.html#amd64-skylake
Most of it is written to take advantage of special hardware instructions in CPUs. If you take OpenSSL (a C library), it's got many assembly code paths as well.
I'm not sure why this should be a problem, it just shows attention to performance in my opinion.
Only if the compiler is aware of and can translate code into e.g. ARM-specific sha instructions. If not, then the developer is still smarter than the compiler and should use assembly to use those specific instructions.
Is this a general thing or a Go-specific one? This generic statement seems to be false given how many optimizations that actually need assembly exist, in Go and elsewhere.
From a bird's eye view, adjust the compiler until it outputs the assembly you would have written. Hopefully this is done in such a way that such adjustments can be cross-beneficial across platforms, and can be easily created.
That's wishful thinking. GCC isn't able to do that with C: a 20-year-old compiler with a 40-year-old language. How would a compiler know that you can use a SHA2 instruction? Yes, there are intrinsics, but then what about specialized vectorized instructions which don't match any primitive types or operations in the language (like "add with carry", which you can't express in C or Go)? And how can you explain to a compiler that there is no data dependency in memory access between different buffers? Yes, it might be able to prove it by itself in some cases, but history has shown that things like autovectorization are too fragile and can't be relied upon.
Optimized C libraries like compressors or security have assembly implementations as well. I'm not sure why using assembler should be considered a signal of a defect rather than a proof of optimization. There's really nothing wrong in using assembly to write a bytes.Index version optimized for AVX2.
"That's wishful thinking. GCC isn't able to do that with C language, a 20 years old compiler with a 40 years language. How would a compiler know that you can use a SHA2 instruction?"
Well, actually, they could, it's just not worth pattern matching because it occurs so infrequently.
"but history has shown that things like autovectorization are too fragile and can't be relied upon."
Errr, I'd say the opposite. History has shown that good autovectorizing compilers can come pretty damn close to whatever you want.
Usually the only issue is that they may insert too many runtime checks, and that's easily solvable.
Can you show me an opensource library implementing performance-sensitive computations in any field (graphics, numerics, crypto, etc.) which relies on autovectorization to obtain the performance characteristics that clients rely upon?
The large majority of performance-sensitive libraries I know of rely on carefully written native code (C, C++), with a mixture of assembly and/or intrinsics. This, to me, is a failure to concretize autovectorization from textbooks into an industry-accepted solution.
> How would a compiler know that you can use a SHA2 instruction?
Intrinsics.
Many standard libraries prefer intrinsics, because there's basically zero benefit to writing this stuff in assembly and many drawbacks from an optimization perspective.
Because the Go assembler doesn't have those instructions built in most likely.
That sort of thing is quite common in Go assembler. The assemblers are a little primitive and they force the assembly language for each processor into a common pattern, so when you are writing ARM assembler for instance, everything is backwards.
I imagine that having somewhat rationalised the assemblers between processors makes life easier for the core developers, though, who have to lightly touch lots of different processors' assembly.
The Go assembler does not support all instructions, so we've created a tool (see https://github.com/minio/asm2plan9s) to assist in translating instructions into their opcode equivalents.
We're not supposed to write comments like this on "Show HN" posts. This is Apache-licensed open source code. You can criticize it on its own terms, but tendentious criticisms of the motivations of its authors are off-topic for Show HN. The rule can/should be boiled down to: "First, respect new work."
This is the plan, and it will happen eventually. Right now we are keeping it out there for community comments and any improvements in quality that might be needed before it is submitted upstream.
SHA2 just got even faster, so it illustrates rather well that you wouldn't want to use it for password storage -- of course that was always true, but the improvement drives that point home.
Even if the ARM processor on a cycle-by-cycle basis keeps up with already-existing GPU SHA2 crackers (which I do not know), it will not keep up with the parallelism in the GPUs, which is easy for anybody to get through hashcat: https://hashcat.net/wiki/
SHA2 has not gotten faster. It was already blisteringly, unbelievably fast.
If you're using any hash directly for password storage you're doing it wrong, regardless of the specifics.
PBKDF2-SHA2 with a reasonably high number of rounds (>100000) is still okay-ish, although you should have transitioned to bcrypt years ago. And hopefully we'll see argon2 stabilizing and getting adoption soon™.
It's fine to use a SHA2 variant HMAC in PBKDF2 with an obscene, ever-increasing number of rounds, but the scrypt paper [1] provides a detailed justification why you may want to switch to a dedicated password hashing function regardless. To quote their website [2]:
"We estimate that on [circa-2009] hardware, if 5 seconds are spent computing a derived key, the cost of a hardware brute-force attack against scrypt is roughly 4000 times greater than the cost of a similar attack against bcrypt (to find the same password), and 20000 times greater than a similar attack against PBKDF2."
Specifically, PBKDF2 is a tool to produce mostly-okay password storage hash from crypto primitives that are totally unfit for it, while bcrypt, scrypt, Argon2 are purpose-designed to make certain kinds of attacks difficult, like time-memory tradeoffs, parallelized custom hardware attacks, etc.
Almost right. bcrypt is not designed to be secure against ASICs; it requires a fixed circuit size. In fact, given that CPU acceleration is available for SHA2, I suspect that PBKDF2-SHA256 is now stronger than bcrypt based on the "if my server spends X seconds hashing this password, how much money will someone need to spend to crack it" metric.
I waffle about with this metric. If it's cheap to buy a server with some feature, it's obviously cheap for an attacker to buy the same. Widespread SHA2 hardware means that SHA2 cracking hardware is also available to even the poorest of crackers. I think there's a sliding scale, where the ratio changes for some attackers but not for all.
Right, sophisticated attackers have had custom SHA256 circuits for a long time; less sophisticated attackers are only gaining them now that they are present in CPUs. But if you're defending against such less sophisticated attackers, you're still not losing anything; worst case, SHA256 instructions in CPUs give them the same speedup as you're getting so their cost remains fixed.
The problem with using iterated SHA2 for passwords is that GPU/FPGA/ASIC are orders of magnitude faster than the CPU-based "normal" usage. So while increasing the number of rounds makes it more difficult for an attacker, they already have an edge by being able to use faster technologies.
Ideally you'd use something like Argon2 or any of the other finalists from the Password Hashing Competition[1].
I don't know what you mean by this. I understand that SHA2 is poor for some things, but I'm not sure how this illustrates that.
You want password hashes to be very slow, (and to use a lot of RAM) to make brute force attacks difficult. SHA2 is fast, and having hardware support makes it even faster. This makes it a bad choice for a password hash.