x64 hardware accelerated computations of SHA-256 hashes

Introduction

SHA-256 is a popular hash function used in various applications, including data integrity verification and cryptocurrency proof-of-work. This post serves two objectives: to introduce you to blockchain and proof-of-work used in cryptocurrency mining and to provide you with a library using x64 instruction set extensions to speed up SHA-256 computations for cryptocurrency mining (and similar use cases); the library has C/C++ and Python interfaces. I cover both objectives simultaneously using a walkthrough of adapting code from a library for data integrity verification (‘Intel(R) Intelligent Storage Acceleration Library Crypto Version’) into my library for cryptocurrency proof-of-work.

Background

A hash function is a function that takes an input of any size and produces an output of a fixed size (regardless of the size of the input). The output produced (referred to as a hash value or digest, or simply hash) must be sensitive to any minute changes to the input, such that different inputs have a low probability of producing the same output (referred to as a collision). Thus, the hash value can be a fix-sized “fingerprint” to identify (almost) uniquely any data of any size.

For cryptography use cases, to avoid tampering or fabrication of such a “fingerprint”, there is one additional requirement on the hash function: it must involve computationally expensive calculations such that it is mathematically improbable to find any shortcuts to bypass any parts of the calculations. In order words, no malicious person should be able to take a given “fingerprint” and recreate their desired input data to match that given “fingerprint”. In other words, the “fingerprint” must be secure and tamper-proof to be trusted. To meet this additional requirement, most hash functions, such as the SHA-256, are designed to use sequential calculations, i.e., the final hash value depends on the entire input data, and must be calculated from the beginning of the data to the end of the data; hence, parallel computations cannot be used to speed up the calculation of the hash value for a given input data. Usually, this is implemented using a state-based algorithm, where a state value is initialized at the start of the calculation. The algorithm processes the input data sequentially, a chunk or a block at a time. (The term “block” here should not be confused with the term “block” used in blockchain; to avoid any confusion, I will use the term “chunk” in the rest of this post.) Usually, each chunk is of a fixed size, e.g., 512 bytes in SHA-256. Then, the state value is mutated at each step into a new state value based on the current state value and the next chunk. The new state value is the input for the next step of the calculation, and the final state value after all the input data is processed is the hash value. Note that although the calculation cannot be accelerated for one single piece of input, independent inputs can still be calculated in parallel, e.g., computing the hash values for two separate data files. I will return to this point later, as it is the core of this post.

Hash functions have many applications. In this post, I will only look at two applications of the hash function:

Data integrity verification. The hash value is used as a “fingerprint” to verify that the data has not been tampered with. You may be familiar with such a situation when downloading files from the Internet, where a hash is listed on the download site. The hash is calculated from a commonly and publicly known secure hash function, such as the SHA-256. After downloading the file, you would calculate the hash using the file contents as input and then check that the output matches the value published on the download site. It would be practically impossible for a malicious actor to insert some form of malware into the contents without changing the hash value, i.e., alerting the data recipient.
Proof-of-work in cryptocurrency. Cryptocurrency is a form of decentralized ledger, where all transactions are published publicly. The security is ensured using blockchain and a consensus system; the two most popular systems are proof-of-work (used by BitCoin) and proof-of-stake (used by Etherium). I will not discuss proof-of-stake in this post as it is more complicated, and this post is focused on proof-of-work, which I will explain next.

Blockchain and proof-of-work

To understand blockchain and proof-of-work, it is necessary first to understand how “digital signatures” work. In a nutshell:

The system employs three components: a private key, a public key, and an encryption algorithm such as RSA.
- A private key is a constant (a few bytes) held by the sender of the information and is never revealed to anyone else.
- A public key is a constant (a few bytes) that identifies the sender of the information. It is like a bank account number and works jointly with the private key. Each public key is unique to the private key and is usually generated from the private key using the encryption algorithm (described next); however, it is mathematically impossible to calculate the private key from the public key.
- The encryption algorithm is the magical piece. For a given data to be sent, i.e., a message:
  - A signature X(M, V) is calculated from the message M and the private key V based on the algorithm; the public key for V is P(V).
  - Anyone should be able to verify (with minimal computing effort) based on P(V), M and X(M, V) that X was indeed calculated from M using V based on the algorithm. However, none of the information should allow anyone to figure out V or figure out how to calculate X(M’, V) for another message M’; this ensures no impersonation is possible.
To ensure that the message is not tampered with, the sender does not send out X(M, V) but instead sends out X(H(M), V), where H is a cryptographically secure hashing function. The recipient then calculates H(M) and, as described above, can verify that M was indeed sent by the person holding key V and that M had not been tampered with in any way, i.e., verify the identity of the sender of the message as well as verify the integrity of the message received.

Next, let us look at what is a “block” in a blockchain. Each block contains the following information:

The hash value of the previous block in the chain.
A set of transaction information, including time stamps, each digitally signed by the payor in that transaction, using their unique public key as their account ID.
An additional transaction, mining fee, paid to the “miner”.
Account ID of the “miner”, described below.
Digital signature of the “miner”, for the above items.
A number called a “nonce”, described below; this is a 32-bit integer in BitCoin.
The hash value of this block calculated based on all of the above items.

Any participant on the blockchain network can be a “miner”. Once there are enough transactions to fill a block, the system will let participants on the blockchain know items (1) to (3) above and a target criteria to meet for (7), such as a minimum value. The first miner to come up with an “answer” for (7) that meets the given target criteria will be the successful miner; anyone on the blockchain network will be able to verify (with relatively little computation) that the “answer” is valid. The block is then added to the blockchain. It will include the miner’s information (4) and their reward, the mining fee (3). There is no shortcut to finding the “answer”; the only way to find a valid answer is through brute force, i.e., trial-and-error testing over all possible nonce values (6). Hence, the chance of winning is proportional to the amount of computing power the miner invested in!

The above seems complicated, but it all boils down to this: no one should be able to record any made-up transactions on the blockchain due to the following:

It is possible to make up just any information in (3), as the information is all digitally signed by the payor.
However, the exact same transaction can be impersonated on the next or any later blocks, i.e., same payor, same recipient, same amount and same time; this is known as the “pay twice” problem.
Fortunately for everyone, for the above to work, the malicious person must win a new mining contest for the new block because the value in (1) has changed!

It is possible to win the new contest, but statistically improbable since many, many miners are competing for the new block. However, mining is usually done in “pools”, so if everyone in the world except one person pools together, it is statistically guaranteed that everyone except that person will win.

Data integrity verification vs cryptocurrency mining

With a clear picture of data integrity verification and cryptocurrency mining, we can now see how Intel’s library needs to be adapted for a cryptocurrency mining use case.

First, as I noted earlier, it is not possible to parallelize the hash calculations for a single input. The essential detail we need to know about the SHA-256 algorithm is that it operates on 32-bit words. Hence, for each input, the hash calculation needs to be calculated serially (i.e., in sequence) for a 32-bit value. The x64 instruction set contains two classes of extensions that can speed up the calculation of SHA-256:

When multiple inputs are involved, the vector instruction sets (SSE4, AVX, AVX2 and AVX512) can be used. For example, AVX2 can handle 256-bit data per computation; hence, 8 (=256/32) inputs can be processed simultaneously (to output 8 hash values, one for each input).
For only a single input, the SHA-NI extensions contain various special instructions that correspond to certain computations in SHA-256.

Of course, in addition to the above, multithreading can be used when there are multiple inputs since most modern CPUs have many cores.

Of course, in addition to the above, multithreading can be used when there are multiple inputs since most modern CPUs have many cores. The code in Intel’s library was written for single core and focused on calculating the hash values for a few large inputs; this is the typical situation in data integrity verifications. In contrast, for cryptocurrency mining, we will need to calculate many hashes for small inputs. The raw inputs are not small, but since the “nonce” (6) comes after all the steps up until (5), we can exploit the fact that the hash function is a state-based algorithm, i.e., for each nonce that we test, we do not calculate the hash value (7) starting from the beginning (1), but rather, from after item (5). Hence, I modified Intel’s library as follows:

Hence, I modified Intel’s library as follows:

In the assembly code, be able to restart from a given state value rather than right from the beginning.
- If you are interested, the key modifications to the assembly code required knowledge of function call conventions and how to use the nasm tool.
My C++ code creates multiple threads to perform multithreaded computations; the choice of extension set and the number of threads are based on the detected CPU.
- If you are interested, the key “trick” in keeping the multithreading code simple here was to set each thread to calculate a nonce of a fixed value modulo the number of threads. For example, if there are 4 threads in total, thread #1 would test nonce values of 0, 4, 8 and so forth, thread # 2 would test nonce values 1, 5, 9 and so forth, and similarly for threads #3 and #4. The only lock (“std::atomic” in C++) needed was to flag to other threads once a thread discovers a valid answer; the answer is also contained in the lock.

The code and unit tests are all available on the public repository mentioned earlier under my GitHub account. The library can be called from any Python or C/C++ code.

Conclusion

I hope this post helped you better understand blockchain and proof-of-work, and that you will find some use for my library. Unfortunately, although the speed-up is significant, my library will not help you dominate any cryptocurrency mining contest. The graph below shows how my library performs compared to various other approaches.