Optimizing AES modes on Solaris for Intel Westmere
Review
AES is a strong method of symmetric (secret-key) encryption.
It is a U.S. FIPS-approved cryptographic algorithm (FIPS 197)
that operates on 16-byte blocks.
AES has been available since 2001 and is widely used.
However, AES by itself has a weakness.
AES encryption isn't usually used by itself because identical blocks
of plaintext are always encrypted into identical blocks of ciphertext.
This encryption can be easily attacked with "dictionaries" of common blocks of text and allows one to more-easily discern the content of the unknown cryptotext.
This mode of encryption is called "Electronic Code Book" (ECB), because one in theory can keep a "code book" of all known cryptotext and plaintext results to cipher and decipher AES.
In practice, a complete "code book" is not practical, even in electronic form, but large dictionaries of common plaintext blocks is still possible.
Here's a diagram of encrypting input data using AES ECB mode:
Block 1 Block 2
PlainTextInput PlainTextInput
| |
| |
\/ \/
AESKey-->(AES Encryption) AESKey-->(AES Encryption)
| |
| |
\/ \/
CipherTextOutput CipherTextOutput
Block 1 Block 2
What's the solution to the same cleartext input producing the same ciphertext output?
The solution is to further process the encrypted or decrypted text in such a way that the same text produces different output.
This usually involves an Initialization Vector (IV) and XORing the decrypted or encrypted text.
As an example, I'll illustrate CBC mode encryption:
Block 1 Block 2
PlainTextInput PlainTextInput
| |
| |
\/ \/
IV >----->(XOR) +------------->(XOR) +---> . . . .
| | | |
| | | |
\/ | \/ |
AESKey-->(AES Encryption) | AESKey-->(AES Encryption) |
| | | |
| | | |
\/ | \/ |
CipherTextOutput ------+ CipherTextOutput -------+
Block 1 Block 2
The steps for CBC encryption are:
Start with a 16-byte Initialization Vector (IV), choosen randomly.
XOR the IV with the first block of input plaintext
Encrypt the result with AES using a user-provided key.
The result is the first 16-bytes of output cryptotext.
Use the cryptotext (instead of the IV) of the previous block
to XOR with the next input block of plaintext
Another mode besides CBC is Counter Mode (CTR).
As with CBC mode, it also starts with a 16-byte IV.
However, for subsequent blocks, the IV is just incremented by one.
Also, the IV ix XORed with the AES encryption result (not the plain text input).
Here's an illustration:
Block 1 Block 2
PlainTextInput PlainTextInput
| |
| |
\/ \/
AESKey-->(AES Encryption) AESKey-->(AES Encryption)
| |
| |
\/ \/
IV >----->(XOR) IV + 1 >---->(XOR) IV + 2 ---> . . . .
| |
| |
\/ \/
CipherTextOutput CipherTextOutput
Block 1 Block 2
Optimization
Which of these modes can be parallelized?
ECB encryption/decryption can be parallelized because it does more than plain AES encryption and decryption, as mentioned above.
CBC encryption can't be parallelized because it depends on the output of the previous block.
However, CBC decryption can be parallelized because all the encrypted blocks are known at the beginning.
CTR encryption and decryption can be parallelized because the input to each block is known--it's just the IV incremented by one for each subsequent block.
So, in summary, for ECB, CBC, and CTR modes, encryption and decryption can be parallelized with the exception of CBC encryption.
How do we parallelize encryption?
By interleaving.
Usually when reading and writing data there are pipeline "stalls" (idle processor cycles) that result from waiting for memory to be loaded or stored to or from CPU registers.
Since the software is written to encrypt/decrypt the next data block where pipeline stalls usually occurs, we can avoid stalls and crypt with fewer cycles.
This software processes 4 blocks at a time, which ensures virtually no waiting ("stalling") for reading or writing data in memory.
Other Optimizations
Besides interleaving, other optimizations performed are
Loading the entire key schedule into the 128-bit %xmm registers.
This is done once for per 4-block of data (since 4 blocks of data is processed, when present). The following is loaded:
the entire "key schedule" (user input key preprocessed for encryption and decryption). This takes 11, 13, or 15 registers, for AES-128, AES-192, and AES-256, respectively
The input data is loaded into another %xmm register
The same register contains the output result after encrypting/decrypting
Using SSSE 4 instructions (AESNI).
Besides the aesenc, aesenclast, aesdec, aesdeclast, aeskeygenassist, and aesimc AESNI instructions,
Intel has several other instructions that operate on the 128-bit %xmm registers.
Some common instructions for encryption are:
pxor exclusive or (very useful),
movdqu load/store a %xmm register from/to memory,
pshufb shuffle bytes for byte swapping,
pclmulqdq carry-less multiply for GCM mode
Combining AES encryption/decryption with CBC or CTR modes processing.
Instead of loading input data twice (once for AES encryption/decryption,
and again for modes (CTR or CBC, for example) processing,
the input data is loaded once as both AES and modes operations occur at in the same function
Performance
Everyone likes pretty color charts, so here they are. I ran these on Solaris 11 running on a Piketon Platform system with a 4-core Intel Clarkdale processor @3.20GHz. Clarkdale which is part of the Westmere processor architecture family.
The "before" case is Solaris 11, unmodified.
Keep in mind that the "before" case already has been optimized with hand-coded Intel AESNI assembly.
The "after" case has combined AES-NI and mode instructions,
interleaved 4 blocks at-a-time.
«
For the first table, lower is better (milliseconds).
The first table shows the performance improvement using the Solaris encrypt(1) and decrypt(1) CLI commands. I encrypted and decrypted a 1/2 GByte file on /tmp (swap tmpfs).
Encryption improved by about 40% and decryption improved by about 80%.
AES-128 is slighty faster than AES-256, as expected.
The second table shows more detail timings for CBC, CTR, and ECB modes for the 3 AES key sizes and
different data lengths.
»
The results shown are the percentage improvement as shown by an internal PKCS#11 microbenchmark.
And keep in mind the previous baseline code already had optimized AESNI assembly!
The keysize (AES-128, 192, or 256) makes little difference in relative percentage improvement (although, of course, AES-128 is faster than AES-256).
Larger data sizes show better improvement than 128-byte data.
Availability
This software is in Solaris 11 FCS.
It is available in the 64-bit libcrypto library and the "aes" Solaris kernel module.
You must be running hardware that supports AESNI
(for example, Intel Westmere and Sandy Bridge, microprocessor architectures).
The easiest way to determine if AES-NI is available is with the isainfo(1) command. For example,
$ isainfo -v
64-bit amd64 applications
pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse
fxsr mmx cmov amd_sysc cx8 tsc fpu
32-bit i386 applications
pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse
fxsr mmx cmov sep cx8 tsc fpu
No special configuration or setup is needed to take advantage of this software.
Solaris libraries and kernel automatically determine if it's running on AESNI-capable machines
and execute the correctly-tuned software for the current microprocessor.
Summary
Maximum throughput of AES cipher modes can be achieved by combining
AES encryption with modes processing,
interleaving encryption of 4 blocks at a time,
and using Intel's wide 128-bit %xmm registers and instructions.
References
"Block cipher modes of operation", Wikipedia
Good overview of AES modes (ECB, CBC, CTR, etc.)
"Advanced Encryption Standard", Wikipedia
"Current Modes" describes NIST-approved block cipher modes
(ECB,CBC, CFB, OFB, CCM, GCM)