Reading OpenSSL's code
The first UFC was an interesting contest: although seasoned athletes, the participants had no way to prepare for this first-of-a-kind event. Digging into a new code base often feels like this: experience helps, but exploration is ad hoc and, to quote the UFC 1 tagline: “There are no rules!”
So how do you go about it?
In this article, I will show how I went about replacing OpenSSL's base64 encoding kernel with simdutf's: my strategy for reading the code, the improvements I identified, and the changes I made. I will also explain why I made these decisions, the conscious tradeoffs involved, and what can still be improved. But first things first…
Finding a hook
The first job is to find a point of entry, after which we try to follow the program flow.
A good tip is to use the CLI to figure out what does what. But here, the point of entry had already been scoped out. I took a look at the function signature I was to replace:
int evp_encodeblock_int(EVP_ENCODE_CTX *ctx, unsigned char *t,
const unsigned char *f, int dlen, int *wrap_cnt);
(The bolded code highlights a change I made: I added an argument that I’ll get to later.)
Throughout , I would ask myself what impact my changes would have on the rest of the codebase.
The function’s name was all lower case, which meant it was not a public-facing function. I double-checked via the search feature in Visual Studio Code and confirmed that it was only used within the encode.c file. Good news: that meant I could change it without affecting anything else.
This function was very close to its simdutf counterpart, but replacing only this did not budge the benchmark one iota.
Take a look at the data structures in use.
A programmer who cares about fast code at all will always put extra thought into where and how memory is accessed. As such, data structures are an excellent starting point into a new codebase.
Here is the struct that was the first argument:
struct evp_Encode_Ctx_st {
    /* number saved in a partial encode/decode */
    int num;
    /*
     * The length is either the output line length (in input bytes) or the
     * shortest input line length that is ok. Once decoding begins, the
     * length is adjusted up each time a longer line is decoded
     */
    int length;
    /* data to encode */
    unsigned char enc_data[80];
    /* number read on current line */
    int line_num;
    unsigned int flags;
};

So even without looking at the code or the next section of this article, it hints at the problem already.
The buffer was small (80 bytes maximum), so the AVX2 routine would never fire under any circumstances (it loaded four 32-byte SIMD vectors at a time, i.e. 128 bytes per iteration). I had to get rid of this.
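To make that concrete, here is a small standalone sketch (my own illustration, not OpenSSL code) of why an 80-byte staging buffer starves a kernel that consumes four 32-byte vectors per iteration:

```c
#include <stddef.h>

/* One AVX2 iteration in a simdutf-style kernel consumes four 32-byte
 * vectors, i.e. 128 input bytes. (Illustrative model, not the real code.) */
#define SIMD_CHUNK (4 * 32)

/* Bytes the vectorized loop can process for a given input length;
 * anything below one full chunk falls through to the scalar path. */
static size_t vectorizable_bytes(size_t len)
{
    return (len / SIMD_CHUNK) * SIMD_CHUNK;
}

/* vectorizable_bytes(80)   -> 0    : the AVX2 path never fires
 * vectorizable_bytes(1024) -> 1024 : a larger buffer keeps it busy */
```

With an 80-byte cap, the answer is always zero: the vectorized path is dead code in practice.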
You can have any length you want, as long as it’s 256 bits.
Next, I took a look at the EVP_EncodeUpdate function in the encode.c file, which was part of the API and originally took care of newline insertion.
int EVP_EncodeUpdate(EVP_ENCODE_CTX *ctx, unsigned char *out, int *outl,
                     const unsigned char *in, int inl)
{
    int i, j;
    size_t total = 0;

    *outl = 0;
    if (inl <= 0)
        return 0;
    OPENSSL_assert(ctx->length <= (int)sizeof(ctx->enc_data));
    if (ctx->length - ctx->num > inl) {
        memcpy(&(ctx->enc_data[ctx->num]), in, inl);
        ctx->num += inl;
        return 1;
    }
    if (ctx->num != 0) {
        i = ctx->length - ctx->num;
        memcpy(&(ctx->enc_data[ctx->num]), in, i);
        in += i;
        inl -= i;
        j = evp_encodeblock_int(ctx, out, ctx->enc_data, ctx->length);
        ctx->num = 0;
        out += j;
        total = j;
        if ((ctx->flags & EVP_ENCODE_CTX_NO_NEWLINES) == 0) {
            *(out++) = '\n';
            total++;
        }
        *out = '\0';
    }
    while (inl >= ctx->length && total <= INT_MAX) {
        j = evp_encodeblock_int(ctx, out, in, ctx->length);
        in += ctx->length;
        inl -= ctx->length;
        out += j;
        total += j;
        if ((ctx->flags & EVP_ENCODE_CTX_NO_NEWLINES) == 0) {
            *(out++) = '\n';
            total++;
        }
        *out = '\0';
    }
    if (total > INT_MAX) {
        /* Too much output data! */
        *outl = 0;
        return 0;
    }
    if (inl != 0)
        memcpy(&(ctx->enc_data[0]), in, inl);
    ctx->num = inl;
    *outl = (int)total;
    return 1;
}
In a nutshell, it first encodes whatever is in the staging buffer before encoding the input proper. The buffer in this case is likely not present for performance reasons, but only to support newline insertion.
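That staging logic can be distilled into a standalone sketch (the names and the CTX_LEN constant are mine, not the real API): bytes accumulate until a full line’s worth of input is available, and only whole lines get encoded:

```c
#include <string.h>

/* Distilled staging pattern (my sketch, not the real EVP code):
 * CTX_LEN plays the role of ctx->length, stage_t of the enc_data/num pair. */
#define CTX_LEN 48              /* 48 input bytes -> one 64-char base64 line */

typedef struct {
    unsigned char buf[CTX_LEN]; /* partial line carried between calls */
    int num;                    /* bytes currently staged */
} stage_t;

/* Consume `inl` bytes; return how many full lines became encodable. */
static int stage_update(stage_t *s, const unsigned char *in, int inl)
{
    int lines = 0;

    if (s->num + inl < CTX_LEN) {       /* not enough for a line: stage it */
        memcpy(s->buf + s->num, in, inl);
        s->num += inl;
        return 0;
    }
    if (s->num != 0) {                  /* top up and flush the staged line */
        int take = CTX_LEN - s->num;

        memcpy(s->buf + s->num, in, take);
        in += take;
        inl -= take;
        s->num = 0;
        lines++;
    }
    lines += inl / CTX_LEN;             /* whole lines straight from input */
    s->num = inl % CTX_LEN;             /* remainder waits for the next call */
    if (s->num != 0)
        memcpy(s->buf, in + inl - s->num, s->num);
    return lines;
}
```

The control flow mirrors the real function: stage-if-short, flush the carried partial line, stream the bulk, stage the tail.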
I can now explain why I decided to add a wrap_cnt counter.
Here, by default, the newline insertion happened every 64 bytes, but if you were a developer willing, for whatever reason, to go beyond the API and dig into OpenSSL’s internals, you could also find a way to insert newlines at arbitrary positions (up to intervals of around 115 bytes).
Here’s the problem: SIMD vectors come only in fixed lengths.
So when you insert newlines, they will not align perfectly with SIMD vectors, and this counter was there to track exactly where the newlines were. For example, an AVX2 vector is 32 bytes, and we loaded 4 vectors at a time per loop iteration. If the newline was inserted at 72 bytes, that meant that there would be only one newline inserted at the third vector, but on the next iteration said newline would be inserted at the second vector.
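A small model (my illustration, not the actual kernel) shows the drift that wrap_cnt has to track; with a 72-byte wrap and 128-byte iterations, the in-chunk newline offset changes every iteration:

```c
/* Model of the newline bookkeeping (illustrative, not the real kernel).
 * `wrap_cnt` carries the number of payload bytes emitted since the last
 * newline across kernel invocations, which is what the added argument does. */
static int chunk_newlines(int *wrap_cnt, int chunk, int wrap,
                          int offs[], int max_offs)
{
    int n = 0;

    for (int off = 0; off < chunk; off++) {
        if (++*wrap_cnt == wrap) {      /* a newline belongs after this byte */
            if (n < max_offs)
                offs[n] = off;          /* off / 32 gives the AVX2 vector */
            n++;
            *wrap_cnt = 0;
        }
    }
    return n;
}

/* With wrap = 72 and chunk = 128: iteration 0 places a newline at offset 71,
 * iteration 1 at offsets 15 and 87, iteration 2 at 31 and 103 -- the
 * position never stabilizes, hence the carried counter. */
```

Without carrying the counter between calls, each invocation would restart the count and place newlines at the wrong offsets.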
Possible improvements
From there, as is often the case with SIMD, there is a bit of ambiguity as to what to do at the edges. For example:
If a newline is inserted in the middle of a SIMD vector, it is a pain but it is at least clear-cut: the responsibility falls straight on the shoulders of the SIMD code.
However, what do I do with a newline that falls right at the very end of a SIMD vector? Who gets the responsibility of inserting it there? Is it the SIMD kernel, or is it the scalar function that does the cleanup? And how do I make sure that there’s no confusion between those two parts?
In this case, I chose to free EVP_EncodeUpdate from the burden of newline insertion and (mostly) hand it to our new AVX2 and scalar functions instead.
The main reason is mostly aesthetics. It’s just not immediately obvious what the codebase does by casually looking at it. Even without my modifications, I constantly needed to shuffle between different files and definitions to figure out how things worked. The bulk of the newline insertion was done in the AVX2 kernel anyway, and having just added two extra files, it was better to leave the logic closer to its source if possible.
Now, could it be refactored to be even simpler? Most likely yes. In the process of writing this report, I noticed that EVP_EncodeUpdate still dealt with the final newline when the insertion was not at a multiple of four bytes, which was a bit inconsistent and odd.
I made a little bit of a mistake here:
I didn’t have a grand plan on how it was going to look in the end.
In total there were 7 branches that dealt with newline insertion in the AVX2 kernel; in the scalar one, there were two. Some of them did not handle the ending newlines in exactly the same way, hence the exception. Because development was very ad hoc and I hacked around a lot, I realized the logic could be a little more unified.
It wasn’t that dramatic (it did, after all, work), but I noted it as something I might have to improve, if only for consistency and aesthetic reasons.
Tackling the BIO abstraction
Since the data was going through the BIO system, I also had to see if I could make cuts there.
As usual, it is useful to start with the data structure in the bio_b64.c file:
typedef struct b64_struct {
    /*
     * BIO *bio; moved to the BIO structure
     */
    int buf_len;
    int buf_off;
    int tmp_len;                /* used to find the start when decoding */
    int tmp_nl;                 /* If true, scan until '\n' */
    int encode;
    int start;                  /* have we started decoding yet? */
    int cont;                   /* <= 0 when finished */
    EVP_ENCODE_CTX *base64;
    unsigned char buf[EVP_ENCODE_LENGTH(B64_BLOCK_SIZE) + 10];
    unsigned char tmp[B64_BLOCK_SIZE];
    unsigned char *encoded_buf;
    size_t encoded_buf_len;
} BIO_B64_CTX;
So just from a glance, we can more or less take an educated guess as to what the code will be doing. We can also tell there will likely be a problem:
B64_BLOCK_SIZE is only 1024 bytes. That is better than the situation above, and a few SIMD vectors do fit, but benchmarks revealed that the constant context switching from SIMD to scalar and back was killing the performance.
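A back-of-the-envelope model (mine, not a measurement) of why the 1024-byte cap hurts: each chunk is a separate trip into the encode kernel, and the fixed entry/exit cost is paid on every trip:

```c
#include <stddef.h>

/* How many times the encode kernel is entered for `total` input bytes
 * when the BIO layer hands it at most `chunk` bytes at a time.
 * Each entry pays the SIMD setup/teardown and scalar-tail cost once.
 * (Illustrative model, not OpenSSL code.) */
static size_t kernel_invocations(size_t total, size_t chunk)
{
    return (total + chunk - 1) / chunk;
}

/* Encoding 1 MiB through 1024-byte chunks enters the kernel 1024 times;
 * one large contiguous buffer enters it once. */
```

The fix is therefore not a faster kernel but fewer, larger handoffs to it.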
We can take a look at the BIO_write function that I (mostly) removed:
while (inl > 0) {
    n = inl > B64_BLOCK_SIZE ? B64_BLOCK_SIZE : inl;
    if ((BIO_get_flags(b) & BIO_FLAGS_BASE64_NO_NL) != 0) {
        ///////////////////
        // SECTION #1
        ///////////////////
        if (ctx->tmp_len > 0) {
            if (!ossl_assert(ctx->tmp_len <= 3)) {
                ERR_raise(ERR_LIB_BIO, ERR_R_INTERNAL_ERROR);
                return ret == 0 ? -1 : ret;
            }
            n = 3 - ctx->tmp_len;
            /*
             * There's a theoretical possibility for this
             */
            if (n > inl)
                n = inl;
            memcpy(&(ctx->tmp[ctx->tmp_len]), in, n);
            ctx->tmp_len += n;
            ret += n;
            if (ctx->tmp_len < 3)
                break;
            ctx->buf_len =
                EVP_EncodeBlock(ctx->buf, ctx->tmp, ctx->tmp_len);
            if (!ossl_assert(ctx->buf_len <= (int)sizeof(ctx->buf))) {
                ERR_raise(ERR_LIB_BIO, ERR_R_INTERNAL_ERROR);
                return ret == 0 ? -1 : ret;
            }
            if (!ossl_assert(ctx->buf_len >= ctx->buf_off)) {
                ERR_raise(ERR_LIB_BIO, ERR_R_INTERNAL_ERROR);
                return ret == 0 ? -1 : ret;
            }
            /*
             * Since we're now done using the temporary buffer, the
             * length should be 0'd
             */
            ctx->tmp_len = 0;
            ///////////////////
            // SECTION #1 end
            ///////////////////
        } else {
            ///////////////////
            // SECTION #2
            ///////////////////
            if (n < 3) {
                memcpy(ctx->tmp, in, n);
                ctx->tmp_len = n;
                ret += n;
                break;
            }
            n -= n % 3;
            ctx->buf_len =
                EVP_EncodeBlock(ctx->buf, (unsigned char *)in, n);
            if (!ossl_assert(ctx->buf_len <= (int)sizeof(ctx->buf))) {
                ERR_raise(ERR_LIB_BIO, ERR_R_INTERNAL_ERROR);
                return ret == 0 ? -1 : ret;
            }
            if (!ossl_assert(ctx->buf_len >= ctx->buf_off)) {
                ERR_raise(ERR_LIB_BIO, ERR_R_INTERNAL_ERROR);
                return ret == 0 ? -1 : ret;
            }
            ret += n;
        }
        ///////////////////
        // SECTION #2 end
        ///////////////////
    } else {
        ///////////////////
        // SECTION #3
        ///////////////////
        if (!EVP_EncodeUpdate(ctx->base64, ctx->buf, &ctx->buf_len,
                              (unsigned char *)in, n))
            return ret == 0 ? -1 : ret;
        if (!ossl_assert(ctx->buf_len <= (int)sizeof(ctx->buf))) {
            ERR_raise(ERR_LIB_BIO, ERR_R_INTERNAL_ERROR);
            return ret == 0 ? -1 : ret;
        }
        if (!ossl_assert(ctx->buf_len >= ctx->buf_off)) {
            ERR_raise(ERR_LIB_BIO, ERR_R_INTERNAL_ERROR);
            return ret == 0 ? -1 : ret;
        }
        ret += n;
        ///////////////////
        // SECTION #3 end
        ///////////////////
    }
    ///////////////////
    // SECTION #4
    ///////////////////
    inl -= n;
    in += n;
    ctx->buf_off = 0;
    n = ctx->buf_len;
    while (n > 0) {
        i = BIO_write(next, &(ctx->buf[ctx->buf_off]), n);
        if (i <= 0) {
            BIO_copy_next_retry(b);
            return ret == 0 ? i : ret;
        }
        n -= i;
        ctx->buf_off += i;
        if (!ossl_assert(ctx->buf_off <= (int)sizeof(ctx->buf))) {
            ERR_raise(ERR_LIB_BIO, ERR_R_INTERNAL_ERROR);
            return ret == 0 ? -1 : ret;
        }
        if (!ossl_assert(ctx->buf_len >= ctx->buf_off)) {
            ERR_raise(ERR_LIB_BIO, ERR_R_INTERNAL_ERROR);
            return ret == 0 ? -1 : ret;
        }
    }
    ctx->buf_len = 0;
    ctx->buf_off = 0;
    ///////////////////
    // SECTION #4 end
    ///////////////////
}
The way it worked was a bit curious, given what we now know about EVP_EncodeUpdate, but it went something like this:
In the no-newlines case, section #2 encoded the input up to a multiple of 3 bytes and staged the remainder for section #1.
The first branch (section #1) encoded that remainder before sending the rest to section #4.
In the newlines case, section #3 simply called the EVP_EncodeUpdate function from earlier.
Newlines or not, the result was written to buf[] and handed to section #4.
Section #4 dealt with interacting with the rest of the BIO system.
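The n -= n % 3 trimming in section #2 follows from base64’s geometry: every 3 input bytes map to 4 output characters, so only a multiple of 3 can be encoded without padding, and the 0-2 leftover bytes must be staged. A sketch (helper names are mine, not OpenSSL’s):

```c
#include <stddef.h>

/* Base64 maps 3 input bytes to 4 output chars, so a streaming encoder can
 * only consume input in 3-byte groups; the 0-2 leftover bytes are staged
 * in tmp[] until more input arrives. (Illustrative helpers, not OpenSSL's.) */
static size_t encodable_now(size_t n) { return n - n % 3; }
static size_t staged_bytes(size_t n)  { return n % 3; }
static size_t output_chars(size_t n)  { return (n / 3) * 4; }
```

So a 1024-byte chunk encodes 1023 bytes immediately and stages 1 byte for the next call, which is exactly the handoff between sections #2 and #1.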
There were a lot of questions that weren’t obvious just from the source code, like:
How did this strange situation, where two functions did almost the same thing, come to be? Why was there a 1024-byte buffer to begin with, when there was already an 80-byte buffer inside the API function?
Was there any particular reason why it tried sending it to the next BIO_write after only 1024 bytes?
It made sense to chunk in certain contexts, but not in the context of base64… I intuited that some kind of networking BIO module would be a far more appropriate place for this logic.
So here I had to go read up on some other BIO_write examples. Thankfully, there seemed to be nothing in the others that pointed to similar restrictions.
Then there was another question: could I just reuse the buffer? Was it really worth an extra allocation?
Since I was allocating memory, could someone declare a length of a quadrillion bytes and make whatever computer was using it explode? Who was using it?
There was another buffer, shared by all BIOs, that governed how many bytes were handed over at a time; this one had a default value and was modifiable via the CLI with the -bufsize argument. On my home computer, setting it to the size of the L1 cache was the sweet spot in terms of performance, but in theory, the end user could request as much memory as their rig had available.
Given the context, I felt that if the end user was tweaking that much, they probably knew what they were doing, and I should not make the decision for them. So I took a chance and added the encoded_buf pointer in case someone wanted a gigantic buffer.
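The shape of that addition, reduced to a standalone sketch (names and error handling are mine; the real patch works with OPENSSL_malloc and the BIO_B64_CTX fields):

```c
#include <stdlib.h>

/* Grow-on-demand buffer, standing in for the encoded_buf/encoded_buf_len
 * pair added to BIO_B64_CTX. (My sketch; not the actual OpenSSL code.) */
typedef struct {
    unsigned char *buf;
    size_t len;
} growbuf;

/* Ensure capacity for `needed` bytes; never shrinks.
 * Returns 1 on success, 0 on allocation failure (old buffer is kept). */
static int growbuf_reserve(growbuf *g, size_t needed)
{
    unsigned char *p;

    if (g->len >= needed)
        return 1;
    p = realloc(g->buf, needed);
    if (p == NULL)
        return 0;
    g->buf = p;
    g->len = needed;
    return 1;
}
```

Leaving the size entirely to the caller is the deliberate tradeoff discussed above: a user asking for a gigantic buffer is trusted to know their machine.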
A lot of what this loop did seemed redundant, and the job was already done by our function, so I cut the most there. There were safety checks, but they were specific to the tmp[] and buf[] buffers; since I no longer used those, I didn’t feel the need to keep the checks.
Conclusion
It all sounds a lot more sophisticated than it actually is:
For the most part, it was a lot of banging my head on the wall, reading code, poking at it and whiteboarding.
These kinds of questions sound silly in isolation. There’s no grand narrative that will make it blow up on social media, but they do add up over time, and you can’t really budget for them until you get your hands dirty. The other thing I didn’t really mention is that there’s also a lot of time spent just finding where things are and how they work.

