Adventures in non-deterministic code (Always initialize your structs!)

I've been working on an old code base lately. It's a relatively simple data-in, data-out kind of program, but it's also a bit messy: written in C using outdated techniques, and suffering from memory leaks. So I am going through the process of cleaning it up and porting it to C++ so the code can be made simpler and easier to manage.

Since I didn't write the original code, my first thought was to create some sort of regression tests. My plan was simple: run the code on a bunch of varied input files and save the output. Then, on every subsequent run, I could just diff the results against the "known good" ones, and I should be good to go... right?

Sadly, this was not the case.


I started making my changes, initially just to the build system, since I've become a big fan of CMake. It seemed like a harmless enough change, but just for good measure, I went and ran the regression tests.

Failure.

Um... what? I hadn't even changed any code yet! WTF was going on? In frustration, I ran the test again, and amazingly, it passed this time. Sadly, this code had a bit of non-determinism to it: I was dealing with a case of undefined behavior, or "UB" as many in the C and C++ communities like to call it.

I scoured the code, adding zero initializers to every local variable I could find, and made sure all heap allocations were done with calloc so they would be zero-initialized. Still no luck; it would still fail, seemingly at random, about half the time.
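
For the curious, this is the flavor of change I mean, as a minimal sketch (the type and names here are placeholders, not code from the real project):

#include <stdlib.h>

/* placeholder type, not from the real project */
typedef struct {
    int id;
    double value;
} record_t;

void example(void) {
    record_t scratch = {0};                     /* locals: explicit zero initializer */
    record_t *heap = calloc(16, sizeof *heap);  /* heap: calloc zero-fills, malloc does not */

    /* ... use scratch and heap ... */
    (void)scratch;
    free(heap);
}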

I ran it with LLVM's ASan: nothing, a clean bill of health. In fact, it always passed when running with ASan enabled! (In hindsight, that's not too surprising; ASan is built to catch out-of-bounds accesses and use-after-free bugs, while reads of uninitialized memory are MemorySanitizer's territory.) I knew it was UB, since the only input to the code was the data files to process, but I just couldn't figure out where it was.

Then I started going down the wrong path. I thought, hey, maybe I'm hitting some really obscure compiler bug. I started playing with compiler flags and noticed a strange pattern: with specific optimization settings, I could get it to pass reliably. My advice to anyone experiencing this kind of problem is simple: don't dive down that rabbit hole. Sure, you may find some flags that "make it work", but if you do, you haven't really addressed the underlying problem. You will only have given yourself a justification to consider it "good enough" and move on without giving it further thought.

But the code is still broken, and future compilers may not optimize in your favor.

I decided on a new tactic (which maybe should have been my first one): carefully look at the diffs and try to determine what was different.

Aha!

The difference was always that the Unix timestamps were off, and when they were wrong, they were wrong by exactly 3600 seconds. "Well, that's super weird," I thought. The code never cares about the current system time; it only parses times from the source files. I looked at the code and found this function:

time_t parse_date(const char *str) {
    struct tm t;

    if (!strptime(str, "%a, %d %b %Y %H:%M:%S %Z", &t)) {
        fprintf(stderr, "Invalid date: %s.\n", str);
        return 0;
    }

    return mktime(&t);
}

If you've spotted the bug, well, kudos to you, because it took me more time than I'd like to admit. This code looked good to me! I have a structure, I immediately fill it in with strptime, and I don't use it at all if there is any error.

First I confirmed that this code was the problem by feeding it hardcoded values, and I was able to get it to give me two different results for the same string. Still off by exactly 3600 seconds.
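
A minimal harness along these lines is enough to show it (the date string is just an example of mine, and whether the mismatch actually reproduces depends entirely on what garbage happens to be sitting on the stack):

#include <stdio.h>
#include <time.h>

time_t parse_date(const char *str);   /* the buggy version above */

/* Scribble over some stack space so the next call to parse_date
 * sees different garbage in its uninitialized struct tm. */
static void dirty_stack(void) {
    volatile char junk[512];
    for (size_t i = 0; i < sizeof(junk); i++)
        junk[i] = (char)0xAB;
}

int main(void) {
    const char *s = "Tue, 01 Mar 2016 12:00:00 GMT";

    printf("%ld\n", (long)parse_date(s));
    dirty_stack();
    printf("%ld\n", (long)parse_date(s));   /* sometimes differs by exactly 3600 */
    return 0;
}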

I was ready to basically throw my hands in the air in defeat when I finally made the smart decision to read the man page for strptime and saw this passage:

In principle, this function does not initialize tm but stores only the values specified. This means that tm should be initialized before the call. Details differ a bit between different UNIX systems. The glibc implementation does not touch those fields which are not explicitly specified, except that it recomputes the tm_wday and tm_yday fields if any of the year, month, or day elements changed.

BINGO
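
Sure enough, a quick sanity check shows exactly that behavior (the sentinel trick and the date string here are just my illustration, not the original test data):

#define _XOPEN_SOURCE   /* for strptime */
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void) {
    struct tm t;
    memset(&t, 0x7f, sizeof t);   /* fill every field with an obvious sentinel */

    strptime("Tue, 01 Mar 2016 12:00:00 GMT", "%a, %d %b %Y %H:%M:%S %Z", &t);

    /* glibc fills in the fields it parsed, but leaves tm_isdst alone */
    printf("tm_hour  = %d\n", t.tm_hour);    /* 12 */
    printf("tm_isdst = %d\n", t.tm_isdst);   /* still the 0x7f7f7f7f sentinel */
    return 0;
}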

The fix was to simply memset the structure before usage:

time_t parse_date(const char *str) {
    struct tm t;
    memset(&t, 0, sizeof(struct tm));

    if (!strptime(str, "%a, %d %b %Y %H:%M:%S %Z", &t)) {
        fprintf(stderr, "Invalid date: %s.\n", str);
        return 0;
    }

    return mktime(&t);
}

or better yet, once I converted it to C++, where an empty brace initializer value-initializes every field for you:

time_t parse_date(const char *str) {
    struct tm t = {};

    if (!strptime(str, "%a, %d %b %Y %H:%M:%S %Z", &t)) {
        fprintf(stderr, "Invalid date: %s.\n", str);
        return 0;
    }

    return mktime(&t);
}

Finally, I had fixed my bug. I was happy. But I did have one last lingering question...

Why was it off by exactly 3600 seconds every time it was wrong?

I took a look at the structure and it finally hit me...

struct tm {
    int tm_sec;    /* Seconds (0-60) */
    int tm_min;    /* Minutes (0-59) */
    int tm_hour;   /* Hours (0-23) */
    int tm_mday;   /* Day of the month (1-31) */
    int tm_mon;    /* Month (0-11) */
    int tm_year;   /* Year - 1900 */
    int tm_wday;   /* Day of the week (0-6, Sunday = 0) */
    int tm_yday;   /* Day in the year (0-365, 1 Jan = 0) */
    int tm_isdst;  /* Daylight saving time */
};

It was the tm_isdst field! If that field wasn't initialized (and it wasn't), then mktime would apply the daylight saving time offset, or not, based on whatever garbage bytes happened to be in that memory when the function ran. And 3600 seconds is exactly one hour: the DST offset.
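
You can see the effect in isolation with something like this (the timezone and date are arbitrary picks of mine, not values from the original data; any zone that observes DST will do):

#define _POSIX_C_SOURCE 200112L   /* for setenv */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* Pick a zone that observes DST and a date that falls inside it. */
    setenv("TZ", "America/New_York", 1);
    tzset();

    struct tm base = {0};
    base.tm_year = 2016 - 1900;   /* 1 Jul 2016, 12:00:00 local time */
    base.tm_mon  = 6;
    base.tm_mday = 1;
    base.tm_hour = 12;

    struct tm a = base, b = base;
    a.tm_isdst = 0;               /* claim DST is not in effect */
    b.tm_isdst = 1;               /* claim DST is in effect */

    /* Same broken-down time, different tm_isdst: the results differ by one hour. */
    printf("difference: %ld seconds\n", (long)(mktime(&a) - mktime(&b)));
    return 0;
}

With that zone and date, the two calls come out exactly 3600 seconds apart.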

So the moral of the story: don't assume that library code that fills in a structure for you fills in all of it, unless the documentation specifically says so. Better yet, assume that it doesn't unless you know otherwise, and just zero-initialize your structs.