Performance Optimizations Can Have Unexpectedly Large Effects When Combined With Caches

This post is about a non-obvious interaction between performance optimizations and LRU/time limited caching.

In 2017, I was working a major performance issue. We were onboarding a large customer, and a batch process was taking 3 hours1, long enough to mean that trucks would be waiting at the dock, shipments would be fulfilled a day late, etc.

We were able to run the whole process under a sampling profiler, and found that the foo method was taking roughly 50% of the time. Fortunately, it wasn't hard to see room for improvement. The details were tricky, but after some work, I thought I had a solid improvement. I could avoid at least half of the work in the foo method, so perhaps I could shave 45 minutes off the whole process.

We uploaded code to the test environment, expecting to wait a few hours, but the entire process finished in 45 minutes. Instead of a 45 minute savings, we'd shaved off 2 hours and 15 minutes. My first instinct was that we had a bug--nothing is faster than throwing an exception, or skipping all the real work.

In fact, the code was fine. What I didn't realize is that the code pulled data from caches that expired as time went on, and a significant portion of the original 3 hours was spent fetching data that had expired from the cache. After optimizing one slow portion of the code, other portions were reaching from the cache and the overall speedup was larger than seemed possible.

It's common wisdom that systems with caches can fail badly when the caches are cold. When every request misses the cache, the system is overloaded, and never recovers. What I observed was the mirror image.

Thanks to Stuart Cook for pointing out that I was describing the opposite of the familiar cold cache situation.

Footnotes


  1. It's been a minute, and I no longer have the profiles or detailed notes. I have a good memory for figures, but it's possible that 50% was 40%, or 45 minutes was 30 minutes. But it really was 3 hours--when you're staying late to fix a problem and it takes 3 hours to get data, that sticks with you.

    Taking better notes is an important skill. I take better notes these days, but could double my effort on note-taking before I hit diminshing returns.