What Goes Down, We Made Go Up

The Story of Terminals' Uptime

The Start

Hitting your favourite website and finding it totally unresponsive, or even completely down, sucks. When it’s the website you’re paid to maintain and improve, it’s a completely different nightmare. This was our situation when I joined the Terminals team in 2018.

Starting at Terminals was a dream career move for me. I adore video games, I dream in code, and I'm extremely passionate about pushing optimisations, standardisation, and speed improvements. However, Terminals.io was in an interesting state when I first delved into the code. Let's talk about that.

The Problem

Our problem was uptime. At its lowest point, we were hitting a pathetic 77% uptime. That's frankly disastrous. To put that into perspective, that's 7 full days (and 41 seconds) of downtime in an average month. Thankfully that was the exception rather than the rule. Our average was closer to 92-94%, which is still around 2 full days of downtime per month. Completely unacceptable.
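If you want to sanity-check those figures, the maths is simple. Here's a quick back-of-the-envelope sketch in PHP (our stack), assuming an "average" month of 365.25 / 12 ≈ 30.44 days; the exact numbers shift by a few seconds depending on how long you call a month.

<?php
// Back-of-the-envelope downtime maths for a given uptime percentage.
// Assumes an "average" month of 365.25 / 12 ≈ 30.44 days.

function monthlyDowntimeSeconds(float $uptimePercent): float
{
    $secondsPerMonth = (365.25 / 12) * 24 * 60 * 60;
    return $secondsPerMonth * (1 - $uptimePercent / 100);
}

foreach ([77.0, 93.0, 99.999] as $uptime) {
    $down = monthlyDowntimeSeconds($uptime);
    printf("%.3f%% uptime = %.0f seconds (%.2f days) of downtime per month\n",
        $uptime, $down, $down / 86400);
}
// 77.000% uptime = 604854 seconds (7.00 days) of downtime per month
// 93.000% uptime = 184086 seconds (2.13 days) of downtime per month
// 99.999% uptime = 26 seconds (0.00 days) of downtime per month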

We had two main challenges to deal with before any real features could be worked on: intermittent DDoS attacks, and servers hitting 100% CPU usage during peak times. I started with the DDoS attacks, as we couldn't really continue if we had no access to the servers themselves.

Cloudflare to the Rescue! Kinda…

Terminals' problems were legion, and I was the only person paid to fix them (I'll elaborate on that later). First, a little history of Terminals' codebase. Terminals was built in 2014, and launched in 2015, using a proprietary PHP framework. While that's not abnormal, especially for that time, the framework was built for another purpose. It was solidly built, but it was developed by a very small team and not really specced for what Terminals ended up needing.

While running my usual mitigation methods for a DDoS attack, I also had to plan and execute a move to Cloudflare, on infrastructure I was barely familiar with, running an unknown codebase. Not an ideal situation to begin with, and the size of our existing infrastructure meant we took another 4 hours of downtime during the migration. Luckily I'm in an offset timezone, so I could get all this work done before my co-workers even woke up. With Cloudflare now protecting us, our problems should be over, right? Not quite.

Cloudflare is a distributed denial of service (DDoS) mitigation service, which is a lot of words for "it helps limit attacks on your website". Note that I said "helps limit attacks" and not "stops attacks". While Cloudflare can deal with obvious DDoS attempts, like a single pattern hammering a login form, our attacks were coming from an unusual source: "Internet of Things" devices. These are Nest thermostats, webcams, smart speakers, and other small devices connected to the internet, all using randomised browser information, random IP addresses, and random attack vectors. To Cloudflare, this looked like normal traffic, and that was a problem.

We spent a lot of time tweaking rules, changing configurations, and generally monitoring the website for the next wave of attacks. Eventually we found settings that worked for us, but at the cost of a lot of downtime. We had hit the lowest point. 77% uptime. I feel personally ashamed even mentioning it. However, the tides were turning.

Terminals' servers were no longer struggling under a DDoS attack, but we had another problem: inboxes, a feature we had in Terminals 1.0 but had to take back to the drawing board for a multitude of reasons. If you're unfamiliar with how Terminals worked back then, inboxes were our own internal messaging system. Users could talk directly to PR reps or reply to newsletters and have it tracked internally. Sounds great, right? While it worked in concept, the implementation ended up being deeply flawed. Every email we sent created a wave of database reads and writes, and every user sending us a message caused a backlog of actions all happening at once, with no queueing system in place. Nightmare. Our infrastructure was capable, but capable can't compete with a fundamentally broken feature. Worse, inboxes were so tightly integrated with Terminals that all I could do was patch them and plan for an inbox-less future, one that eventually came in the form of Terminals 2.0. But that alone did not solve our problem of regular, disruptive downtime.
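To make the "no queueing system" point concrete, here's roughly the shape of fix this kind of feature needs: decouple receiving a message from processing it. This is a minimal sketch, not our actual code; the inbox_jobs table, its columns, and the worker function are all hypothetical.

<?php
// Minimal sketch of a queued inbox pipeline (hypothetical schema, not our real code).
// Instead of doing every read/write the moment a message arrives, the web request
// only enqueues a job; a separate worker drains the queue at a pace the DB can handle.

$pdo = new PDO('mysql:host=localhost;dbname=terminals', 'app', 'secret', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// 1) Hot path: the web request just records the raw message and returns.
function enqueueInboxMessage(PDO $pdo, int $userId, string $body): void
{
    $stmt = $pdo->prepare(
        "INSERT INTO inbox_jobs (user_id, body, status, created_at)
         VALUES (:user_id, :body, 'pending', NOW())"
    );
    $stmt->execute([':user_id' => $userId, ':body' => $body]);
}

// 2) Cold path: a cron/CLI worker processes a small batch at a time.
function processPendingInboxJobs(PDO $pdo, int $batchSize = 50): void
{
    $jobs = $pdo->query(
        "SELECT id, user_id, body FROM inbox_jobs
         WHERE status = 'pending' ORDER BY id LIMIT $batchSize"
    )->fetchAll(PDO::FETCH_ASSOC);

    foreach ($jobs as $job) {
        // ...do the expensive fan-out work here (threads, notifications, etc.)...
        $pdo->prepare("UPDATE inbox_jobs SET status = 'done' WHERE id = :id")
            ->execute([':id' => $job['id']]);
    }
}

The point isn't the specific storage (a jobs table, Redis, or a proper message broker would all do); it's that the user-facing request finishes quickly and the heavy lifting happens at a rate the database can actually keep up with.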

The Turnaround

With the launch of Terminals 2.0, we cemented all the fixes needed to bolster our defences. We introduced an authentication gate, stronger two-factor authentication, specific Cloudflare protections, and global protections for our users. After some initial teething problems with the 2.0 launch, our uptime was sitting at an average of 98%. The only cause of downtime now was when we pushed new code: every new feature and every bug-fix needed a 3-6 minute pause while the changes were pulled in, and that added up over time. Still, compared to where we came from, the situation was extremely "livable" while I worked on other features and bug-fixes.

The final pieces of the puzzle fell into place gradually, as I added more mitigations to the build process. For example: a build process that runs behind the scenes, then swaps the old code out for the new code instantaneously; multiple failover servers, in case of an incident on the frontend server; the ability to roll code back, even database changes, with one command; and plenty more small tweaks to the process.
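To give a flavour of the "swaps out the old code instantaneously" part: the common trick is a Capistrano-style layout where each deploy lands in its own releases/ directory and a "current" symlink, which the web server serves from, is flipped atomically. The sketch below is illustrative only; the paths, layout, and function name are assumptions, not our exact deploy scripts.

<?php
// Zero-downtime swap sketch (illustrative paths, not our real deploy scripts).
// Each deploy is built into its own directory under releases/, then the
// "current" symlink the web server points at is flipped in one atomic step.

const DEPLOY_ROOT = '/var/www/terminals';

function activateRelease(string $releaseId): void
{
    $releasePath = DEPLOY_ROOT . '/releases/' . $releaseId;
    $currentLink = DEPLOY_ROOT . '/current';   // web server docroot points here
    $tmpLink     = $currentLink . '.next';

    if (!is_dir($releasePath)) {
        throw new RuntimeException("Release not found: $releasePath");
    }

    // Build a new symlink beside the live one, then rename() it into place.
    // rename() within one filesystem is atomic, so no request ever sees a half-swap.
    if (is_link($tmpLink) || file_exists($tmpLink)) {
        unlink($tmpLink);
    }
    symlink($releasePath, $tmpLink);
    rename($tmpLink, $currentLink);
}

// Rolling back is just pointing "current" at the previous release again, e.g.:
// activateRelease('2022-10-23-1412');

Database rollbacks are the harder half of that story; reversible migrations, where every schema change ships with a matching "down" step, are the usual way to make a one-command rollback cover the database too.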

Here's the good news: as of last month, our uptime for the year stands at 99.999%, which works out to roughly five minutes of total downtime across an entire year, and we're very proud of that. This entire post is basically an excuse for me to boast about something I'm personally very happy we've achieved.

The Struggle

As I mentioned above, for the majority of my time here at Terminals Technology Inc. (yes, that's our company name, and yes, we have a LinkedIn account now) I've been the sole frontend developer, backend developer, sysadmin, schema designer, technical planner, and whatever other hat I needed to wear that day. This may sound bleak, but it is not. It really isn't. I have an excellent admin team (check out our team page), an incredible management structure (Tom Ohle, CEO, and Candace Ohle, CFO), and the support of the entire Evolve PR team (over 30 people now!).

Why has it taken so long? Well, it comes down to a balancing act between creating new features, squashing bugs, and keeping all the server plates spinning. All things I absolutely thrive on, and honestly, I love waking up every single day to dive into the next thing. I’ve been working in this industry for nearly 20 years now, and I can safely say that this is, and will continue to be, the best job I’ve ever had.

Speaking of which, would you like to work with me? We have a job posting for a junior web developer. If you're living in Canada and want to join me on this adventure, I suggest checking out that posting. If you're reading this in the distant future and that position has been filled, we'd still love to hear from you. Drop us an email at jobs [at] terminals.io, and let's talk!

The Future

We have been secretly building a status page for all our products, which will give everyone a public view of our uptime numbers, along with more juicy numbers for all you data nerds (maybe that's just me). We'll be talking about it more in the coming months, so keep an eye on the Terminals Twitter for more information.

NB: This is my first “real” blog post, and I welcome feedback. We plan on posting more technical posts in the future, so if there’s something you would like to hear more about, let’s talk! Send us your recommendations so we know which parts of the sausage you want to see get made.

Published October 24, 2022
Last updated October 24, 2022