"When you're entertaining more than 200 million people with your music streaming service, the stakes are a little higher than usual."
As an early adopter of containerized microservices, Spotify was off to a great start. The year was 2016 and, like everybody else, Spotify hosted those web services across its fleet of VMs.
Helios was everything Spotify could expect from a container orchestration system. It had all the features Spotify needed at that time. Because it was a homegrown system, adding new features was no big deal for the small team at Spotify dedicated to the project.
Spotify addressed the issues with monolithic applications when microservices and containerization were still buzzwords. They saw the potential and started breaking their web services up into microservices and putting them into Docker containers.
In 2017, Kubernetes was still the new kid on the block. With Helios already in place, Spotify could have given the new container orchestration system a pass.
Sure, Kubernetes had a growing community, but it was nowhere near the shining new way to manage containers that it is today. Then again, Kubernetes had all the bells and whistles, the same kind of promise that had inspired Spotify to adopt a microservices architecture three years earlier.
By the middle of the year, the team at Spotify had made up its mind. Developing Helios wasn't worth the effort when Kubernetes was backed by one hell of a community. It had more features than the team was ever going to need. Of course, today, Spotify doesn't regret the decision it made a few years back. But migrations aren't very desirable when more than a million people are hooked to your service at any given point. Then again, music streaming is quite a crowded market, and when you are up against the likes of Google, Apple, Microsoft, and Amazon, it is a matter of survival. You have to push new features faster than these multi-billion-dollar giants do and still remain competitive. Then there are industry-wide best practices you must adhere to. Moreover, Spotify could reassign its Helios engineers as contributors to the Kubernetes open-source project and extend its influence in the community. Joining the community early meant the music streaming company would have an upper hand in future decisions. It was now or never for the team at Spotify, and Kubernetes ticked all the boxes.
The Kubernetes migration did not kick in until late 2018. The rest of the year went by working around core migration issues. When you're migrating more than 200 million users with their own playlists, favorites, preferences, and so on to a new platform, there are many things you can break. The team settled on a parallel migration from Helios to Kubernetes.
The migration is still going on and is expected to take much of 2020, but there is already good news coming.
For the Spotify services that have been migrated to Kubernetes, developers are adding more new features than ever thanks to greater velocity and automatic capacity provisioning. What would take teams hours on Helios takes minutes, even seconds, with Kubernetes. They can now test a feature and have it ready to go into production. With continuous delivery, this can happen in seconds.
Moreover, Kubernetes's bin-packing and multi-tenancy capabilities have been a blessing for the team at Spotify. The biggest service currently running on Kubernetes takes about 10 million requests per second. With Kubernetes, CPU utilization has improved two- to threefold. Services at this scale were impossible on Helios.
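To see why bin-packing helps utilization, here is a minimal sketch in Python. It is not the Kubernetes scheduler's actual algorithm; it uses a simple first-fit-decreasing heuristic, and the CPU requests and node size are made-up numbers chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    capacity: int                      # CPU cores on the node (hypothetical)
    pods: list = field(default_factory=list)

    @property
    def used(self):
        # Total CPU requested by the pods placed on this node
        return sum(self.pods)


def first_fit_decreasing(requests, node_capacity):
    """Pack CPU requests onto equal-sized nodes, largest request first."""
    nodes = []
    for req in sorted(requests, reverse=True):
        if req > node_capacity:
            raise ValueError(f"request {req} exceeds node capacity")
        for node in nodes:
            if node.used + req <= node.capacity:
                node.pods.append(req)
                break
        else:
            # No existing node has room, so start a new one
            nodes.append(Node(node_capacity, [req]))
    return nodes


def utilization(nodes):
    """Fraction of total node capacity that is actually requested."""
    total = sum(n.capacity for n in nodes)
    return sum(n.used for n in nodes) / total if total else 0.0


# Hypothetical workload: six services, each on its own 4-core VM
requests = [1, 1, 2, 2, 1, 1]
one_per_vm = [Node(4, [r]) for r in requests]
packed = first_fit_decreasing(requests, 4)

print(len(one_per_vm), utilization(one_per_vm))
print(len(packed), utilization(packed))
```

With these invented numbers, the same workload fits on fewer nodes once services share machines, which is the basic reason packing raises average CPU utilization. Real schedulers also weigh memory, affinity rules, and other constraints that this sketch ignores.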
"Did you know Goldman Sachs technology division constitutes one-fourth of its total workforce?"
In 2015, the division employed more than 10,000 engineers. The same year, Facebook had only 13,000 employees. If Spotify's staggering 200 million users made you nervous, Facebook had close to 1.5 billion active users to serve during the same period. Yet it was fine with 13,000. Goldman Sachs' XL-sized technology division tells an interesting story about the scale at which the company's IT infrastructure, software resources, and service portfolios operate.
When Goldman Sachs announced that it wanted to migrate its computing resources to Docker containers, it did not come as a surprise, since the banking firm already had a stake in the container-technology startup. The year-long migration project officially started in the winter of 2015. Amidst the Manhattan snow, the engineers at Goldman undertook the most ambitious technology migration project in the history of the American banking sector. The migration from Goldman's internal cloud, the Dynamic Computing Platform, to Docker containers would include close to 5,000 applications, the greater part of the firm's software infrastructure, or 90% of its computing assets.
Because it was a yearlong project, Goldman engineers could start small and gradually ramp up their migration efforts. In the first two months, the team migrated 2-3 applications, a pace that would not be enough given the sheer size of the application portfolio, so the team planned to accelerate the migration gradually. Once complete, the migration would free 8,000 software developers at the banking giant to focus on creating new products and tools to automate software delivery, thus reducing the cost of labor and infrastructure.
When you're operating at this scale, even a slight improvement in efficiency can result in serious cost savings. When a hedge fund needs access to a lot of virtual servers spanning several applications, Goldman can rely on container orchestration, such as Docker Swarm or Kubernetes, to generate the needed infrastructure on demand. Without such a service, the exercise would take months to execute and would take a toll on the bank's finances.
For a bank that is a direct investor in Docker Inc, picking a competitor isn't the most desirable prospect, but it has little choice when it must deploy thousands of containers across hundreds of machines. Engineers at Goldman balance the equation by using Swarm for smaller deployments of 3-4 containers on a few machines and Kubernetes for everything else. Of course, that was five years ago; now we all know Docker has embraced Kubernetes with open arms and is, in fact, one of its biggest promoters.
If we forget for a while that Goldman Sachs is a bank, it wouldn't be hard to call it a technology company.
The largest Kubernetes deployment on Google Container Engine ever
Google created Customer Reliability Engineering (CRE) to bridge the gap between Google technical teams and customer teams. The idea was to allow shared responsibility between the teams for the reliability and success of critical cloud applications.
CRE was set for a smooth start, except that its first client was Niantic. Who hasn't heard of Pokémon Go? Yet few know how it came to be one of the most popular mobile games ever.
If you have ever launched a mobile app on the app stores, you know that app adoption grows gradually over weeks, even months, and that new features and architectural changes are added over time. But user adoption of Pokémon Go was anything but typical. Pokémon Go has not one or two but five world records to its credit, including the most downloaded mobile game and the most revenue grossed by a mobile game in its first month.
Did you know Pokémon Go was also the largest Kubernetes deployment on Google Container Engine ever? This is going to be interesting.
Fifteen minutes into the Australia and New Zealand launch, the game had already exceeded expected user traffic, giving the team a fair idea of what the US launch would look like the next day.
Perhaps too many people wanted to catch 'em all. Niantic called Google CRE for help, and together they provisioned extra capacity in anticipation of the US launch.
They expected the traffic to surpass the estimated volume but never the worst-case scenario, which was 5X that estimate. When the traffic passed the worst-case scenario and started approaching a mark never seen in a mobile app before, they knew they had a problem. Americans were going crazy catching AR-emulated Pokémon, and Niantic and Google simply couldn't keep up. Players reported all sorts of issues: subpar performance, failed sign-ins, crashes, and more.
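The capacity arithmetic behind that planning can be sketched in a few lines of Python. The function names and every traffic figure below are hypothetical, chosen only to illustrate a 5X worst-case plan; they are not Niantic's or Google's actual numbers.

```python
def provisioned_capacity(target_rps, worst_case_multiplier=5):
    """Capacity to provision if the plan covers a worst case of N times the target."""
    return target_rps * worst_case_multiplier


def overload_factor(observed_rps, target_rps, worst_case_multiplier=5):
    """Observed traffic divided by provisioned capacity; above 1.0 means overloaded."""
    return observed_rps / provisioned_capacity(target_rps, worst_case_multiplier)


# Hypothetical figures: plan for 1,000 requests/s, provision for 5X that
target = 1_000
capacity = provisioned_capacity(target)

for observed in (800, 4_000, 12_000):
    factor = overload_factor(observed, target)
    status = "within plan" if factor <= 1.0 else "beyond worst case"
    print(observed, capacity, round(factor, 2), status)
```

Any observed value that pushes the factor above 1.0 corresponds to the situation described above, where traffic blew past even the worst-case estimate.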
Niantic and Google engineers worked together on the problem, but that did not stop thousands of players from joining the game every day, and Japan's launch was just around the corner. Niantic did not want the issues to arise a third time. Not to mention, CRE's reputation was at stake with its very first client. Niantic had chosen GKE for its ability to orchestrate its container cluster at an unparalleled scale so that its team could ship live changes to its players.
Google Cloud was supposed to mean that Niantic could continuously adapt and improve Pokémon Go regardless of the number of users or the amount of traffic. At that point, that did not seem to be happening. But Google Cloud was as clueless as anybody else. Pokémon Go was a first for them too: they had never provisioned capacity at this scale, and they were already limited by their current implementation of GKE. Were they going to postpone the launch in Japan, the birthplace of Pokémon?
Upgrades and migrations can make or break an application. Nevertheless, Niantic and the Google CRE team had to upgrade GKE to allow additional nodes to be added to the cluster. What if I told you to replace the engine of a car? Simple, huh? Did I mention the car is in motion? Thanks to intelligent capacity provisioning and an upgrade to the latest version of GKE, the game launched in Japan without incident, even though there were three times as many players as at the US launch. Kubernetes is one reason why games like Pokémon Go and Fortnite are a reality today.
The idea of Kubernetes revolves around accelerating delivery in our development workflows. It's about prioritizing what your users seek rather than what you can deliver with your current capacity. We are indeed in the golden age of software development. If you can think of it, you can deliver it.