“Reliability at scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest disruption has significant financial consequences and affects customer trust.”
This was the first line of the highly influential paper “Dynamo: Amazon’s Highly Available Key-value Store.” Published in 2007, it arrived at a time when the status quo of database systems wasn’t working for the massive explosion of Internet-based applications. A team of engineers and computer scientists at Amazon completely rethought data storage in terms of what the future would need, while keeping a firm foundation in the computer science of the past.
They were trying to solve an immediate problem, but they inadvertently launched a revolution in distributed databases that would eventually collide with cloud-native applications.
The Original Cloud-Native Database
A year after the Dynamo paper, one of its authors, Avinash Lakshman, joined Prashant Malik at Facebook and built one of Dynamo’s many descendants: Cassandra. Working at Facebook, they faced scale problems that very few companies were dealing with at the time. Facebook in 2008 also ran on another principle: move fast and break things. Remember the reliability that was high on Amazon’s wish list for Dynamo?
Facebook was challenging it daily with non-stop, frenetic growth. Cassandra was built on the cloud-native principles of scale and self-healing, keeping some of the world’s most critical workloads close to 100% uptime after being forged in the hottest fires. Now, with the release of Cassandra 4.0, we are seeing the start of the next step: an established database that the cloud-native applications of the future will be built on. The stage was set for a wide range of innovations, all standing on the shoulders of a giant, Dynamo.
The Prima Donna Comes to Kubernetes
It’s fair to say that the generation of databases before the NoSQL revolution drove a lot of innovation in the data center. Most of the time and money used to be spent on the “big” database server required to keep up with demand. We created some amazing data infrastructure on bare metal, which made the push to virtualize database workloads difficult in the early 2000s.
In most cases, the database infrastructure sat on dedicated hardware next to the application’s virtual systems. As cloud adoption grew, similar issues persisted. Ephemeral cloud instances worked great for web servers and applications, but “commodity” was a terrible word for your precious database. Moving from virtualization to containers only amplified the cries of “Never!” from database teams. Kubernetes bravely advanced with stateless workloads, and databases remained on the sidelines once again. Those days are numbered now. Architecture debt can grow without bound if left unchecked. Organizations don’t want to manage multiple flavors of infrastructure; each one requires hiring more people and tracking more things. When deploying virtual data centers with Kubernetes, the database has to be a part of it.
Some objections are valid when it comes to running a database in a container. The reasons we built specialized hardware for databases are the same reasons we need to care about certain parts of a containerized database: high-performance file systems; isolation from other containers that could contend for resources and degrade performance; and, with distributed databases such as Apache Cassandra, placing individual nodes so that a single hardware failure does not affect database uptime.
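That last concern, node placement, is something Kubernetes can express directly. As a minimal sketch (the pod label and namespace are illustrative assumptions), a pod anti-affinity rule keeps Cassandra pods off the same physical host, so losing one machine costs at most one node:

```yaml
# Illustrative pod-spec fragment: spread Cassandra pods across hosts.
# The "app: cassandra" label is a hypothetical value for this example.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: cassandra
        topologyKey: kubernetes.io/hostname   # one Cassandra pod per host
```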
Databases that predate Kubernetes are all finding ways to run on Kubernetes. The future of databases and Kubernetes requires replacing the word “on” with “in,” and that change has to happen on the database side. The state of the art in “runs on Kubernetes” is using operators to translate the way databases want to work into what Kubernetes wants them to do. Our bright future of “runs in Kubernetes” means databases that use more of what Kubernetes has to offer, letting it manage and coordinate resources for the basic operation of the database.
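To make the operator idea concrete, here is a hedged sketch of what declaring a Cassandra cluster to an operator can look like. It is modeled on the cass-operator `CassandraDatacenter` custom resource; the names and values below are illustrative, not authoritative:

```yaml
# Illustrative custom resource: declare a three-node Cassandra 4.0
# datacenter and let the operator reconcile pods, storage, and config.
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1
spec:
  clusterName: example-cluster   # hypothetical cluster name
  serverType: cassandra
  serverVersion: "4.0.0"
  size: 3                        # desired node count; the operator makes it so
```

The point is the register: you state the datacenter you want, and the operator translates that desire into the pods, services, and volumes Kubernetes understands.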
Ironically, this means many databases can remove entire parts of their code base by handing that functionality to Kubernetes, reducing the surface area for potential bugs and security flaws.
Cassandra is ready for what’s next
The recent release of Apache Cassandra 4.0 is a huge milestone for the project’s stability and maturity. The project is now looking ahead to future Cassandra releases built on this solid foundation. First and foremost: how can it support the larger ecosystem around it by becoming a solid base for other data infrastructure? Over the past decade, Cassandra has earned a strong reputation as a high-performance, flexible database. With the kinds of cloud-native applications we need to write, we’ll simply need more of that; interoperability will only become more important to Cassandra.
To imagine what cloud-native Cassandra will look like, we must look at how applications are deployed in Kubernetes. The idea of deploying one creaky monolith should be left in the same scrap heap as my old Sun E450 database server. Cloud-native applications are modular and declarative, and they adhere to the principles of scale, resilience, and self-healing. They get their control and coordination from the Kubernetes cluster and share resources with the other parts of the application. Capacity tracks the needs of the running application, and everything is coordinated with the application as a whole. The virtual data center operates as a single unit but can route around underlying hardware issues.
The Ecosystem as a First-Class Citizen
Cassandra’s future in Kubernetes is not about what it does alone; it’s about the new possibilities it opens for the system as a whole. Projects like Stargate create a gateway for developers to build API-based applications without interacting directly with the underlying data store: Data-as-a-Service, deployed by you, in your virtual data center, using Kubernetes. Cassandra itself might lean on enabling projects like OpenEBS to manage database-class storage or Prometheus to store metrics.
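As a hedged illustration of the gateway idea, the sketch below prepares a request against a Stargate-style REST endpoint using only the Python standard library. The URL, port, and token value are assumptions for a local deployment, not a definitive client; the application speaks JSON over HTTP and never touches the data store directly:

```python
import json
from urllib import request

STARGATE_URL = "http://localhost:8082"  # assumed local Stargate REST port
AUTH_TOKEN = "replace-with-auth-token"  # normally fetched from Stargate's auth service

def row_url(keyspace: str, table: str) -> str:
    """Build a Stargate-style REST v2 URL for the rows of a table."""
    return f"{STARGATE_URL}/v2/keyspaces/{keyspace}/{table}"

def insert_row(keyspace: str, table: str, row: dict) -> request.Request:
    """Prepare a POST that adds one row; Cassandra stays invisible behind the API."""
    return request.Request(
        row_url(keyspace, table),
        data=json.dumps(row).encode(),
        headers={"X-Cassandra-Token": AUTH_TOKEN,
                 "Content-Type": "application/json"},
        method="POST",
    )

# Against a running Stargate instance, this would execute the insert:
# request.urlopen(insert_row("shop", "orders", {"id": "42", "total": 99}))
```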
You may even find yourself using Cassandra without it being a direct part of your application. Projects like Temporal use Cassandra as the primary store for their persistence. When you have a data service that is easily deployed and scaled across multiple regions, it’s an obvious choice.
From the spark of innovation that started with Amazon’s Dynamo paper to the latest 4.0 release, Cassandra was destined to be the original cloud-native database we all need. The next 10 years of data on Kubernetes will see even more innovation as we take the database server out of its ivory tower and make it an equal player, as a data service, in the application stack.
Cassandra is built for this future and, with 4.0, is ready to go with its most stable release ever. If you are interested in joining the data-on-Kubernetes revolution, you can find an amazing community of like-minded individuals in the Data on Kubernetes community. If you would like to help make Cassandra the default Kubernetes data store, you can join us on the Cassandra project or, more specifically, on the Cassandra-on-Kubernetes project, K8ssandra.
If you’re new to Cassandra, Astra DB is a great (free) place to learn, with no infrastructure setup required.