May 5th, 2023 7:36am by Patrick McFadin
From dynamic data masking to ACID transactions, Apache Cassandra committers discuss some of the big and exciting changes in the upcoming 5.0 release.
Ask 10 different people their opinion and you’ll get 10 different answers. The Apache Cassandra open-source project is built and maintained by individuals who are motivated by different things. Some love new features. Some love squeezing all the performance they can out of the system. Some want to make operators’ lives easier. What ties them together? They’re working as a distributed team toward a single goal: an amazing database that keeps getting better.
Cassandra is a collaborative effort by engineers from different parts of the world who share a common goal of creating the best product possible. They tackle problems for their employers while contributing to the open-source code for the project. Those who earn the trust of the community and can make changes to the base code are called “committers.” Becoming a committer requires dedication and passion for the project. Recently, the project held an event called Cassandra Forward, where some of the committers shared their insights on the upcoming release of Cassandra 5.0. Here’s what they had to say.
Jon Haddad: Java 17 Support and Garbage Collectors
Haddad tells us he’s looking forward to support for Java 17 and its low-latency garbage collectors, such as ZGC, in Cassandra 5.0. The former Netflix and Apple developer, who’s been a Cassandra committer since 2017, says these collectors will provide sub-millisecond pause times and a “set and forget” model, making memory management less overwhelming for Cassandra users. As the project matures and memory management improves, GC pauses will become shorter and less frequent, making it easier to run denser nodes, which will save money for users.
“That means we’ll see less frequent GC pauses — and when they happen, they’ll take less time. This will make it easier to run denser nodes, meaning your cluster will be less expensive to run. I love the idea of saving money just by doing an upgrade.”
Andrés de la Peña: Dynamic Data Masking
De la Peña, a DataStax software engineer and a Cassandra committer since 2016, is enthusiastic about the dynamic data-masking feature in Cassandra 5.0, which enables sensitive information to be obscured while still allowing access to the masked columns. This feature replaces the real values of columns with generic data using a series of regular CQL functions that transform the cell values. Administrators can attach these masking functions to the columns of the table schema so that unprivileged users will always see masked data, even if they don’t specify the functions in the query. The set of available masking functions is relatively small at the moment, but users can supply their own user-defined functions (UDFs) for masking, making it easy to add custom types of masking.
“It is a security anonymization feature that is available in many databases out there and is long overdue in Cassandra.”
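To make the idea concrete, here is a minimal Python sketch of column masking. It is not Cassandra’s implementation: the `mask_inner` helper, the `MASKED_COLUMNS` mapping and the `read_row` function are hypothetical names chosen for illustration. They only model the behavior described above, where a masking function attached to a column is applied automatically for unprivileged users.

```python
from typing import Callable, Dict


def mask_inner(value: str, begin: int, end: int, pad: str = "*") -> str:
    """Keep the first `begin` and last `end` characters and mask the rest."""
    if begin + end >= len(value):
        return value  # too short: there is no inner part to mask
    inner = len(value) - begin - end
    return value[:begin] + pad * inner + value[len(value) - end:]


# Hypothetical schema metadata: column name -> masking function attached to it.
MASKED_COLUMNS: Dict[str, Callable[[str], str]] = {
    "email": lambda v: mask_inner(v, 1, 4),
    "ssn": lambda v: mask_inner(v, 0, 4),
}


def read_row(row: Dict[str, str], can_unmask: bool) -> Dict[str, str]:
    """Return the row as a given user would see it."""
    if can_unmask:
        return dict(row)
    # Unprivileged users get masked values without asking for them in the query.
    return {
        col: MASKED_COLUMNS.get(col, lambda v: v)(val)
        for col, val in row.items()
    }


row = {"name": "Alice", "email": "alice@example.com", "ssn": "123-45-6789"}
print(read_row(row, can_unmask=False))
print(read_row(row, can_unmask=True))
```

The point of attaching the function to the column rather than to the query is that the application code does not change: the same read returns clear or masked values depending only on who is asking.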
Vinay Chella: Guardrails
Chella, a senior engineering leader at Netflix and a committer since 2019, is excited about the new features in Cassandra 5.0 that provide more guardrails for developers, improve stability and enhance the operating experience. The introduction of guardrails in Cassandra 4.1 allowed for soft and hard limits on user actions, and Cassandra 5.0 adds several new guardrails to increase reliability, availability and user experience. These guardrails codify best practices and help avoid catastrophic mistakes, such as dropping production-critical keyspaces or losing data.
“These guardrails certainly help prevent lots of these ‘oops’ moments.”
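The soft-limit and hard-limit idea can be sketched in a few lines of Python. This is an illustration only: the `Threshold` class, the `GuardrailViolation` exception and the limit values are hypothetical and do not correspond to Cassandra’s actual configuration names. A value above the soft limit produces a warning, and a value above the hard limit is rejected.

```python
import warnings
from dataclasses import dataclass
from typing import Optional


class GuardrailViolation(Exception):
    """Raised when a request exceeds a hard limit."""


@dataclass
class Threshold:
    name: str
    warn_above: Optional[int] = None  # soft limit: allow, but warn
    fail_above: Optional[int] = None  # hard limit: reject the request

    def check(self, value: int) -> None:
        # Check the hard limit first so a rejected request does not also warn.
        if self.fail_above is not None and value > self.fail_above:
            raise GuardrailViolation(
                f"{self.name}: {value} exceeds hard limit {self.fail_above}"
            )
        if self.warn_above is not None and value > self.warn_above:
            warnings.warn(
                f"{self.name}: {value} exceeds soft limit {self.warn_above}"
            )


# Hypothetical guardrail on the number of columns in a table.
columns_per_table = Threshold("columns_per_table", warn_above=50, fail_above=100)

columns_per_table.check(10)  # within limits, nothing happens
try:
    columns_per_table.check(150)
except GuardrailViolation as exc:
    print(f"rejected: {exc}")
```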
Mick Semb Wever: Community
Semb Wever, a Cassandra committer since 2016 and a principal architect at DataStax, appreciates how Cassandra 5.0 embodies “real open source” by having multiple vendors, companies and employees behind its contributors. This creates a diverse development community with a rich set of features and applications, and emphasizes the importance of engineering hygiene, building QA and CI to improve trust and enable radical features. He says these principles and practices will lead to greater longevity, sustainability and modernization of the technology, and that they encourage diversity and collaboration in the community.
“It’s what’s enabling some of the radical features that are coming in 5.0 — stuff like Accord — that we can’t get over the finish line if we’re not all working together as a team.”
Jordan West: More Sleep!
West, a senior Netflix software engineer and a Cassandra committer since 2020, is excited about how improvements in Cassandra 5.0 will lead to better reliability and performance, which will result in more sleep for him as an on-call engineer. He highlights the new transactional metadata feature and the improved memtables that will allow faster writes. He also describes how the new virtual tables, diagnostics and metrics will provide more insight into Cassandra and help resolve incidents faster.
“I know with Cassandra 5.0 [that] when I go to bed, I’m less likely to get woken up — and when I do, I’m going to solve our problems faster and get back to bed faster.”
Ekaterina Dimitrova: Accord and ACID Transactions
A DataStax engineer who has been a committer since 2020, Dimitrova is eagerly anticipating the community’s implementation of the Accord protocol. This protocol will enable global consensus and allow ACID transactions to be carried out at scale, making developers more efficient without compromising on performance or scalability. Global consensus is crucial in things like bank transfers; concurrency guarantees ensure that only one process can make changes at a time. The new syntax being created for developers will include begin and commit transaction declarations, which allow all operations within the declaration to be fully ACID compliant.
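The bank-transfer example shows why the all-or-nothing guarantee matters. The Python sketch below is a single-process illustration and has nothing to do with how Accord reaches consensus across nodes: the `Ledger` class and its methods are hypothetical, and a plain lock stands in for the concurrency guarantee. It only demonstrates the property a transaction gives the developer: either both sides of the transfer happen, or neither does.

```python
import threading
from typing import Dict


class InsufficientFunds(Exception):
    """Raised when the source account cannot cover the transfer."""


class Ledger:
    def __init__(self, balances: Dict[str, int]) -> None:
        self._balances = dict(balances)
        self._lock = threading.Lock()  # only one transfer runs at a time

    def transfer(self, src: str, dst: str, amount: int) -> None:
        with self._lock:
            # Validate before changing anything, so a failed transfer
            # leaves every balance untouched.
            if self._balances[src] < amount:
                raise InsufficientFunds(f"{src} cannot cover {amount}")
            self._balances[src] -= amount
            self._balances[dst] += amount

    def balance(self, account: str) -> int:
        with self._lock:
            return self._balances[account]


ledger = Ledger({"alice": 100, "bob": 20})
ledger.transfer("alice", "bob", 30)
print(ledger.balance("alice"), ledger.balance("bob"))
```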
Lorina Poland: Unified Compaction Strategy
Poland, a DataStax tech lead who became a committer in 2021, likes the benefits of Cassandra 5.0’s Unified Compaction Strategy (UCS), which combines legacy compaction strategies like CT, size-tiered and leveled compaction. UCS is a significantly faster compaction strategy that has reduced space overhead and allows for parallelism. The strategy also has a scaling factor that can be tuned to specific workloads, whether they are read-heavy, write-heavy or both. There’s no need to know how the legacy strategies work, and there is zero overhead for migrating to UCS.
“If you need it to be write-heavy, you can tune it to that; if you need it to be read-heavy, you can tune to that; and if you just want something in between, it works well for whatever your workload is.”
Benjamin Lerer: Storage-Attached Indexing
Lerer became a committer eight years ago. The DataStax tech lead notes that the SSTable-attached secondary index (SASI) was added in 2016, but it did not receive enough investment and had to be marked as experimental in Cassandra 4.0 because it didn’t meet the desired standards. Storage-attached indexing (SAI) was built atop SASI and has its own set of innovations, including the ability to index multiple columns without scalability issues and optimizations for space usage and numeric range queries.
“SAI will enable a new set of query capabilities, without the drawbacks that secondary indexing or SASI had.”
Branimir Lambov: Pluggability
Lambov, a DataStax engineer who’s been a Cassandra committer since 2015, is excited about the local storage pluggability in Cassandra 5.0. The change centers around the memtable, which is a temporary storage area in the computer’s memory where data is stored before being written to more permanent storage. The goal of the new implementation is to make it easier to use different types of memtables and select the best one for each specific use case. One of the new implementations is based on a trie data structure, which provides a much more efficient way of storing data. It also enables memory to be used off the main Java heap, so storage operations create no garbage-collection work. These improvements can double the write throughput of the database. It will be exciting to see where the community takes this flexible storage interface next.
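For readers unfamiliar with tries, the Python sketch below shows the basic idea of a trie-backed write buffer. It is a toy: the `TrieMemtable` class and its methods are hypothetical, it lives on the ordinary Python heap, and it says nothing about the off-heap layout or byte encoding of Cassandra’s implementation. What it does show is that keys sharing a prefix share nodes, and that walking the trie returns keys in sorted order without a separate sort step.

```python
from typing import Any, Dict, Iterator, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.value: Any = None
        self.has_value = False


class TrieMemtable:
    def __init__(self) -> None:
        self._root = TrieNode()

    def put(self, key: str, value: Any) -> None:
        node = self._root
        for ch in key:
            # Keys with a common prefix reuse the same chain of nodes.
            node = node.children.setdefault(ch, TrieNode())
        node.value = value
        node.has_value = True

    def get(self, key: str) -> Optional[Any]:
        node = self._root
        for ch in key:
            nxt = node.children.get(ch)
            if nxt is None:
                return None
            node = nxt
        return node.value if node.has_value else None

    def items(self) -> Iterator[Tuple[str, Any]]:
        # Depth-first walk, visiting children in character order,
        # which yields keys in lexicographic order.
        stack = [("", self._root)]
        while stack:
            prefix, node = stack.pop()
            if node.has_value:
                yield prefix, node.value
            for ch in sorted(node.children, reverse=True):
                stack.append((prefix + ch, node.children[ch]))


table = TrieMemtable()
for key in ("user:2", "user:10", "user:1"):
    table.put(key, {"id": key})
print([key for key, _ in table.items()])
```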
In OSS, People Make All the Difference
Exploring a successful open-source project is a captivating journey, from both the human and technological viewpoints. While technology may be the initial focus of a software project, it’s the people involved who make it truly fascinating. Each person brings their unique emotions and desires to the table, which can result in positive or negative outcomes. In an open-source project, individuals’ desires to improve something are laid bare and open to criticism. However, it’s through the determination to work together and move forward that the true magic of the project occurs.
What features are you looking forward to in Cassandra 5.0? Personally, I’m excited for the developer improvements that will be game-changing, such as ACID transactions, new indexing schemes and new syntax like the NOT operator. As a Cassandra committer myself, I enjoy watching developers use these new features to create amazing things. If you haven’t checked out Cassandra in a while, now is a good time to do so. Join the rest of the user community over at Planet Cassandra and share your thoughts on what excites you about Cassandra 5.0.
About the Author:
Patrick McFadin
Patrick McFadin is the co-author of the O’Reilly book ‘Managing Cloud Native Data on Kubernetes.’ He currently works at DataStax in developer relations and as a contributor to the Apache Cassandra project. Patrick has worked as chief evangelist for Apache Cassandra and as a consultant for DataStax, where he had a great time building some of the largest deployments in production.