At our organization, Redis plays a crucial role in several capacities: as shared memory, for attack detection based on concentrated events, and more. To manage these diverse uses efficiently and to support dynamic scaling of our Redis clusters, we developed a blue-green deployment process using GCP Managed Instance Groups (MIGs). The process creates a new Redis cluster on a new MIG of the desired size alongside the current cluster's MIG, switches traffic to the new cluster, and then deletes the previous cluster's MIG. This approach served all our Redis clusters well.
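At a high level, that blue-green flow maps onto a few gcloud operations. The following Python sketch is illustrative only: the group names, template, and zone are hypothetical placeholders, and the traffic switch is an application-level step rather than a gcloud command.

import subprocess

def run(cmd):
    # Thin wrapper; real tooling would add retries and logging around this.
    subprocess.run(cmd, shell=True, check=True)

# 1. Create the new ("green") cluster MIG alongside the current one.
run("gcloud compute instance-groups managed create redis-green "
    "--template=redis-template --size=12 --zone=us-central1-a")

# 2. Switch application traffic to the new cluster (application-specific).

# 3. Delete the previous ("blue") cluster MIG.
run("gcloud compute instance-groups managed delete redis-blue "
    "--zone=us-central1-a --quiet")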
However, a few months ago, we received a unique request from one of our development team leaders: increase the size of a Redis cluster without losing its data. Until then, Redis had mainly been used for attack detection based on concentrated events, which involves small data windows (e.g., detecting a concentrated attack pattern in which an event occurs multiple times per second). Our account protection product, in contrast, focuses on detecting human-driven fraud, which requires analyzing activity over extended periods. That information is critical and must be preserved without loss.
This requirement led us to realize the need to support a different type of Redis cluster that we hadn't managed before, which we termed "non-volatile" clusters. Recognizing the critical nature of the data in these clusters, we also concluded that a robust backup solution was necessary, which I will discuss in a subsequent blog post.
As always, when faced with a new requirement, we began with in-depth research. The Redis cluster in question contains over 200 million keys, is updated frequently, and runs on 180 instances of the custom-1-12288-ext machine type. We were looking for a solution that would support seamless scaling without data loss and enable quick backup and restore. We explored various options, including managed Redis services and alternatives such as Redis Cloud, DragonflyDB, and others, but each had significant drawbacks: security concerns, latency issues, or prohibitive costs for the managed services.
When no off-the-shelf product met our needs, we returned to the drawing board. Our first approach relied on an application feature that allowed writing data to two different clusters; together with the development team that requested the new capability, we built this mirroring feature specifically for this use case. We created a process that backed up the Redis cluster's slots to a bucket, created a new cluster from scratch, and restored the data into it. The application was then deployed with mirroring enabled so that it wrote to both clusters. We monitored the data gap between the clusters, which resulted from ongoing changes and the time lag between the backup and the start of mirroring. After reaching approximately 95% data similarity (a process that took around 12 hours), we performed a cut-over, deploying the application again and decommissioning the old cluster.
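Conceptually, the mirroring amounted to a dual write. The sketch below is a minimal illustration, assuming redis-py cluster clients and hypothetical endpoints; the real feature lived inside the application.

import redis

# Hypothetical endpoints for the two clusters.
current = redis.RedisCluster(host="redis-current.internal", port=6379)
new = redis.RedisCluster(host="redis-new.internal", port=6379)

def mirrored_set(key, value):
    current.set(key, value)    # the current cluster remains the source of truth
    try:
        new.set(key, value)    # best-effort duplicate write to the new cluster
    except redis.RedisError:
        pass                   # a mirror failure must not break the write path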
This approach involved several manual steps, required running two clusters concurrently for 12 hours (increasing costs), and still resulted in about 5% data loss. It was clearly neither atomic nor robust, and it compromised data integrity, so a different solution was needed.
To address these challenges, we developed an online scaling tool for Redis that seamlessly increases or decreases the cluster size without data loss, ensuring continuous operation. This redis-cluster-scale tool is a Python application designed to scale Redis clusters out or in. Instead of creating a new managed instance group, as the blue-green deployment does when changing cluster size (which results in data loss), this process keeps the same managed instance group and adjusts its capacity, so no data is lost. The process is also fully automated and requires no manual intervention.
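In GCP terms, changing capacity in place means resizing the existing MIG rather than replacing it. A minimal sketch of that step (group name, size, and zone are hypothetical):

import subprocess

# Grow (or shrink) the existing MIG in place: no new group is created, so the
# running Redis nodes and their data stay untouched while instances are added.
subprocess.run(
    "gcloud compute instance-groups managed resize redis-cluster-mig "
    "--size=200 --zone=us-central1-a",
    shell=True, check=True)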
Since the scale-out and scale-in processes are compute-heavy, and we wanted to make them as fast as possible, we decided to use a dynamic agent in our Google Cloud environment with a high CPU instance type (n1-highcpu-96).
A dynamic agent is a temporary, on-demand worker node that can be provisioned and decommissioned as needed. We used it to ensure that the compute-intensive tasks of scaling the Redis cluster could be performed quickly and efficiently by leveraging the high CPU resources of the n1-highcpu-96 instance type.
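As an illustration, provisioning such an agent can be as simple as creating a VM and deleting it when the run finishes. The instance name and zone below are placeholders, and in practice the CI system usually manages this lifecycle:

import subprocess

def run(cmd):
    subprocess.run(cmd, shell=True, check=True)

# Spin up a temporary high-CPU worker for the scaling run...
run("gcloud compute instances create scaling-agent-1 "
    "--machine-type=n1-highcpu-96 --zone=us-central1-a")

# ...and tear it down once the scale-out or scale-in completes.
run("gcloud compute instances delete scaling-agent-1 "
    "--zone=us-central1-a --quiet")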
import logging
from retrying import retry

@retry(stop_max_attempt_number=5, wait_fixed=5000)
def rebalance_slots(self, cluster_node):
    # Rebalance slots across all masters, including the new, empty ones.
    command = ('redis-cli --cluster rebalance {} {} --cluster-use-empty-masters'
               .format(cluster_node, self.cluster_port))
    # run_shell_command_until_complete is an internal helper (not shown here).
    exit_code = run_shell_command_until_complete(command, "No rebalancing needed")
    logging.info("Rebalance command exited with code: {}".format(exit_code))
This code handles the scale-out process: it rebalances slots using the redis-cli --cluster rebalance command with the --cluster-use-empty-masters option, which makes the rebalance assign slots (and their keys) to the newly added, still-empty master nodes.
@retry(stop_max_attempt_number=5, wait_fixed=5000)
def rebalance_slots_using_weights(self, cluster_node, nodes_to_reduce, nodes_to_keep):
    # Weight 0 drains the masters being removed; weight 1 keeps the rest.
    nodes_to_reduce_with_weights = [node + "=0" for node in nodes_to_reduce]
    nodes_to_keep_with_weights = [node + "=1" for node in nodes_to_keep]
    command = ('redis-cli --cluster rebalance {} {} --cluster-weight '
               .format(cluster_node, self.cluster_port)
               + ' '.join(nodes_to_keep_with_weights) + ' '
               + ' '.join(nodes_to_reduce_with_weights))
    logging.info("Rebalance with weights command: {}".format(command))
    # Rerun the rebalance until it completes cleanly; a long slot migration
    # may be interrupted, and rerunning resumes moving the remaining slots.
    while True:
        exit_code = run_shell_command_until_complete(command, "No rebalancing needed")
        if exit_code == 0:
            logging.info("Rebalance command ended successfully with code: {}".format(exit_code))
            break
        logging.info("Rebalance command exited with code: {}. Continue rebalancing...".format(exit_code))
This code ensures that during a scale-in process, keys from nodes that need to be removed are redistributed to the remaining nodes using the redis-cli --cluster rebalance command with appropriate weights. Nodes being removed are given a weight of 0, while nodes that are staying are given a weight of 1, facilitating a smooth redistribution of keys.
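For example, from within the tool, a call like the following (node IDs shortened and entirely hypothetical) would drain two masters into the remaining ones:

# Hypothetical invocation; real cluster node IDs are 40-character hex strings.
self.rebalance_slots_using_weights(
    cluster_node="10.0.0.1",
    nodes_to_reduce=["3a9f...", "7c21..."],  # masters being drained (weight 0)
    nodes_to_keep=["b852...", "e4d0..."])    # masters that remain (weight 1)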
Our journey from a simple blue-green deployment process to supporting persistent-data Redis clusters, and ultimately to building the redis-cluster-scale tool, highlights the importance of adaptability and innovation in managing complex infrastructure requirements. The tool lets us scale Redis clusters out and in seamlessly while preserving data integrity and continuous operation, so we can respond quickly to changing customer demands and business conditions, reduce operational costs, and increase efficiency.
This graph from our production environment demonstrates how scaling down the Redis cluster size lowers overall cluster costs, resulting in a 38% cost reduction.
Stay tuned for the next blog post, where we will delve into our backup solution for these critical Redis clusters.