Change Data Capture (CDC)

Published in Hike Blog · 3 min read · Jan 24, 2023

Makes data available faster, more efficiently, and without sacrificing data accuracy.

By Siddharth Pal, DevOps Team, Hike

Insights from data are among the most important parts of any product, and for a gaming app it is even more important to pull data changes from the source database into the data warehouse in near real time.

Because CDC streams only the changes made at the source instead of re-reading whole tables, it is faster and more efficient than batch data ingestion. That makes it the go-to solution for data and BI teams who need to get data into the system and analyze it quickly.

An interesting use case led us to consider CDC for all our primary datasets in August ’21, and we went on to implement it. We have now been running the CDC pipeline for more than a year, through software upgrades and source data migrations, and I’m happy to share that experience in this blog. During this time, we have added a number of databases and switched to database sharding for our more critical business applications.

Why did we decide to adopt CDC?

Use cases such as tracking the status of a deposit or withdrawal, tracking the status of a game session, spotting anomalies, real-time campaigns, machine learning models, and operational analytics all need data in near real time, and they also need a record of every change to that data.
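To make "record every change" concrete, here is a minimal sketch of replaying change events into an in-memory copy of a table while keeping the full history. It assumes a simplified Debezium-style event with `op`, `before`, `after`, and `ts_ms` fields; the `apply_change` helper, the `id` key, and the sample deposit rows are hypothetical and only for illustration.

```python
def apply_change(table, history, event, key="id"):
    """Apply one simplified change event to an in-memory copy of a table."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        table[row[key]] = row
    elif op == "d":  # delete
        table.pop(event["before"][key], None)
    else:
        raise ValueError(f"unknown op: {op!r}")
    # Keep every change, not just the latest state of the row.
    history.append((event["ts_ms"], op, event.get("before"), event.get("after")))


# Hypothetical events for one deposit moving through its states.
events = [
    {"op": "c", "ts_ms": 1, "before": None,
     "after": {"id": 7, "status": "PENDING"}},
    {"op": "u", "ts_ms": 2, "before": {"id": 7, "status": "PENDING"},
     "after": {"id": 7, "status": "SUCCESS"}},
]

deposits, audit_log = {}, []
for event in events:
    apply_change(deposits, audit_log, event)
```

After the loop, `deposits` holds only the latest state of the row, while `audit_log` retains both transitions. A batch job that polls the table between the two writes would see only one of them.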

CDC Pipeline Setup

So, where do you start if you want to set up a CDC pipeline?

  1. Launch a Kafka cluster with at least three nodes.
  2. Download a stable Debezium connector that is compatible with your database version and install it on your broker servers or on dedicated nodes.
  3. Establish network connectivity between the provisioned infrastructure components.
  4. Create a read-only user in each source database (MongoDB and MySQL in our case).
  5. Register a new database connector for each source database.
  6. Check the status of the connectors.
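Steps 5 and 6 could look roughly like the sketch below. It is not the exact command we ran: it assumes the Debezium connectors run on Kafka Connect with its REST API at `http://localhost:8083`, and it uses property names as they appear in the Debezium 2.x documentation, so check them against the version you install. The helper names, hostnames, and credentials are hypothetical.

```python
import json
import urllib.request

CONNECT_URL = "http://localhost:8083"  # assumed Kafka Connect REST endpoint


def mysql_connector(name, host, user, password, server_id, kafka_brokers):
    """Build a registration payload for a Debezium MySQL source connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": host,
            "database.port": "3306",
            "database.user": user,  # the read-only user from step 4
            "database.password": password,
            "database.server.id": str(server_id),
            "topic.prefix": name,
            "schema.history.internal.kafka.bootstrap.servers": kafka_brokers,
            "schema.history.internal.kafka.topic": f"schema-history.{name}",
        },
    }


def mongodb_connector(name, connection_string):
    """Build a registration payload for a Debezium MongoDB source connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
            "mongodb.connection.string": connection_string,
            "topic.prefix": name,
        },
    }


def register_request(payload, base_url=CONNECT_URL):
    """Prepare (but do not send) the POST /connectors request for step 5."""
    return urllib.request.Request(
        f"{base_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_url(name, base_url=CONNECT_URL):
    """URL of the status endpoint for step 6."""
    return f"{base_url}/connectors/{name}/status"


def send(request_or_url):
    """Send a prepared request (or GET a URL) and return the decoded JSON body."""
    with urllib.request.urlopen(request_or_url, timeout=10) as response:
        return json.load(response)
```

With a running Kafka Connect worker, `send(register_request(mysql_connector(...)))` registers the connector and `send(status_url(name))` returns its current state. Building the request separately from sending it keeps the payload easy to review before it reaches the cluster.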

What have we achieved?

Moving from traditional ETL to streaming solved two big problems for us:

  1. Periodic spikes in load: Large batch queries hurt latency and ultimately the user experience, which is why we had to set up an additional secondary server just to run scheduled batch jobs.
  2. Delayed business decisions: With polling, decisions based on the data could only be as timely as our polling frequency allowed.

How do you monitor the health of the CDC pipeline?

Have your monitoring tool run the connector status curl call from step 6 periodically, so that it can track the health of each connector.
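As a rough sketch of what such a check might do, the snippet below inspects a connector status document and lists anything that is not running. It assumes the status JSON has the shape returned by the Kafka Connect REST API (`connector.state` plus a `tasks` list with per-task `state`); the function names, the `localhost:8083` endpoint, and the connector name `orders` are hypothetical.

```python
import json
import urllib.request


def unhealthy_parts(status):
    """Return human-readable problems found in one connector status document."""
    problems = []
    name = status.get("name", "<unknown>")
    connector_state = status.get("connector", {}).get("state")
    if connector_state != "RUNNING":
        problems.append(f"{name}: connector is {connector_state}")
    tasks = status.get("tasks", [])
    if not tasks:
        problems.append(f"{name}: no tasks assigned")
    for task in tasks:
        if task.get("state") != "RUNNING":
            problems.append(f"{name}: task {task.get('id')} is {task.get('state')}")
    return problems


def fetch_status(name, base_url="http://localhost:8083"):
    """GET the status document for one connector (assumed endpoint)."""
    url = f"{base_url}/connectors/{name}/status"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


# A sample status document with one failed task, for illustration only.
sample = {
    "name": "orders",
    "connector": {"state": "RUNNING"},
    "tasks": [{"id": 0, "state": "FAILED"}],
}
print(unhealthy_parts(sample))
```

Checking tasks as well as the connector matters because a connector can report `RUNNING` while one of its tasks has failed, in which case changes stop flowing even though the top-level state looks fine.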

Next Steps

CDC proved to be an effective platform for us, and we knew we could do more with it.

As a next step, we are exploring the following:

  1. Using CDC as a way to back up data and avoid loss.
  2. Feeding the change stream into BI tools for further analysis, or into an AI/machine learning pipeline as input.

Sounds like something you might want to be a part of? Check out our open roles and apply here → work.hike.in 🚀
