Open Source Sunday: Updating Cassandra Copy Tool
Open Source Sunday is a series covering my work on open source software. Over the years I’ve developed software for some pretty cool and sometimes high-profile companies, but rarely for the open source community. I’d like to use this series to share valuable tools I’ve created or contributed to.
A little background…
The cassandra-copy-tool was developed back in 2016 to give me an easy way to copy data between two Cassandra tables. It came in so handy during my time working with Cassandra that I decided to open source it on GitHub.
The tool provides the following features:
Ability to copy from one table to another, regardless of their locations (cluster/schema)
Ability to copy multiple tables in one run
Ability to ignore columns per source table
Data Throttling
Source and sink Cassandra configuration (contact points, port, schema name, username, password)
Easy to execute from the command line
A well-documented property file format for describing the transformation parameters
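To give a feel for what the data throttling feature means in practice, here is a minimal sketch in Python. It is not the tool's actual code: the `Throttle` class, the `copy_rows` function, and the `max_per_second` parameter are all hypothetical names I made up for illustration, and the real tool may throttle in a completely different way. The idea is simply to space out writes so the sink cluster is never hit faster than a configured rate.

```python
import time


class Throttle:
    """Hypothetical helper that limits work to roughly max_per_second operations."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self._interval = 1.0 / max_per_second  # minimum gap between operations
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next operation may start

    def acquire(self):
        """Block until the next operation is allowed to start."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            # Assume the sleep lasted at least as long as requested.
            now = self._next_allowed
        self._next_allowed = now + self._interval


def copy_rows(rows, write_row, throttle):
    """Write each row to the sink, pausing as needed to respect the throttle."""
    count = 0
    for row in rows:
        throttle.acquire()
        write_row(row)
        count += 1
    return count
```

A caller would build one throttle per copy job, for example `copy_rows(source_rows, sink_writer, Throttle(max_per_second=500))`, where `source_rows` and `sink_writer` stand in for whatever reads from the source table and writes to the sink. The clock and sleep functions are injectable only so the behavior can be checked without real waiting.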
2 years later…
It’s been two years since I put the tool out into the world. Recently, I’ve realized that open source would be a great way to scratch an itch that my full-time job sometimes can’t. With a new wind in my sails, I decided to take a look at my old project. I found that there had been a small amount of interest: on GitHub there were 5 forks, 1 reported bug, and 1 pull request. I decided to jump back in and manage this project properly. After a deep dive I found two major issues.
Cassandra 3 and a bugfix
The first issue was described in a GitHub issue by another developer: the tool did not work with Cassandra 3.x. This was a major problem, since Cassandra 3 had been out for a few years and was likely to be the preferred major version of the product. I updated the internal dependencies to use the Cassandra 3 driver, which is also backwards compatible, meaning Cassandra 2 is still supported!
The second issue was a bug I ran into while testing the Cassandra driver upgrade. I had left a simple bug in the application where the parsing of the copy.tables property was broken. This bug had no effect on single-table copies, but it effectively meant that multi-table copies were broken. I’ve since fixed the relevant parsing logic so that the source-sink pairs are always correct, and the tool now fails gracefully if you provide an invalid copy.tables property.
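For readers curious what parsing source-sink pairs and failing gracefully might look like, here is a rough sketch in Python. It is not the tool's code. I'm assuming, purely for illustration, that a copy.tables value looks like "source1=>sink1,source2=>sink2"; the `parse_copy_tables` function name and that separator format are my own stand-ins, so check the repository's documentation for the real syntax.

```python
def parse_copy_tables(value):
    """Parse a copy.tables value into a list of (source, sink) pairs.

    Assumed, hypothetical format: "source1=>sink1,source2=>sink2".
    Raises ValueError with a readable message on malformed input.
    """
    if value is None or not value.strip():
        raise ValueError("copy.tables must not be empty")

    pairs = []
    for entry in value.split(","):
        parts = [part.strip() for part in entry.split("=>")]
        # Each entry needs exactly one non-empty source and one non-empty sink.
        if len(parts) != 2 or not parts[0] or not parts[1]:
            raise ValueError(f"invalid copy.tables entry: {entry.strip()!r}")
        pairs.append((parts[0], parts[1]))
    return pairs
```

With that assumed format, `parse_copy_tables("users=>users_copy, events=>events_copy")` yields two pairs, while a value such as "users=>users_copy,events" is rejected up front with a clear error instead of silently producing a wrong pairing partway through a copy.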
Conclusion
I had a ton of fun getting into my code, cleaning up where it made sense, and ensuring that this tool can be useful going forward. Open source is something I’ve relied on for years, but historically I haven’t contributed much. I’ve learned that when you open source your work, you need to give it the same attention you would give any other software project. In the future I plan to check in on my projects every month to ensure my contributions are providing maximum value.
Here’s my GitHub repository. As always, contributions are welcome.
GitHub: https://github.com/wildengineer/cassandra-data-copy-tool