Time-Based Data Aggregation with Cassandra

When collecting time-based data, for example samples streamed from one or more resource monitors (CPU, disk usage, etc.), data volumes can quickly become daunting and unwieldy, even for a modern relational database.

In such cases, a NoSQL database may be a good fit. Modeling the data is not as simple as inserting each new time-based sample into a new row. Rather, the recommended model aggregates the data horizontally: each new sample is added as a new column whose name is the collection timestamp and whose value is the measured metric.
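To make the "one row, many timestamped columns" idea concrete, here is a minimal in-memory sketch in plain Java. It is not a Cassandra client and makes no driver calls. The class name WideRowSketch, the method names, and the row keys such as "host1.cpu" are my own assumptions for illustration. A sorted map stands in for a row whose columns are kept ordered by name.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** In-memory stand-in for a wide-row layout (hypothetical, not a Cassandra API). */
final class WideRowSketch {
    // row key -> (column name = timestamp in ms -> column value = measured metric)
    private final Map<String, NavigableMap<Long, Double>> rows = new HashMap<>();

    /** Adds a sample as a new column in the row, instead of creating a new row. */
    void addSample(String rowKey, long timestampMillis, double value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(timestampMillis, value);
    }

    /** Returns the columns of one row whose names fall in [fromMillis, toMillis]. */
    NavigableMap<Long, Double> slice(String rowKey, long fromMillis, long toMillis) {
        NavigableMap<Long, Double> row = rows.get(rowKey);
        if (row == null) {
            return Collections.emptyNavigableMap();
        }
        return row.subMap(fromMillis, true, toMillis, true);
    }

    public static void main(String[] args) {
        WideRowSketch store = new WideRowSketch();
        // Samples may arrive out of order; the sorted map keeps columns ordered by time.
        store.addSample("host1.cpu", 1000L, 0.42);
        store.addSample("host1.cpu", 3000L, 0.57);
        store.addSample("host1.cpu", 2000L, 0.49);
        store.addSample("host1.disk", 1000L, 71.0);
        System.out.println(store.slice("host1.cpu", 1500L, 3000L)); // prints {2000=0.49, 3000=0.57}
    }
}
```

Because the columns of a row are sorted by name, reading a time range becomes a contiguous slice of a single row rather than a scan over many rows, which is the main appeal of this layout.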

Below is a summary of the resources I used to develop a solution for this type of problem with Cassandra. For some reason they were not easy to gather, so I hope this summary helps anyone implementing a solution to a similar problem.

 

  1. Basic Time Series with Cassandra:
    http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/
  2. Modeling Time Series Data on top of Cassandra:
    http://engineering.rockmelt.com/post/17229017779/modeling-time-series-data-on-top-of-cassandra
  3. Cassandra Developer Center :: Advanced Time Series with Cassandra:
    http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
  4. Also, it is probably a good idea to read about Cassandra's Time UUIDs, which you will need to use as column names, explained in:
    http://wiki.apache.org/cassandra/FAQ#working_with_timeuuid_in_java
    http://stackoverflow.com/questions/2614195/in-cassandra-terminology-what-is-timeuuid
  5. Lastly, if you are using Hector as the Cassandra client (recommended), it probably makes more sense to use its me.prettyprint.cassandra.utils.TimeUUIDUtils class than to build Time UUIDs by hand.
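For readers who want to see what is inside a Time UUID before reaching for a helper library, the following sketch uses only java.util.UUID. It is not Hector's TimeUUIDUtils and does not reproduce its API. The class TimeUuidSketch, its methods, and the clockSeqAndNode parameter are assumptions made for illustration. It packs a Unix timestamp into a version 1 UUID and reads it back.

```java
import java.util.UUID;

/** Illustrative version 1 (time-based) UUID helper; hypothetical, not a library API. */
final class TimeUuidSketch {
    // Number of 100-ns intervals between 1582-10-15 (UUID epoch) and 1970-01-01 (Unix epoch).
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    /** Builds a version 1 UUID whose timestamp field encodes the given Unix time in ms. */
    static UUID fromMillis(long unixMillis, long clockSeqAndNode) {
        long ts = unixMillis * 10_000L + UUID_EPOCH_OFFSET;   // 100-ns units since UUID epoch
        long msb = ((ts & 0xFFFFFFFFL) << 32)                 // time_low
                 | (((ts >>> 32) & 0xFFFFL) << 16)            // time_mid
                 | 0x1000L                                    // version 1
                 | ((ts >>> 48) & 0x0FFFL);                   // time_hi
        // Top two bits "10" mark the standard variant; the rest is caller-supplied here.
        long lsb = 0x8000000000000000L | (clockSeqAndNode & 0x3FFFFFFFFFFFFFFFL);
        return new UUID(msb, lsb);
    }

    /** Recovers the Unix time in ms from a version 1 UUID. */
    static long toMillis(UUID uuid) {
        if (uuid.version() != 1) {
            throw new IllegalArgumentException("not a time-based (version 1) UUID");
        }
        return (uuid.timestamp() - UUID_EPOCH_OFFSET) / 10_000L;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        UUID id = fromMillis(now, 42L);
        System.out.println(id + " -> " + toMillis(id) + " (expected " + now + ")");
    }
}
```

Note that this sketch yields the same UUID for two samples taken in the same millisecond with the same clockSeqAndNode value, so a second write would overwrite the first column. A real generator has to guarantee uniqueness, which is a good reason to prefer a tested utility in practice.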

 

Hopefully, a more concrete code example will follow soon...

 

Developer