Graphite is an end-to-end solution for storing, aggregating, and analyzing time-series data. There are many other tools out there; the familiar ones include Cacti (http://www.cacti.net/) and RRDtool (http://oss.oetiker.ch/rrdtool/).
Graphite takes the solution to a new level architecturally. It is not only a database solution but a full application solution, including a web interface, security, clustering, and more. For a more in-depth overview of graphite see https://graphite.readthedocs.org/en/latest/overview.html.
So what type of information do you want to store in this database?
Answer: anything. You can use graphite to save metrics on anything. Depending on your application, you can monitor your CPU, disk… You can send ticks from your app to track process progress, and then monitor the rate in graphite.
- You can even send Windows Performance counters to graphite: http://www.hodgkins.net.au/mswindows/using-powershell-to-send-metrics-graphite/
- Want to monitor your Storm server? No problem: http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/
- Using logstash to analyze your logs? Send those to graphite too: http://logstash.net/docs/1.2.0/outputs/graphite
- Using sensu to monitor your farm? http://www.joemiller.me/2013/12/07/sensu-and-graphite-part-2/

As you can see, it is all a matter of your imagination.
So what is graphite, and what isn't it? Graphite does not do the actual collection of the data (if you need tools for that, see https://graphite.readthedocs.org/en/latest/tools.html). Graphite supplies the option to store data and to query it. Since the stored data can grow very large, graphite has a built-in option for retention: per metric, you decide what the resolution is and for how long you keep it. So, for example, you can define a rule that saves the data at 10-second resolution for 14 days (for more info see http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf).
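Retention rules like this live in storage-schemas.conf; here is a minimal sketch of a 10-second/14-day rule (the section name and pattern are illustrative):

```ini
# storage-schemas.conf — each section matches metric names with a regex;
# the first matching section wins.
[default]
pattern = .*
retentions = 10s:14d
```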
This way you also don't have to worry about deleting old data from the database, as you do in most time-based solutions.
Once you have defined the retention of your data, you can define an aggregation function for it. This way you can keep your raw data for up to a month, and then keep a daily average for the next year. The basic aggregation functions supported are: average, sum, min, max, and last.
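The downsampling behavior is configured in storage-aggregation.conf; a sketch, with an illustrative section name and pattern:

```ini
# storage-aggregation.conf — how raw points are rolled up into coarser
# archives.  xFilesFactor is the fraction of points that must be non-null
# for the aggregate point to be written at all.
[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
```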
Graphite can also combine multiple metrics into a new one via aggregation rules, which saves processing time later on (requests for the combined data will not need to apply the aggregation function at retrieval time).
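Combining metrics this way is the job of the carbon-aggregator daemon, configured in aggregation-rules.conf. A sketch of one rule (the metric names are illustrative):

```
# output_template (frequency in seconds) = method input_pattern
<env>.applications.<app>.all.requests (60) = sum <env>.applications.<app>.*.requests
```

Every metric matching the input pattern is summed each 60 seconds into the new `all.requests` metric.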
The basic components of the graphite server are:
- carbon – daemons that listen for time-series data over the network using multiple protocols.
- whisper – a database library for storing time-series data.
- graphite-web – an application that renders graphs using a simple URL API.
The carbon daemon supports two main protocols: plaintext and pickle.
Plaintext is a simple TCP socket that receives data in the following format:
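The documented plaintext line format is `<metric path> <metric value> <metric timestamp>`, one metric per line. A minimal Python sender sketch (the host and port are assumptions; 2003 is carbon's default plaintext port):

```python
import socket
import time

# Assumed carbon endpoint — adjust to your deployment.
CARBON_HOST, CARBON_PORT = "localhost", 2003

def format_plaintext(path, value, timestamp):
    """Build one line in carbon's plaintext format:
    '<metric path> <value> <unix timestamp>\n'."""
    return f"{path} {value} {int(timestamp)}\n"

def send_plaintext(path, value, timestamp=None):
    """Open a TCP connection to carbon and send a single metric."""
    if timestamp is None:
        timestamp = time.time()
    with socket.create_connection((CARBON_HOST, CARBON_PORT)) as sock:
        sock.sendall(format_plaintext(path, value, timestamp).encode("ascii"))
```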
Pickle is a Python serialization format; carbon's pickle receiver accepts batches of metrics encoded in the following form:
This format allows for inserting many timestamps of the same metric in an efficient way.
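The pickle payload is a list of (path, (timestamp, value)) tuples, prefixed with a 4-byte big-endian length header. A sketch of the encoding (the metric names are made up):

```python
import pickle
import struct

def pack_pickle(metrics):
    """Encode metrics for carbon's pickle receiver.

    metrics: list of (path, (timestamp, value)) tuples.
    The wire format is a 4-byte big-endian payload length, followed by
    the pickled list itself.
    """
    payload = pickle.dumps(metrics, protocol=2)
    header = struct.pack("!L", len(payload))
    return header + payload

# Example: two points of the same metric sent in one batch.
batch = pack_pickle([
    ("servers.web1.cpu", (1000, 42.5)),
    ("servers.web1.cpu", (1010, 43.0)),
])
```

The resulting bytes can then be written to carbon's pickle port over a plain TCP socket.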
Although not documented, the plaintext protocol also supports sending multiple metrics in the same TCP packet, separated by newlines.
Implementations of these protocols can be found on the internet for many languages.
Whisper is not an actual database but a library optimized for writing time-based files. Each metric is written to its own file, and each file has a fixed size based on its retention rule. This makes writing efficient (the location of each data point in the file, based on its timestamp, is known in advance). The file is fully allocated when the first metric is sent for it (a utility to help calculate the file size based on retention can be found at: https://gist.github.com/jjmaestro/5774063).
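The size can be estimated from the whisper file layout (a 16-byte file header, a 12-byte header per archive, and 12 bytes per data point); a sketch along the lines of the gist above:

```python
def whisper_file_size(archives):
    """Estimate a whisper file's size in bytes.

    archives: list of (seconds_per_point, number_of_points) tuples,
    one per retention archive.  Whisper stores a 16-byte file header,
    a 12-byte header per archive, and 12 bytes per data point
    (4-byte timestamp + 8-byte double value).
    """
    headers = 16 + 12 * len(archives)
    points = sum(count for _, count in archives)
    return headers + 12 * points

# A 10s:14d retention holds 14 * 24 * 3600 / 10 = 120960 points.
size = whisper_file_size([(10, 120960)])
```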
The folder structure is very convenient: if your metric is a.b.c, you will have a file named “c.wsp” inside a folder “b” inside a folder “a”. If, for whatever reason, you wish to remove the metric data, you just delete the file.
Since the whole architecture of graphite is built like Lego blocks, any part can be replaced. So if you want to implement your own database library, you can (see http://graphite.readthedocs.org/en/latest/storage-backends.html).
For an example of this (and an in-depth article on whisper) see http://www.inmobi.com/blog/2014/01/24/extending-graphites-mileage.
Since graphite is designed for high-rate writing, IO is obviously the bottleneck. To solve this, graphite adds the carbon-cache: all writes and reads go through it. The cache persists metrics to disk after a configurable interval, and it holds a queue per whisper file so that writes can be batched into a single block.
In the carbon.conf file you can configure multiple options to fine-tune graphite's performance. An important entry limits the number of updates per second to the disk: fewer writes mean better performance, but at the risk of losing data in a crash.
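That throttle is the MAX_UPDATES_PER_SECOND setting in the cache section of carbon.conf (the value shown is illustrative):

```ini
[cache]
# Cap whisper writes per second; lower values trade durability for less IO.
MAX_UPDATES_PER_SECOND = 500
```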
For fine tuning see the following article: http://mike-kirk.blogspot.co.il/2013_12_01_archive.html.
Since, in this architecture, each metric has its own life cycle, we can store metrics on different machines, or, for performance, run more than one cache (see the section on performance boost and high availability).
Graphite uses a Python Django web application with a REST API that can be queried to generate graphs as images, or to return raw data in various formats (csv, json). The main user interface can be used as a work area for composing metric-retrieval URLs.
The web API can read from either the whisper files or the carbon-cache, so it can access data that has not yet been persisted.
It offers both a GUI dashboard and a REST interface for retrieving the data.
Getting data from graphite is as simple as requesting a URL:
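For example, a render request of the following shape returns the last hour of a metric as JSON (the host and metric name are made up):

```
http://graphite.example.com/render?target=servers.web1.cpu&from=-1h&format=json
```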
There are, of course, many more options: retrieving multiple metrics with wildcards, defining the time period, choosing the reply format (json, png, csv, raw), applying functions to metrics before retrieval, and more. For details see the documentation at http://graphite.readthedocs.org/en/latest/render_api.html.
If you want to enhance your dashboards, have a look at this open source graph editor: http://grafana.org/.
To boost the performance of graphite, it is recommended to create one carbon-cache per CPU core, so the machine can handle more metrics at the same time. You will need to configure a port per carbon-cache (actually two: one for plaintext and one for pickle). This is a problem, since our clients do not want to be aware of this layer of the architecture. To solve it, graphite uses the carbon-relay: the clients see only the carbon-relay, and the relay distributes the metrics to the different carbon-caches.
And for the carbon-relay, the configuration looks like the following:
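A sketch of the relay section in carbon.conf (the ports, instance names, and destinations are illustrative; each destination is one carbon-cache instance):

```ini
[relay]
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = consistent-hashing
# host:port:instance — one entry per carbon-cache the relay feeds.
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b
```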
The first debugging step: once you send a new metric to graphite, check whether its whisper file was created.
Next step: in carbon.conf you have flags that control what gets logged:
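Two useful logging flags in carbon.conf are LOG_UPDATES and LOG_CACHE_HITS (the values shown are illustrative; enable them only while troubleshooting, as they are verbose):

```ini
# Log every whisper write and every cache hit to the log files below.
LOG_UPDATES = True
LOG_CACHE_HITS = True
```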
To view the logs, go to ../storage/log. There you should find three log folders: carbon-cache, carbon-relay, and webapp. Under each one are more folders, one per application instance.
For example, to debug the cache, go to the folder /storage/log/carbon-cache/carbon-cache-1. To debug an application sending metrics to graphite, use listener.log: any connection failures should appear in this file, as will errors for invalid formats sent to graphite.
If you need to use graphite for multiple customers, you can easily do so by adding a prefix to the metric name. Just remember that this is a solution at the application layer, not in graphite itself: if you give clients a direct connection to graphite, you cannot restrict the data per client.
Graphite does have a simple mechanism for saving basic events. The basic structure of an event is: when, what, data, tags. There is a dedicated GUI for viewing events, and you can also query them via the REST API. Events are not the focus of graphite and therefore lack many features expected from an events system, so if you need anything more than a simple event you should look for a more robust system (like elasticsearch). For more information see:
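As a sketch, events can be created and queried over HTTP via graphite-web's /events/ and /events/get_data endpoints (the host and field values are made up):

```
# create an event
POST http://graphite.example.com/events/
{"what": "deploy", "tags": "release", "data": "v1.2"}

# list recent events as JSON
GET http://graphite.example.com/events/get_data?tags=release&from=-1d
```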
If your servers have access to the internet and you do not want the hassle of setting up and maintaining graphite, you can always go the hosted route.