60 Seconds Architecture – Graphite
Contents
Overview
Data
Graphite Components
Carbon
Whisper
Carbon-Cache
Carbon-Relay
Web API
Performance boost
Clustering Graphite
High Availability
Open Issues with Graphite
Ramp Up
Debugging
Eat your own dog food
Debugging techniques
Multi-tenant
Events
Hosting Graphite

Overview

Graphite is an end-to-end solution for storing, analyzing and aggregating time-series data. There are many other tools out there; familiar ones include Cacti (http://www.cacti.net/) and RRDtool (http://oss.oetiker.ch/rrdtool/).

Graphite takes the solution to a new architectural level: it is not only a database but a full application solution, including a web interface, security, clustering and more. For a more in-depth overview of graphite see https://graphite.readthedocs.org/en/latest/overview.html.

So what type of information do you want to store in this database?

Answer: anything. You can use graphite to save metrics on anything. Depending on your application, you can monitor your CPU, disk… You can send ticks from your app to report process progress, and then monitor the rate in graphite.

 

Data

So what is, and what isn't, graphite? Graphite does not do the actual collection of the data (if you need tools for this, see https://graphite.readthedocs.org/en/latest/tools.html). Graphite supplies the option to store data and to query it. Since the stored data can become very large, graphite has a built-in option for retention: per metric, you decide what the resolution is and for how long to keep it. So, for example, you can define

retentions = 10s:14d

This stores one data point every 10 seconds and keeps it for 14 days (for more info see http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf).

This way you also don't have to worry about deleting old data from the database, as is the case in most time-based solutions.
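Retention rules live in storage-schemas.conf, where a regex pattern selects which metrics a rule applies to. A minimal sketch (the section name and pattern here are illustrative, not part of a default install):

[app_metrics]
pattern = ^app\.
retentions = 10s:14d,1m:90d,1h:1y

Each extra archive in the list (1m:90d, 1h:1y) keeps the same data at a coarser resolution for a longer period.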

Once you have defined the retention of your data, you can define an aggregation function for it. This way you can keep your raw data for up to a month, but then keep only a daily average for the next year. The basic aggregation functions supported are: average, sum, min, max, and last.
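How data points are rolled up from a fine archive into a coarser one is configured per metric in storage-aggregation.conf. A sketch, assuming counter-style metrics whose names end in .count:

[counters]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum

xFilesFactor is the fraction of data points that must be non-null for the roll-up to be written; aggregationMethod is one of the functions listed above.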

Graphite also supports combining multiple metrics into a new one via an aggregation definition, which saves processing time later on: the request for the data will not need to apply the aggregation function at retrieval time. A sketch of such a rule follows.
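These rules are handled by the carbon-aggregator daemon and live in aggregation-rules.conf, in the form output_template (frequency) = method input_pattern. A sketch, assuming per-server metrics named app.<server>.requests:

app.all.requests (60) = sum app.*.requests

Every 60 seconds this writes a single new metric, app.all.requests, holding the sum over all servers.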

Graphite Components

The basic components of the graphite server are:

  •  carbon: daemons that listen for time-series data over the network using multiple protocols
  •  whisper: a database library for storing time-series data
  •  graphite-web: an application that renders graphs using a simple URL API

Carbon

The carbon daemons support two main protocols: plaintext and pickle.

Plaintext is a simple TCP socket that receives data in the format of:

<metric path> <metric value> <metric timestamp>.
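A minimal Python sketch of sending one data point over the plaintext protocol (the host name and metric are illustrative):

import socket
import time

# default plaintext port is 2003
sock = socket.create_connection(("graphite.example.com", 2003))
line = "app.numUsers 42 %d\n" % int(time.time())  # <path> <value> <timestamp>
sock.sendall(line.encode("ascii"))
sock.close()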

 

Pickle is Python's object serialization format; the pickle receiver accepts a pickled list of tuples of the following form:

[(path, (timestamp, value)), ...]


This format allows inserting many data points for the same metric in an efficient way; a sketch follows.
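A Python sketch of the pickle protocol: the pickled list is prefixed with a 4-byte, big-endian length header (the host name and metric are illustrative):

import pickle
import socket
import struct
import time

now = int(time.time())
# three data points of the same metric in a single message
tuples = [("app.numUsers", (now - 20, 40)),
          ("app.numUsers", (now - 10, 41)),
          ("app.numUsers", (now, 42))]

payload = pickle.dumps(tuples, protocol=2)
header = struct.pack("!L", len(payload))  # length prefix, network byte order

# default pickle port is 2004
sock = socket.create_connection(("graphite.example.com", 2004))
sock.sendall(header + payload)
sock.close()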

Although not documented, the plaintext protocol also supports sending multiple metrics in the same TCP packet, separated by newlines.

Many implementations of these protocols, in multiple languages, can be found on the internet.

 

Whisper

Whisper is not an actual database but a library optimized for writing time-based files. Each metric is written to its own file, and each file has a fixed size derived from its retention rule. This makes writes efficient: the location of each data point in the file is known in advance from its timestamp. The full file is allocated when the first data point for the metric arrives (a utility to help calculate file size based on retention can be found at: https://gist.github.com/jjmaestro/5774063).
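A rough Python sketch of that calculation, assuming whisper's on-disk layout of a 16-byte file header, 12 bytes per archive header and 12 bytes per data point:

def whisper_file_size(archives):
    # archives: list of (seconds_per_point, retention_in_seconds) tuples
    points = sum(retention // step for step, retention in archives)
    return 16 + 12 * len(archives) + 12 * points

# retentions = 10s:14d -> 120,960 points -> about 1.4 MB
print(whisper_file_size([(10, 14 * 24 * 3600)]))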

The folder structure is very convenient: if your metric is a.b.c, you will have a file named "c.wsp" inside a folder "b", inside a folder "a". If, for whatever reason, you wish to remove the metric's data, you just delete the file.

Since the whole architecture of graphite is built like Lego blocks, any part can be replaced. So if you want to implement your own database library, you can go and do it (see http://graphite.readthedocs.org/en/latest/storage-backends.html).

For an example (and an in-depth article on whisper) see http://www.inmobi.com/blog/2014/01/24/extending-graphites-mileage.

 

Carbon-Cache

Since graphite is designed for high-rate writing, I/O is obviously the bottleneck. To solve this, graphite added the carbon-cache: all writes and reads go through it, and it persists the metrics to disk at a configurable rate. The cache holds a queue per whisper file, so that each file is written in one optimized block.

In the carbon.conf file you can configure multiple options to fine tune your graphite performance. An important entry is the following:

MAX_UPDATES_PER_SECOND = 500

 

This entry defines the number of disk updates per second. Fewer writes to disk mean better performance, but at the risk of losing data in case of a crash.

For fine tuning see the following article: http://mike-kirk.blogspot.co.il/2013_12_01_archive.html.

Configuration example:

[cache]
LINE_RECEIVER_INTERFACE = 127.0.0.1
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 127.0.0.1
PICKLE_RECEIVER_PORT = 2004
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002

Carbon-Relay

Since the architecture gives each metric its own life cycle, we can store metrics on different machines, or, for performance, run more than one cache (see the sections on performance boost and high availability).

Configuration example:

[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2014:1, 127.0.0.1:2024:2

Web API

Graphite's web application is built on Python's Django framework, with a REST API that can be queried to generate graphs as images or to return raw data in various formats (CSV, JSON). The main user interface can be used as a work area for composing URLs for metric retrieval.

The web API can read from either the whisper files or the carbon-cache, so it can access data that has not yet been persisted to disk.

The web application offers a GUI dashboard, and the same data can be retrieved via the REST interface.

Getting data from graphite is as simple as:

http://graphite/render?target=app.numUsers&format=json


There are of course many more options: retrieving multiple metrics with wildcards, defining the time period, choosing the format of the reply (json, png, csv, raw), applying functions to metrics before retrieval, and more. For details see the documentation at: http://graphite.readthedocs.org/en/latest/render_api.html.
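A Python sketch of pulling raw values from the render API, assuming the requests library and a graphite-web host reachable as http://graphite:

import requests

resp = requests.get("http://graphite/render", params={
    # apply a function before retrieval: hourly averages of the metric
    "target": "summarize(app.numUsers, '1h', 'avg')",
    "from": "-24h",
    "format": "json",
})

for series in resp.json():
    print(series["target"])
    for value, timestamp in series["datapoints"]:
        print(timestamp, value)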

If you want to enhance your dashboards, have a look at this open source graph editor: http://grafana.org/.

 

Performance boost

To boost the performance of graphite, it is recommended to run one carbon-cache per CPU core, so the machine can handle more metrics at the same time. Each carbon-cache needs its own ports (actually two: one for plaintext and one for pickle). This is a problem, since our clients should not need to be aware of this layer of the architecture. To solve it, graphite uses the carbon-relay: the client sees only the carbon-relay, and the relay distributes the metrics between the carbon-caches.
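A sketch of what this looks like in carbon.conf, assuming two cache instances named a and b (the instance names and ports are illustrative); the relay listens on the default ports and consistent-hashes metrics between the caches' pickle receivers:

[cache:a]
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
CACHE_QUERY_PORT = 7002

[cache:b]
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_PORT = 2024
CACHE_QUERY_PORT = 7102

[relay]
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2014:a, 127.0.0.1:2024:b

Each cache is then started separately, e.g. carbon-cache.py --instance=a start.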


Debugging techniques

The first step: once you send a new metric to graphite, check whether its whisper file was created, as in the sketch below.
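A small Python sketch of that check, assuming the default storage path and the whisper library installed (the metric path is illustrative):

import os
import whisper

path = "/opt/graphite/storage/whisper/app/numUsers.wsp"
if os.path.exists(path):
    print(whisper.info(path))  # archives, retention, aggregation method
else:
    print("whisper file not created yet")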

Next step: carbon.conf has the following logging flags:

LOG_UPDATES = True (log every whisper update)
LOG_CACHE_HITS = True (log every cache hit)
LOG_CACHE_QUEUE_SORTS = True

To view the logs, go to ../storage/log. There you should find three log folders: carbon-cache, carbon-relay and webapp, and under each one more folders, one per instance of the application.

For example, to debug the cache, go to the folder /storage/log/carbon-cache/carbon-cache-1. To debug your application sending metrics to graphite, use listener.log: any connection failures should show up in this file, as will errors caused by invalid formats sent to graphite.

Multi-tenant

If you need to use graphite for multiple customers, you can easily do this by adding a prefix to the metric name (for example, acme.app.numUsers for customer Acme). Just remember that this is a solution at the application layer, not in graphite itself: if you give clients a direct connection to graphite, you cannot restrict the data per client.

Events

Graphite does have a simple mechanism for saving basic events. The basic structure of an event is: when, what, data, tags. There is a dedicated GUI for viewing events, and you can also query them via the REST API. Events are not the focus of graphite and therefore lack features you would expect from a full events system, so if you need anything more than simple events you should look at a more robust system (like elasticsearch). For more information see:

http://obfuscurity.com/2014/01/Graphite-Tip-A-Better-Way-to-Store-Events
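A Python sketch of recording and querying an event, assuming the requests library and a graphite-web host at http://graphite (the field values are illustrative):

import json
import requests

# record an event; "when" is optional and defaults to now
requests.post("http://graphite/events/", data=json.dumps({
    "what": "deploy",
    "tags": "release",
    "data": "version 1.2 rolled out",
}))

# query events with a given tag from the last day
resp = requests.get("http://graphite/events/get_data",
                    params={"tags": "release", "from": "-24h"})
print(resp.json())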

 

Hosting Graphite

If your servers have access to the internet and you do not want the hassle of setting up and maintaining graphite, you can always go the hosted route.