Prometheus Configuration Testing
When it comes to Prometheus, everybody knows it is the gold standard of monitoring, especially in a Kubernetes environment. What nobody told me is how hard it is to configure when the metrics are not exactly how you want them.
Let me take a step back.
Extracting the Metrics
For some time now we have been working on adding monitoring capabilities to our Spark streaming pipelines. All well and good. We ended up installing InfluxData’s Telegraf on each machine with the statsd input plugin, and each Spark process sends its metrics to the local statsd port. The reasons for this setup are out of scope for this post and will hopefully come out in a separate one. But I digress.
So the Telegraf process was configured with a [statsd] input and a [prometheus_client] output. Again, nothing special there. The problem is in the metrics themselves. They come out nasty.
The biggest problem I had was that the name or the app ID sits in the middle of the metric name instead of being a label. OK, no worries, we just need to relabel the metrics in Prometheus, which is scraping all that information.
After digging around in the documentation, I understood that what I needed was to write rules in the `metric_relabel_configs` section of the Prometheus configuration. There was just one problem: I had no idea how to do that.
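To make the idea of such a rule concrete, here is a minimal Python sketch that imitates what a `replace` relabel rule does to a metric's labels. It is not Prometheus code and not the configuration I ended up with: the metric name, the regex, and the `app_id` label are all hypothetical, and only the general behavior (join the source labels, match a fully anchored regex, write the expanded replacement into the target label) follows the Prometheus documentation.

```python
import re


def apply_replace_rule(labels, rule):
    """Imitate one relabel rule with `action: replace` on a dict of labels."""
    # Prometheus joins the values of all source labels with a separator (";" by default).
    value = ";".join(labels.get(name, "") for name in rule["source_labels"])
    # Prometheus anchors the regex at both ends, which fullmatch reproduces.
    match = re.fullmatch(rule["regex"], value)
    if match is None:
        return dict(labels)  # no match: the labels are left untouched
    # Prometheus writes capture groups as $1 or ${1}; Python's expand wants \g<1>.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", rule["replacement"])
    result = dict(labels)
    result[rule["target_label"]] = match.expand(template)
    return result


def relabel(labels, rules):
    """Apply the rules in order, as Prometheus does."""
    for rule in rules:
        labels = apply_replace_rule(labels, rule)
    return labels


# Hypothetical rules: pull the app ID out of the metric name, then shorten the name.
# The order matters, because the second rule rewrites __name__.
APP_REGEX = r"(app_\d+_\d+)_(.+)"
RULES = [
    {"source_labels": ["__name__"], "regex": APP_REGEX,
     "target_label": "app_id", "replacement": "$1"},
    {"source_labels": ["__name__"], "regex": APP_REGEX,
     "target_label": "__name__", "replacement": "spark_$2"},
]

if __name__ == "__main__":
    sample = {"__name__": "app_20190101123456_0001_driver_jvm_heap_used"}
    print(relabel(sample, RULES))
```

Each dictionary maps one-to-one onto the `source_labels`, `regex`, `target_label` and `replacement` keys of an entry under `metric_relabel_configs`. Python's `re` and the RE2 engine Prometheus uses differ in some corner cases, so a sketch like this is only a quick sanity check for a regex, not a substitute for running the real thing.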
So again, I started reading some more documentation and I finally understood how to write those rules, if only I could start playing around with them a bit. That’s when I realized: there is no rule generation or testing tool. You just write as you go and hope you do not break anything…
There I was, needing to test my configuration rules with no real tool to test them. So I did the only thing left to do: I installed Prometheus on my machine and tested the new configuration there.
I think this is a big problem for many tools, and there are a bunch of them: tools that need some special configuration or script, where the only way to test it is by running it in the tool itself. Usually the setup is long and cumbersome. Many times we do it in a production environment, because only there do we have all the pieces of the overall architecture needed to see whether everything works together. Jenkins, Airflow, Apache Druid, Prometheus… This is only a partial list of tools which do not let you check or test your configuration file. You all know those tools; you use them in production. How painful is it to add or remove a piece of configuration, only to find out that a specific parameter is in the wrong place and the configuration is not taken into account? But anyway, back to the plot.
Relabeling Configuration Testing
So after I finished writing the configuration file, all that was left to do was to point Prometheus at the spot where the metrics are exposed and query Prometheus for the relabeled metrics, right? That is just what I thought… Apparently, I do not have HTTP access to the machine on which the metrics reside.
Stupid, I know. Whatever.
One small step for man…
Then came the genius idea which made any other invention in the history of the planet pale in comparison: I logged into the machine which contained all the metrics, performed a simple curl and saved the current metrics on that machine to a file. The next step was to copy that file over to my machine (sftp, scp, pick your poison). And for the pièce de résistance: I used http-server to serve that file on my local machine. The only thing left to do was to change the `metrics_path` in the Prometheus configuration file to the name of the file and BAM!
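If you do not have http-server at hand, the same trick can be sketched with nothing but the Python standard library. This is a stand-in I am suggesting here, not the tool I actually used; the function name `serve_directory`, the port and the file name `metrics.txt` are all made up for the example.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_directory(directory, port=8080):
    """Serve the files in `directory` on localhost from a background thread.

    Returns the server object; call its shutdown() method to stop it.
    """
    # Bind the handler to the directory that holds the saved metrics file.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    # Listen on loopback only: this is a local test fixture, not a real exporter.
    server = ThreadingHTTPServer(("127.0.0.1", port), handler)
    # The thread is a daemon, so the calling script has to stay alive
    # for as long as Prometheus should be able to scrape the file.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Assuming the copied file is saved as `metrics.txt` in the current directory, calling `serve_directory(".", 8080)` and keeping the script running should let a local Prometheus scrape it with `metrics_path: /metrics.txt` and a static target of `localhost:8080`. Running `python -m http.server 8080` in that directory does roughly the same thing without any code.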
Elvis has left the building!
The setup itself and all the troubleshooting took me longer than the actual configuration and testing. Think about that.