It’s the nightmare of every developer who has some production code running out there. You get to the office, and before you get to your desk somebody directs you to some room where you find some red-faced Ops guy, together with your uplink, and maybe a senior manager, all looking dead serious, pointing at log snippets and arguing between them about some unexplainable outage from last night, and now that you’re here you should provide answers. But the logs are all fuzzy, and it’s very hard to make sense of whatever messages gathered in them. It’s all too much information, or not enough - at least not the part you need, and it’s all messy. How to make sense of all that?
What to expect
This post starts with some abstract materials which lay down the foundations for the proceeding more technical discussion. This post will be preceded by a post for logging in node-js.
Abstract: Logs address people and bots. Many stake-holders, opposing use-cases. Logs should describe narratives. The clearer and more readable and navigable they are - the better. But logs have a footprint and cost. Visualization tools help, but they also increase the cost. The main challenge is verbosity control, and the more control you have the better.
Technical: The principals behind log implementations, and how they are used to provide verbosity control: Emitters/Loggers and Transports/Appenders, the warn-levels & categoreis manifest, the code-path coverage manifest, and debug sampling technique.
In the bottom line it all boils down to culture.
Techniques: - In the bottom of the article I note techniques or patterns worth knowing.
I met many developers and many approaches regarding that scene above. Some developers suppress this possibility alltogether, refusing to think about it until it happens. It worked on their machine, didn’t it? It passed QA, right? (hopefully. automated or not.) But now that this happens - it’s anyway too boring here, and it’s time to find a new, more interesting job. Radical? You’d be surprised how common that is among kick-starters.
Others trust their wit to save them in such mean moments to talk their way out, trusting they can come with some smart theory and amazing ideas. Some of those even enjoy to be the only one that can really solve this problem, decyphering logs with a code browse (even in the cost of a call at 2:59am).
If you’re one of any of the above - this article is not for you. Go away. Why read about logs when you can read actual logs? Confused? Go check your social media and see what I mean. Still here? Here’s the concept.
Try to think of logs as something not much different than posts in social media:
- Every post has a timestamp, an author, some message text, attached data and possible call-for-actions. Each author shares about occurences in their lives, and they all spit their messages concurrently.
- There are many streams one follows, and possibly a few streams one produces. The better platforms know to aggregate the creme of the streams one values most, and present it as an expandable summary.
- Streams often address people, but the bigger value is drawn of them by bots. Sure, your cat’s latest funny pose is precious to you, but overall - It’s the bots that check how much you’re active. It’s the bots that rate your content and feature it next. It’s bots that decide what ad to show, and if to give you a coupon with it.
- The real value of posts in social media is the ability to respond in time to important information by acting on it in your own life , but I don’t mean like and definitely not report abuse (which could have been nice if we had it in logs.. :P ), but to take a real-life action derived from the info - like making it to a concert you like, getting that edge-product you were waiting for, or picking up the actual phone and calling an old friend who just got married.
But if we’ve mentioned report abuse - here’s a couple of examples:
- I hope you’ll agree how annoying it is when somebody spits a comment like
AAAAAA, or any of the like without any clear meaning or context, let alone such a post or status. On the least worst it becomes a piece of the noise you learn to ignore.
- And how would you feel if you see somebody describes their breakfast one component at a time.
Eggs: 2, poached, and
Onions - fried on toast, and right after
Onions are fried with Dutch Gaudaand right after - …if you did not lose the patient with this one, you discovered that
toast - full grain, seared. Consider also the contrary case - when that breakfast is for some reason important to you, and you need to collect all their posts from a huge scrolled message stream to learn that they decided to add
tomatoes in arugulawith that.
We can’t educate everybody about how they use their social media. But luckily, we can educate the teams we’re part of, or at least - lead in our teams a healthy culture of how to use logs.
This to say who uses logs and what the heck are they hoping to find there.
Logs start with the developer. Developers read logs when they develop. They do get to read logs after they deliver - but this scenario is much more rare. It’s like that code-quality paradox:
Well made code is by definition never get’s appreciated. Software is best appreciated when nobody has to look at it’s code. If somebody had to look inside - it’s already beacause something went wrong: it’s either not working as expected, or lacks features.
It’s the same with logs, but worse. If somebody is reading your logs - they are already in distress. The least you could do is make reading pleasant for them.
The big problem is when developers do not comperhand that logs are part of their product. Many developers believe that logs are some side-effect, or some preparation for emergencies that never come. But they do.
Even if your logs do not address your end-users - they ARE used by somebody, even if it’s currently now just you. Give the reader a good break!
Ops read logs to make sure the service they just deployed for you started OK. They sometimes make sure it had shut-down gracefully. They also read logs when some KPI fell out of boundries and the troubleshooting know-how does not help, and they have to makedo on their own. Sometimes these Ops are not deploying for you - they deploy your distribution to their on-premise environment - and there your situation is far worse. Any hickup in your logs results in far greater mountains of frustration, without you there to tell them - hey, it’s just a warning…
Support Engineers, Account-Managers, or in their modern name Customer-Success - read logs too. Whether it’s to search for an error a customer complains about your production, or to help with a problem an on-premise distribution - they dig their noise in. You’ll be amazed what stories many of them make of the logs they read, and how far from the truth these stories are. In the end of the day they are thinking minds with a job to make sense of all that and build a narrative that will help them find some solution to a user. But you don’t have time to explain them every log line and when it occurs, and they don’t ask. You can’t afford to be unclear with these guys.
In some occasions - users read logs. The obvious example are CLI tools, but who of us never clicked the “more info” button that some installers provide? Well, it does not end there. Let’s just acknowledge that a service is a CLI tool that binds IO: web ports, TCP connections, message-consumming connections, etc. If a message that a tool ran successfully is taken for granted - the startup and shut-down of a service should be on the same status.
Big Data Analysts
These guys look for patterns and cling to whatever they find to construct their narrative and back it with data. Often they find themselves with petabytes of log data, and a requirement regarding some business value to pull out of it. They just as well might miss the
dfdfdfdffff you sent to
System.out.println, or find it and keep to themselves what they think of it. But they are also thinking minds with a job to build a narrative and interpretations of their own and salaries that compete with developers. With these guys ambiguity costs more than unclarity…
They enter the circle much sooner than Ops and support, and I had in mind to consider them a part of the development team as modern teams do - but they still deserve a section of their own. I count them here, after Data-Analysers - because their use-case is the same: They browse through the logs, interpert them to build a narrative they can use to affirm their claims that the sofware works as expected, or not. In the better scenarios they are indeed a part of the development team and know personally every log line - but this is sadly yet to become the norm.
Well, this is a general term for batch-jobs, ETLs, data-reducers and such. Many of them are directed and/or programmed by Analysts, and sometimes by developers themselves. In the better cases there are bots that are put there by Ops and do some live monitoring. In the best scenario - the title DevOps is understood like it should be - Ops coded by Developers (unlike Ops that develop automations - which is better than manual rollouts, but still not what DevOps was really meant to mean) - and all that BI & monitoring pipelines is handled close to home by bots put by the same team who writes the logs. But although this scenario gets more and more common - it’s sadly still very far from the norm. In glorious cases - logs of CI-envs and QA are scrutinized by automation bots provided by QA specialists. More on that later in the Log Sanitation section.
. . .
All these stake holders may pose demands regarding things they would like to see in logs. Unfortunately, many of them don’ t know what to ask or how to ask, and when requirement came from them - it’s beacuse frustration-levels are sky-high, and on top of that these requirements are often passed via a non-technical product manager… No guys. It’s on us to think it out in the first place and consider it a part of our product quality.
If to try and reduce all the stake-holders and the differences between them to one sentence - it would be: Create visibility about what’s going inside The log should narrate what is done. A failure to show these narratives is no less than a bug.
But there are two distinct use-cases:
- When we look at the logs as UI - we just need to reassure the software works as expected. Here we would like to stay HIGH LEVEL.
- When we look at the logs to analyse/troubelshoot - we want to see deep level narratives: Sane full narratives that explain themselves and the experience the software gave. Here we would like to get VERBOSE.
The main takeout here is the requirement to reconcile two contradicting requirements. The first is not interested in the details, the second depends on it. Unlike installers - logs do not have a “more info” button. And despite the fact that logs often come with warn-level - this warn level is meant to be left untouched. So one must constantly negotiate the tradeoff between the big picture and the details, and intelligent ways to keep intelligible data accessible when needed, and out of the way when not.
One last stop on the abstract section - is how to view logs. In fact, the days where logs were by definition just bulky text files often rotated to
.tar.gz are behind. Many softwares still do, and that is getting obsolete. We want to get modern.
Here are some false assumptions about logs that should no longer be taken for granted.
logs are text files - wrong !
Logs are often not written to local files at all, but emitted to a remote appender. I’ve seen logs writing to an SQL table already 15 years ago, and now you can commonly find appenders to a collection of noSQL docs or even ones that emit via API to a 3rd party service (logsio, datadog, sumologic, cloudwatch/trail, etc…).
The takeout is that managing local log files, and log-rotation is not a concern of the app. Apps used to do it, but it has never been. More over - coding your program with an SQL appender or a 3rd party appender is basically coupling your app with it, even when it’s wittily configurable it’s unwittingly ships all the code that supports all these options, whether they are running or not. This may be considerable for a CLI, but surely not for a micro-service or a serverless function.
The more modern approach claims that the app should just provide the log stream, and it’s up to the user (usually Ops) to decide if they wish to pipe the stream to a file, or see it in console, or pipe it to a few different appenders that send it wherever else.
Here I must note that many Ops policies require still to pass through hoardable files - but that’s their concern, and they can do it even when the app emits to
stdout by piping the output to a file.
logs are text - wrong !
Well, kind of. Many logs are structured data in JSON streams, which are technically text, but while every JSON is text, not every text is JSON. One can claim that even JSON is an implementation detail, and log-entries are basically no longer message-strings, but property-bags. In fact - appenders do not have to use JSON over the wire - but many of them do, because it’s a common convenient standard.
True - when your app outputs a JSON stream it makes it unreadable - but this impediment on development is easily bypassed by a CLI-tool you can pipe the log output into, that provides a readable format, structured for the human eye, and decorated with colors. Actually we start to see logger implementations that feature such CLI-tool out of the box alongside its appenders and emitters.
logs are read raw - wrong !
CLI tools often emit text - and this text should be read raw - I’ll give you that. But already now for quite a while the more fancy CLI-tools let you specify a reporter that emits a machine-readable output. Examples? Test runners, Code-Coverage & Static-Code-Analysers just to name a few.
But beside output of CLI tools - nowadays, the only ones that read logs raw are service developers who run software on local-machine, which is not connected to any visualization solution.
In modern places the Dev-Servers that run feature-branch builds, the QA env that runs the stable-trunk, the staging, and obviously - Prod - are all connected to some log-centralization & visualization solution, and the even more modern cases - the developers pipe their app output into a visualization tool, and view it’s output instead - even in develop & debug environments.
The takeout here is that log-formatting is not a concern of the app. It’s replaced by data-structuring. Formatting and Visualization are moving away from production code (heh. nostalgic for the
. . .
In Summary, You have to keep in mind these points, especially the ones that come with perks: structured data helps you retain context, colors make logs fun, visualization tools make logs awesome.
The stress is on the difference between two concerns: log emissions and log visualization. In the old world, the logs were the visualization. What you wrote to the log file was what you read later, so it was common to format data in ways that are easier for a human reader to process.
The modern approach requires applications to provide structured data, and let the visualization be handled by designated programs - together with aggregation and analysis - all for which structured data is more comfortable to consume and program for.
Technically speaking - in this article I just dip the toe in tech terms. But this section is one layer down the abstract above, as it discusses architecture and practices without getting platform-dependent. There will be later an article of Node.js implementation with code-snippets, and possibly an article of Java implementation with code snippets.
Connecting to the main takeout of the Goals section - the challange is verbosity control. This is accomplished by a smart structure that has evolved to separate concerns:
Modern loggers are built of two parts: Emitters, and Transports. Although the terms may vary between different providers - the concepts stay the same.
These are the data-collectors, the ones you use to spit data. While the barebones implementations such as
echo, and such just pass a string to a text-stream and maybe help a little with expression formatting - modern logger implementations do a little bit more:
- represent a channel of a category
- implement a warn-level - discarding any message that does not qualify for the required log-level
- manage warn-level - i.e. log level which can be updated dynamically.
- carry on a context - i.e a property bag that will be included in every log entry emitted via this channel
The last one is the more modern approach and represents a shift. Logs traditionally were initiated on start-up and held in a static member per class / module, where all the log entries by that class and it’s inner classes were usually thrown down that stream - nowadays, it gets more and more common to see a logger instance to be obtained dynamically on every request, where the logger is constructed with context data, and knows by itself to append it to every log entry.
A.K.A appenders - are the parts which - given an entry should enter the log - handle the formatting and sending/writing/dispatching of the message, in correspondence with the target. Common transports are Console, Text-file, SQL, noSQL, SMTP, 3rd-party web API, etc.
While it varies in details between loggers implementation, by concept there is a correlation between a category and transports in a sort of a many-to-many relations: A transport may be associated with many emitters, and a message sent from an emitter may be handled by a few transports.
Warn-Levels are usually attached to Emitters, but some implementations let you plug a Transport which will filter further - even if an Emitter decided a log entry should be emitted - the Transport may decide that only entries of a certain log-level should be processed, and the fancier ones let you do that per category for each transport. While it opens many possibilities - it’s all a part of the old world.
In the modern approach, we see the Transports moving out of logger implementation to a program that consumes the app’s piped structured data.
So, whether your appenders run as part of your app, or as a tool that the app output streams are piped into - how do you create verbosity control?
Log levels configured per category should let you tune the amount of data you’re getting in response to the situation on a given environment. Each environment has it’s own configuration in response to the required verbosity. Why tune verbosity? Why not just work on highest warn-level?
Warn-Level impact: Cost
Mind that emitting verbose logs have a cost in performance, storage, and analysis resources. In the end of the day it translates to money. To make things clearer - a distributed application which runs on any of the popular clouds (aws, google, azure) pays more for the logs tranmission over the wire than its storage costs. Add to that data-enrichment, data-healing and visualization tools - it could get pricy just in terms of your cloud-provider bill, let alone the time & CPU it takes to browse mountains of data to troubleshoot. And if you’re not using cloud - what you’re not paying to the cloud provider you’re paying in maintenance. Collect data by need.
Setting Warn level is only half of the story
- Mind that formatting and gathering log data in every entry costs memory and CPU, especially if you create immutable or specialized data copies to pass to your logger function. Choose a logger implementation that let’s you check for the active warn level, and perform that gathering only when your active warn level requires it.
While it’s actually better to pass neutral message strings, and pass properties in the property-bag - I still too often see formatted messages in logs, and since loggers come with formatting functions - this could very easily be abused.
Common Example of a complete abuse is with ES7, which brought a syntax for string templating - the backtick - which rids you of string concatenation with an elegant buffered template implementation, and makes it just too easy to format strings.
But if you are using backtick templates as your log messages - the template will render even if the logger discards the message for warn level filtering. So if you already have to format message strings - it’s either preceding such log-entries with a check on the warn-level - or not using backticks for log messages, and let the logger do the formatting (using the go’old
Dynamic warn levels
Prefer a logger implementation which let’s you raise warn level within a given context (e.g web request, consumed-message) usually in response to a warning detected in that context.
In C# and Java it’s all but too common to hold logger instances in static variables. In rare cases logs are passed as part of the context. You could change the warn-level of a given context if your emitter implementation consults a dedicated value on
ThreadLocal before every emission, and escalates this value on the first warning. Sure - it poses a performance penalty - but since a lot of code is already written with static loggers - you might want to consider that.
In Node.js it’s common to provide a logger emitter per request, with it’s own state. Node has what is called continuation-local-storage - but I strongly recommend against it. I’ve seen memory and performance problems it introduces, and there’s the occasional memory leak too - so I recommend to stay away from it, and use better patterns encouraged by the platform and the language - passing a context to all processing depth, and place the logger there.
The power of warn levels come when you have a strict policy regarding what levels mean and how to use them. Here’s my recommended scheme:
Fatal = the process has encountered an unexpected error, cannot guarantee correct execution results, and will terminate gracefully. (ungracefuly means the process is killed, and has no say - there won’t be then a chance to log… :p). There should be more log lines after the FATAL emtry, describing the shut down flow - expect INFO about shutdown complete ok. The more your service is involved with transactions - the more important that is.
Error = the code caught an unexpected error, but is in state which introduces no risk to carry on work - no risk to data integrity or system integrity, no risk of false-positives, or missed detections, and can guarantee correct execution results. While the app can keep running - errors emitted to a log are a call for attention and possibly trigger an action. Maybe a number of errors in a given time rather than a single error - but emitting an error is basically talking in person to the poor on-call guy that gets an alert message about too much errors on production. So do your best to be helful.
- Warn = the code encountered a situation which might later result in an unexpected error. Either by meeting an input that is unlikely or improbable, and therefore may indicate a faulty execution in previous stage, or met a faulty component state that MIGHT be required later (if it’s needed now - it’s already an Error).
- Mind that getting 404 from a downstream service may - depending on your logic - be just fine. So logging errors in your low level http client whenever it does not get 20X is not a good idea.
- Mind that blocking bad USER input is exactly what your code is expected to do. Warning about it is considered conservative - paranoid decision. But blocking input between two internal parts where the upper owns the validation concern - is a must. Some Architectures even consider that an error.
Info = the process has met a critical point in its execution. There should be preferably exactly one such log entry for every code path that executes more than a few lines of code. Without getting into line counting consideration, the requirement is that by reading the log one would be able to point what code path did the execution take. There’s also a difference between Info entries on the main controller, and Info entries on inner parts - more on that in the Categories section.
- Debug = any troubleshooting information a developer may need to troubleshoot an outage.
Mind that abusing the debug level may make the debug level inapplicable on prod, where traffic is high.
- Note to Avoid adding arbitrary log lines one after another. If you need more data - add the data to existing lines. If they belong to a higher log level - add additionally gathered data in response to la warn level check.
- Silly - Silly is the place where you’re in total doubt, and can’t be sure of anything. So you start shooting comments everywhere - which is just …silly. Use as much as you need to gain confidence, and remember to manage gathered additional data in respect to warn level. The contract is that this level won’t be used on PROD. Maybe on qa, and in rare occasions on stage, if it won’t crush the service.
I personally practice the verbosity control with levels with every run of my code. I personally do not use debuggers. This keeps me in line with the requirement of logs that provide comprehansive sane full narratives.
Consider different categories as different channels. The early log4x implementations defaulted to one category per class/module, but one can define a category to be a flow, or a usecase, initiating the logger with any string as logger-name. It basically depends on how you world like to analyze your data later, and correspondingly - how you would like to tune your data-stream.
The design goal behind categories is to let you zoom in on chosen parts by need. The usual strategy is a routine in which all categories default to warn, except for the main controllers which are explicitly set to Info. This should let you have the overall narrative through the main controllers, and when that’s not enough - you can increase warn-level for channels that handle the code-paths between existing log-entries where visibility is not enough. Although if you have followed the instructions on levels well - it should be !
In this there is no way I can stress enough a common mistake, where every layer of implementation checks for output and writes errors to log, and you end up with a log that writes the same propagated error to the log, once per layer. You have to keep in mind that you’ll eventually want to scan your logs for errors, and every error could mean crossing a threshold of a notification to an on-call guy. Deeper implementation layers may log warnings - but not errors, unless - they in deed caught a truly unexpected error.
Mind that you don’t have to have a log line from every level to know where the error is from: it’s call-stack should be enough. Dynamic languages let you augment the error with more information, strong-types require you wrap it with a custom error - so all the information is preserved.
On the same breath - if emitting a waring in a context increases it’s warn level - you should also consider when to throw a warning.
Context is what helps group log-entries together and create a full picture.
Sometimes your full picture is a full narrative. In other case - your full picture is success-rate of processing a given type of incoming-message, or a count of the times a use-case is used in the system. If you want to be able to collect this information from your logs - you’ll want to send a set of attributes with each log entry by which you could later cut the log-data. e.g. - host-name, request-id, input-parameters - these attributes are called contextual-data, or just context.
Most loggers come out of the box with time-stamp and warn-level. Some gather automatically the host-name and the process-id, and the fancier ones collect the calling function and line-number. I usually avoid the latter for the arguable heavy cost of performance, but I still run into such loggers a lot.
On top of this out-of-the-box context - every organization adds it’s own layer of gathered contextual-data, in a way that is most comfortable for it’s members to view, filter and analyze.
Common information that is gathered:
- web context
- procotol, verb, URL, http-headers, parsed cookies, path-params, qs-params, client-ip - all these are a common must that is very easy to gather. It gets more complicated with request body - some request bodies can be very long and require a custom logic to select the representing nuggets.
- What I like to recommend in this case is to decouple the log-data from the fact it came through a web-request. I mean - I do include information about the message being an incoming web-request - but I like to aggregate all the parameters to a single collection where I reduce the final word from all web sources, and present this collection as a sibling of the request descriptor. The response descriptor is obviously observed on the later log entries - the ones that are emitted after the response is sent.
- Special attention should be given to request-id, or correlation-id, or trax-id - which is usually an http-header that is propagated between micro-services that helps constructing a full narrative across all uServices triggered by a single user-action. If the processing is then delegated to an async queue - the correlation-id should be included in these messages (zipkin, open-tracing, read more).
- message queue - source-queue/channel, message-id - a must. Here we need to construct the representing context from the message payload - and we need to especially remember the correlation-id. Sometimes including all the payload is an option - in many times it is not. The payload may include long data-chunks that make no sense to repeat in every log message…
- batch processing - this is like an async messasge-queue that processes every now and then a whole bunch of messages that were accumulated for it while it wasn’t running. Often it’s a set of CLI tools ran by cron-tasks. Everything that is said about message-queue is right about batch processing, where instead of the incoming messasge - one has to represent the current item in the batch. A good addition would be the start time of the batch processing, and info about the processing agent as well.
- Sometimes processing is done in batch because the sought value is not in single-messagaes, but in their whole. In this case I try to architect it so that creating a reducable view of each message is decoupled from reducing value from all messages together, where the first processes each message, and the message of the latter is the entire view - each measurable & tunable on it’s own.
- all of the above - I advise to add more information about:
- the processing agent, expecially the version, and a trace to the commit and build it came from. Sometimes the version is enough, in other cases I append the commit-hash and the CI job id that produced it.
- Logger version - i.e the protocol by which the message is formatted. This will help you handle evolvution of data-gathering.
- Execution time of every message - regardless of the message type.
Ok, so you decided now you want everything in your logs. Here I must briefly mention things you want to explicitly leave out. It differs from organization to organization and their privacy-policy regarding user data, but lets start from the obvious:
- Configured Operational Secrets - you do not want these to leak out into logs: DB passwords. Encryption keys. Shared secrets. Salts. API Keys. These are Secrets.
- Topography information - I see how common it is to configure hosts to access peers by IPs. This topography information is very helful to Hackers. It also makes higher code manage lower-levels. What I recommend is configure access of peers by host-names, and keep the actual IPs hosts file - or better - local DNS. This not only decouples the configuration further from actual IPs, it also requires a hacker to penetrate the DNS in order to get a list of next targets.
- User sensitive data - user passwords, user emails, credit cards, in some cases - depending on policy - even user names and user personal names. In most cases user-id should be enough. Keywords: Logs Anonymization, Logs Sanitation.
There are few ways to handle such info:
- elimination - is where the field is not sent to the transport entirely.
- trancation - is where the field is trancated, like showing only the last 4 digits of a CC card.
- obfuscation - is where field value is messed with. Like covering with asterixes all but 4 digits of CC to help validate all digits are there.
- encryption - is where the field value is salted and hashed somehow - not only to verify that it’s been sent, but also allow validate in some terms that it was delivered with the right value.
While transports can be moved out of your main app-process, sanitization of logs should be handled on the emitter side to make sure no sensitive data leak out. There are two common practices.
- emitter implementation filter - means that the logger is configured to detect dangerous fields and values, and handle them before they are passed to transports. This means that every log entry is checked for a myriad of options. It can be narrowed down by providing an explicit set of fields to handle and handling strategy per category as cascading values of an overall policy - but even then it’s a load of processing - what I call a logger’s hunckback.
- filter in user code - means that the responsibility is on the user-code that passes the info to the logger in the first place. Bugs here are caught in automation tests where application logs produced during automation are inspected against dangerous values who’s appearance may fail CI. If you can’t have that - go back to the hunckback.
Depending on your policy, capability and culture - you’ll use a mixture of both. Me, being a tight scrouge about any production CPU cycle I can get - I lean on the second as much as my setup allows.
Holistic / Compatibility
It’s easy to complain about culture in social media. On top of grammar and clarity comes the style, emphasized by generation gap. This is true for logs as well. Even if all the logs you’ll write yourself from now on will be modernized - you’ll still find in your system parts that produce older form logs and require special processing before they can fit to your log-centralization and visualization system.
But we could provide a full and coherent narrative only when we include in our logs-centralization EVERYONE.
Everyone = logs of:
- your microservices
- your legacy monolity (or non legacy monoliths… :P)
- HTTP-request logs from your load-balancers
- a trace of every outgoing request
- Database logs: SQL, noSQL, graph-dbs, key-stores - you name it.
- Stats counters
- OS performance metrics.
Not all of them will talk the same language and you’ll need some adapters layer to pull an anonymous text-line from some database log, extract parameters from it, enrich it with info that is not there but is still available - e.g. the host it came from - and send it to your centralized data-pool. And there are ready-made solutions and customizable solutions for the majority of components used these days. Examples - logstash for in-house solutions, or datadog, logz.io and even newrelic for saas.
I’ve seen a wierd case where the Where-Clause in SQL statements was “decorated” with
... AND '.' != ? where the
? embeds the correlation-id - just to force the correlation-id in the SQL log, or other dirty tricks to add it to the Select-Clause. It depends on how important to you it is to capture every hop in your analysis.
In many cases, in the light of the cost of such setup, it’s often decided to leave some components out - and this is a legit decision. But this possibility makes it even more important for you to ensure your app-logs provide the required visibility, even if other parts of the app are not in the party.
There are many points about culture scattered accross sections of this article. Here is a recap of them, and more.
- don’t just AAAAA dfdfdfdffffff !!!=====. Speak english. This requires some thought and phrasing. I personally have a special habit to malindent temporary unformatted debug prints to make them bold and prevent them from being checked-in without proper phrasing. But you can’t do that if your IDE auto-indents on every <enter&rt; stroke.
- don’t emit consecutive log entries in a same code-path. Remember that each entry carries a context wich may not be small like the single parameter you now wish to externalize in logs. For me - in the phrasing phase as I go over the malindented emissions and decide how to phrase them propperly as part of the narrated story - I often find that I could append the additional info to another existing log line.
- don’t gather unnecessary data - in case the gathering of additional data implies CPU abuse - I have to include an
ifstatement that checks whether the warn-level justifies, especially when I add additional data to an existing log which may or may not meet the warn-level I aim for.
All this comes easy once you internalize that logs are a part of your product, and their quality must be maintained. And now for the bonus part.
The practice of providing healthy quality logs is a sub-section of a bigger culture. Discussing this culture is worth a much longer dedicated article, - well. whole libraries of books actually - but I won’t come clean unless I mention it here.
In my language it’s called Value Driven Development. This is a concept by which every feature, every bug, every task that lands in the R&D backlog has to be expressed in quantifiable business value.
Example - consider this story:
As a user When I login I would like a clear visual signification that my caps-lock is on So that I can correct myself
In VDD the context for this story would be:
Reduce the amount of failed logins below 15% without sacrificing user safety
Which is just one effort within:
Improve user retention by - increase Day7 retention above 35% - increase Day21 retention above 15%
Well. To make such a claim, one needs to know how much failed logins are in the system. In VDD the story above will make sense only when you have got a clue that the failed logins might have occured because the caps-lock button was on. Think of the data you’ll need to collect to make this observation - and yes - information through support tickets count too, but why wait?
This to say that many a feature enter some product only because somebody thought it would benefit somebody without any data to back it up. Sadder than that - in many cases nobody measures the impact of that delivery once it’s there.
In VDD the Definition-of-Done is the desired move in KPIs: You use many KPIs such as the rate of failed logins to quantify the state of your product, and use that to detect/verify value that can be/has been delivered - usually in the form of product features, but not only. Support procedures, company processes (such as qa-automation and rollouts) and more - all bring value, despite not being product features.
To be able to do that - one mostly needs healthy quality logs and good analysis & visualization.
One last note here - business value is not aimed only at users. This derrives from the service-oriented culture where every stakeholder is a customer. Measuring by business value - one may go as far as saying the delivery that gets to users is just the tip of the iceberg. The actual product includes the process, the people, and the culture .
Here are some examples of common internal KPIs with a lot of business value:
- Duration of a CI/CD cycle
- Time it takes a feature to delilver (once it has gained priority, left ideation and entered the pipeline)
- Number of failed builds on the stable-trunk branch
- Number of Nodes in the cluster
- Average CPU load (in the cluster, on user-device - depends on your business…)
- Average lifecycle duration of a support ticket
You’ll note that some of these KPIs require to analize also Jenkins build logs and issue-tracker streams. I told you - everyone…
Maybe not all as part of the same index, but definitely …EVERYONE.
In this section I point techniques which I found useful in many scenarios. Not each of them suites every usecase - but they just worth knowing.
Remember the breakfast logger? Imagine a tone of thowse logging concurrently, and consider you really need to know about these breakfasts because it matters to your business value. Collecting the narrative behind each breakfast could be tedious. Although modern data-solutions are perfectly capable of meeting the goals - but like in real life - more often than not it’s easier to put things right in place than to hoard then and try to sort them out later. With logs - it’s often cheaper if developers give a hand.
A summary entry is a sinlge log entry which contains a full narrative. Consider that instead of sending every message right away to the appenders, a logger accumulates sub entries, and waits for a conclusion signal, where this conclusion signal causes all accumulated sum-entries to be sent together. e.g. - on web context - hook on the end of a response’s lifecycle to send all the summary together.
A sub-entry may contain sub-context - depending on the business requirements. I recommend to collect as a minimum the time the message was passed, so when I do it I never use just an Array of strings.
- save a lot of log space. The wrapping context is sent once, instead of once per entry.
- save time browsing logs: each entry is a full narrative! easier find, easier read, easier analyse.
- The data is not in the data-pool until the data is in the data-pool.
The challange is to make sure no data is lost: Some businesses require not to lose a single log entry. Here you may consider firing each entry right away AND collecting it to a summary entry. If you have to do that- you may consider having the summary logger emit to a different logger with it’s own category and warn level. This would give you a better control.
Debug-Sampling: The Sweet Spot
I’ve made the claim that too high warn-level costs a lot. Now I’ll make the opposite claim, and advise how to find the sweet spot between them.
People are …well, people - means - human. And I (yet) consider developers human. This to say that even with a clear policy documented and communicated accross all teams - people still have their own opinions and interpertations regarding what is important and how much, let alone make mistakes. It’s terrible to find retroactively that not enough data was gathered about an outage that has occured in the past, and the only thing that can be done now is increase warn level and wait.
Debug sampling is a concept where your logger implementation decides arbitrarily to raise warn-levels to debug for a control-sample of requests. This flow is totally independent to the warn-levels escalation of warnings. It may be configured as a percent of all incoming messages, and if to get fancy - a percentage per type of incoming message. e.g. if it’s web - then per web-endpoint - i.e verb + route. Or if it’s a message consumer - per message-type.
This assures a chance that when an outage happens and some endpoint gets deployed with a bug, or an update to some uService has just introduced an edge-case that is not handled - you just might already have data about it from the first malprocessed messages.
In C#/Java - if you got your developers using an organization implementation and tweak behind the scene a warn-level per context - this one would be a breeze. In node.js you’ll accomplish this with a simple middleware which should run right after the middleware that initiates the
req.log, or even as part of that.