Do you master the basics of Log Verbosity?
What do you see in your app-logs? Not enough info? Too much noise? An opaque stare back at you?
When you find that your logs are not useful it's often too late: you have a problem in production, and on top of that you realize you do not control your logs. You have lost the basics.
The challenge of app-log verbosity is negotiating the tradeoff between the big picture and the detail. The answer lies in a well-thought-out set of policies, a strategy, that should sink into practices and culture.
I discuss here two of the pillars of a verbosity-control strategy. It's amazing how basic they are, and still how many veteran teams have lost them. Do any of these symptoms sound familiar?
- Production log settings are fixed on info, and changing them requires explicit permission from a few people in the organization.
- You learned you can't rely on logs and became dependent on a debugger for troubleshooting.
- PRs whose whole purpose is to change a log level from debug to info.
- Log lines are skipped over in code reviews.
- You depend on some 3rd-party magic to draw flame-charts for you, and you helplessly put up with its performance footprint.
If any of the above applies, the team should get back to the basics:
Your logging policy doesn't work. It is either not well thought out or not observed, and therefore your logs grow wasteful and useless.
It does not have to be this way.
Warn Levels Policy
TRICKY: it's a filter. The higher the warn level, the fewer logs you get.
In the end you want a smooth linear control:
higher verbosity = LINEARLY more logs.
How? It’s a highly opinionated area, and has a lot to do with culture.
Let’s start with the scheme. Here’s my favorite:
FATAL
The execution has encountered an irrecoverable error, cannot guarantee correct results, and will terminate gracefully (ungraceful termination means the process is killed, with no chance to log anything… 😛).
❗️ There should be more log lines after the FATAL entry, describing the shutdown flow: expect at least an INFO that the shutdown completed OK. The more your application is involved with transactions, the more important this is.
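A minimal sketch of that expected narrative, using a hypothetical `log` helper (the logger name and messages are illustrative, not from any specific library): the FATAL entry first, then the shutdown flow, ending with the INFO that confirms the shutdown completed.

```javascript
// Hypothetical in-memory logger, standing in for your real one.
const lines = [];
const log = (level, msg) => lines.push(`${level.toUpperCase()}: ${msg}`);

function terminateGracefully(err) {
  log('fatal', `irrecoverable error, terminating: ${err.message}`);
  log('info', 'draining in-flight transactions'); // the transaction-heavy part readers care about
  log('info', 'closing connections');
  log('info', 'shutdown complete ok');            // the entry to expect at the very end
  return lines;
}

terminateGracefully(new Error('configuration corrupted'));
console.log(lines.join('\n'));
```

Reading this log bottom-up, the on-call engineer knows the process exited cleanly rather than being killed mid-transaction.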
ERROR
The execution caught an unexpected error, but is in a state which introduces no risk in carrying on work: not to data integrity nor to system integrity, no risk of false positives nor of missed detections, and it can still guarantee correct future results.
❗️ While the app can keep running, errors emitted to a log are a call for attention and may trigger an action, perhaps for a number of errors in a given time window rather than the first single error. But emitting an error is basically talking in person to the on-call engineer who gets an intrusive interrupt: do your best to be helpful.
❗️ Expected errors often should not be logged as errors.
WARN
The execution encountered a situation which might later result in an unexpected error, or in a harmed or partial SLO.
e.g. meeting an input that is unlikely or improbable, or a missing or faulty component that MIGHT be required later (if it's needed now and cannot be initiated, it may already be an Error).
❗️ Getting a 404 from an upstream service may, depending on your logic, be just fine! Logging errors in your low-level HTTP client whenever it does not get a 20X is not a good idea. Same for APIs that reply 404: it's often exactly what they are expected to do.
❗️ Blocking bad USER input is what your code is expected to do. Warning about it is a conservative, paranoid policy decision.
❗ Blocking bad SERVICE input between two internal parts, where the upper part owns the validation concern, exposes a case of faulty execution in previous stages, and warning about it is a must. Some policies even consider that an error.
Obviously, there is no golden rule here, just a rule of thumb. The fault in the previous stage is a bug that needs fixing, but the severity could vary from a bug that renders the service inoperable all the way to just causing some inefficiencies.
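The 404 guidance above can be sketched as a tiny level-mapping helper. This is an illustration, not a real library API: the function name and the `resourceOptional` flag are hypothetical, standing for "our logic says a miss here is fine".

```javascript
// Map an upstream HTTP status to a log level - the caller, who owns the
// logic, decides what a 404 means; the low-level client never logs errors itself.
function classifyUpstreamStatus(status, resourceOptional) {
  if (status >= 200 && status < 300) return { level: null };        // success: nothing to log
  if (status === 404 && resourceOptional) return { level: 'info' }; // expected miss: just fine
  if (status === 404) return { level: 'warn' };                     // improbable input, might hurt an SLO later
  return { level: 'error' };                                        // genuinely unexpected upstream failure
}

console.log(classifyUpstreamStatus(404, true).level);
```

The point is that the same wire-level event lands on different levels depending on the business meaning, not on the HTTP code alone.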
INFO
The execution has met a significant point of interest. The requirement is that by reading the log, one would be able to tell with sufficient confidence which code paths the execution ran through, and why.
This can be the basis for KPI metrics and SLO dashboards.
DEBUG
Any information a developer may need to troubleshoot an outage. Mind that abusing the debug level may make it inapplicable in production environments, where you need it the most.
Don't use debug as your trash bin, but as a mechanism that, when turned on, MUST NOT kill your process under load, and will help troubleshoot live issues.
❗️ Avoid adding log lines one after another in the same code-path:
- If you need more data — add the data to an existing line.
- If the existing log line is of a higher log level, gather the additional attributes behind a check, e.g. log.isDebugEnabled().
❗️ Avoid debugging by dumping entire objects. Doing so will prevent using this level in the environments where problems are observed live.
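Both rules can be sketched with a minimal hypothetical logger (the `makeLogger`/`handleOrder` names are illustrative): one consolidated debug line instead of several, and expensive attributes joining a higher-level line only behind an `isDebugEnabled()`-style check.

```javascript
// Hypothetical minimal leveled logger, just enough to illustrate the two rules.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

function makeLogger(level) {
  const entries = [];
  return {
    entries,
    isDebugEnabled: () => LEVELS[level] >= LEVELS.debug,
    info: (msg, attrs = {}) => entries.push({ level: 'info', msg, ...attrs }),
    debug: (msg, attrs = {}) => {
      if (LEVELS[level] >= LEVELS.debug) entries.push({ level: 'debug', msg, ...attrs });
    },
  };
}

const log = makeLogger('info'); // production-like setting

function handleOrder(order) {
  // Rule 1: ONE debug line carrying all the data, not three consecutive lines.
  log.debug('order accepted', { id: order.id, items: order.items.length, total: order.total });

  // Rule 2: expensive extras join the higher-level line only when debug is on.
  const extra = log.isDebugEnabled() ? { raw: JSON.stringify(order) } : {};
  log.info('order processed', { id: order.id, ...extra });
}

handleOrder({ id: 42, items: [1, 2], total: 99 });
```

At the info level the debug line is suppressed and the costly `JSON.stringify` is never paid; flip the logger to debug and both appear.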
SILLY
Silly is the state where you're in total doubt and can't be sure of anything, so you start shooting comments everywhere, which is just… silly. However, as silly as they may be, the fact they got their own level suggests it's definitely a thing. Some call it trace, but trace is something else that used to be done in logs. More on that in other posts.
❗️ The contract is that this level won't be used in production environments. In fact, it should not make it past merges to master. IMHO, silly tells the reader: this is something I only needed for development. If it is justified in production, make it a debug.
The Scheme is just a part of the picture!
The warn-levels scheme only sets a baseline.
The next step is controlling their distribution!
Remember: Linear control.
Distribution in code
A good implementation of the scheme results in a rather equal distribution between the levels. A clear indication of implementation failure is finding that the amount of debug occurrences in code is twice the amount of info occurrences. This is most probably a case where debug will choke a process under load.
It's not a budget for debug lines, i.e. I do not keep a balance of spent debug statements, but I do treat them with respect and consider whether each one is necessary.
Another indication is a lack of errors and warnings. Again, not in the log, but in occurrences in code.
I myself go to the length of asserting in tests that, in the cases they are expected, errors and warnings are (a) emitted and (b) emitted exactly once, treating it like any other requirement.
Reflect your SLOs
On top of the distribution, you may tune entries up or down on a per-case basis depending on the impact on your SLO. There is no golden rule here: these deviations from the scheme should be actively deliberated in designs and reviews.
Emission Categories (a.k.a channels)
Consider different categories as different channels.
How is that useful?
When a few known categories give you the big picture, and all the rest of the noise is organized in other categories you can choose to see, you gain the freedom to zoom in and out as you please.
Here are two common categorization strategies:
Per Class / Module
Every codebase is divided into files, so that's the default practice. It only works when you organize your application with separation of concerns: flow-control logic is held in controllers, implementation details are held in services (a view, in that sense, is a case of a service that does formatting).
The overall narrative is available through the main controllers, and the implementation details through the services.
❗️ If your services handle mixed concerns and play a role in a lot of use-cases, the category of such a service will still include a lot of noise. You can consider having such services emit to different categories, which starts to combine the other strategy below.
BTW, if you need to do that, it's an indication you also lost the micro in microservices (or never really wanted it).
Focus for Flow / Use-Case / SLO
This strategy requires awareness and culture, and because of that, despite being more powerful, it is less common.
Here categories do not reflect the place in code, but the currently executed flow. This poses a code-organization challenge on the reusable parts, which is solved with either per-call/context logger injection, or the use of thread-locals/static continuation context.
In addition, you maintain a main category for the application.
The application main category logs the overall narrative in the form of KPI and SLO events, while the per-flow categories follow the private implementation details.
Now you have replaced your single volume knob with a finer set of controls.
Not just for Filtering
This division is not meant just for filtering logs: each category can have its own warn-level.
Configuring the correct log-levels to categories wisely helps you make the most of your infrastructure and your logs quota, esp. with 3rd party log SaaS.
For example, consider the main categories set explicitly to info, while the rest of the system defaults to a base level (e.g. warn), and categories are tuned by need: code that was modified lately is tuned up, stable code is tuned down.
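That configuration can be sketched as a simple per-category level map with a fallback. The category names and the config shape are hypothetical illustrations, not any particular logging framework's format.

```javascript
// Hypothetical per-category level map: main categories pinned, the rest defaulted.
const levelConfig = {
  default: 'warn',            // base level for everything not listed
  categories: {
    'app.main': 'info',       // the KPI/SLO narrative stays visible
    'flow.checkout': 'debug', // recently modified code, tuned up
    'service.mailer': 'error',// stable code, tuned down
  },
};

function levelFor(category) {
  return levelConfig.categories[category] ?? levelConfig.default;
}

console.log(levelFor('app.main'));       // → info
console.log(levelFor('service.search')); // → warn (falls back to the default)
```

Tuning a single entry in this map is a far finer control than moving one global level for the whole process.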
This also gives you the chance to code dynamic warn-level changes for categories, e.g. in response to too many warnings, but this is not basic any more and belongs in a future post.
Gather all info, Log once
Do not allow adding a log entry where one already exists. This principle transcends code layers.
I cannot stress enough a common mistake where every layer of the implementation writes errors to the log before passing them on, and you end up with a log that shows the same propagated error over and over…
You don't need a log line from every level to know where the error came from: its call-stack should be enough. Dynamic languages let you augment the error with more information; strongly-typed languages require you to wrap it with a custom error. The principle is that all the information can be preserved.
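In JavaScript, the standard `cause` option of `Error` (ES2022) makes this wrap-instead-of-log pattern trivial. A sketch, with a simulated failure standing in for a real one:

```javascript
// Augment-and-rethrow instead of log-and-rethrow: only the owning layer logs.
function readConfig(path) {
  try {
    throw new Error('ENOENT: no such file'); // stand-in for a real fs failure
  } catch (err) {
    // Wrap with context instead of logging here; `cause` preserves the whole chain.
    throw new Error(`config load failed: ${path}`, { cause: err });
  }
}

// The top layer, the owner of the logging concern, logs the chain exactly once.
try {
  readConfig('/etc/app.conf');
} catch (err) {
  console.error(err.message);                 // config load failed: /etc/app.conf
  console.error('caused by:', err.cause.message);
}
```

One log entry carries the full story; no intermediate layer had to touch the log at all.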
The Error Threshold
Since the error level almost always appears, manage it with discipline: you have to decide which layer owns the concern of logging errors, and only that layer should reliably do so. Most commonly it's the higher layers.
Consider that you'll scan your logs for errors: every error could cross a threshold for a notification to the on-call engineer. When you emit an error from every level, flows that involve more layers gain more weight, even when they are not a business focus.
Deeper implementation layers may log errors only when they caught an error that is by design suppressed from the upper layers AND has actionable implications. Otherwise, warn is enough.
If you insist on spitting the same error from all levels, at least have the alerting system watch only the main channels…
Watch it Continuously
It's not enough that the code landed on master and got deployed. Now watch how well it performs: invest in your relationship with your logs, make sure they speak to you clearly and tell the story even under stress. Refine them so they will be there for you in your dire need.
Running locally, I play a lot with the verbosity configuration of categories, to verify I'm in line with the requirement: logs that provide comprehensive, sane, full narratives corresponding to the configured levels.
I practice verbosity control with levels on every run of my code: I do not use debuggers, I rely on my logs.
When I wrote the core of what I had to say, I saw it was way too long, so it had to be broken down: these are just the two pillars I consider the basics.
However, these basics provide solid ground upon which to build fine control over a healthy and useful log stream.
This is the 3rd post on a series about application-logs and logging culture.
Next in the series: expected in the 2nd week of January 2022.