The systems we operate at Linn comprise many microservices deployed across two hosts behind a load balancer. With so many separate processes in play, we wanted a logging solution that would let us easily locate and diagnose issues. In this post I describe the solution we arrived at and how we got there.
Our microservices (and various Windows Services for performing background tasks) are built using .NET, so we evaluated various .NET logging libraries including log4net and NLog. We opted for NLog, primarily because of its ability to be reconfigured on the fly (although we no longer make use of this feature). In my experience, logging libraries have a simple, well-defined job to do, and to a certain extent it probably doesn’t matter which one you go with, as is generally the case with IoC containers and unit testing frameworks.
We wrote a thin wrapper around NLog which gives us a simple API, and isolates us from change should we decide to switch to a different provider. Here’s the interface:
This also means our test code doesn’t need to take a dependency on NLog, only our own interface.
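To illustrate the idea of depending only on our own interface, here is a minimal sketch of the pattern. It is written in JavaScript rather than the C# our services use, and every name in it (`createLog`, `createMemoryBackend`, `processOrder`, the `write` method) is hypothetical; it is not the interface from our codebase.

```javascript
// Hypothetical thin logging wrapper: application code depends only on this
// small API, never on the underlying logging library.
const LEVELS = ["debug", "info", "warn", "error"];

// `backend` is any object exposing write(level, message, error). It stands in
// for an adapter around the real library (NLog, in our case).
function createLog(backend) {
  const log = {};
  for (const level of LEVELS) {
    log[level] = (message, error = null) => backend.write(level, message, error);
  }
  return log;
}

// In-memory backend for tests: records entries instead of writing anywhere.
function createMemoryBackend() {
  const entries = [];
  return {
    entries,
    write(level, message, error) {
      entries.push({ level, message, error });
    },
  };
}

// Example code under test: it receives the wrapper, not the library.
function processOrder(log, order) {
  if (!order.id) {
    log.error("Order rejected: missing id");
    return false;
  }
  log.info(`Order ${order.id} accepted`);
  return true;
}
```

A test can pass `createLog(createMemoryBackend())` into `processOrder` and assert on the recorded entries, without the logging library being loaded at all.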
Logging Maturity Level 0 - Log Files
Any logging is better than no logging.
Out of the box, NLog supports a reasonable set of targets, including writing to a file (with options for rotation based on timestamp, file size, etc.). So initially we set up a simple file-based log which archived each day and kept a maximum of nine days’ worth of log files. For a while we stuck with that, and it worked, to a certain extent.
Having to review files across multiple hosts started to become a pain. To make matters worse, for most of our services, our load balancer (HAProxy) is configured in round-robin mode - which meant that tracing a sequence of related log messages involved alternating back and forth between the log files from each host. We also had no mechanism for being notified of errors - we would only go chasing up log files in response to an error report from a user. Error diagnosis was reactive rather than proactive.
Logging Maturity Level 1 - Error Notification
We needed a way of being notified when an error occurred so we could monitor and diagnose issues more quickly.
I’d had some experience using a library called ELMAH in the past. ELMAH makes it pretty easy to catch unhandled errors in an ASP.NET application and log pretty much every detail you could think of. Using Nancy.Elmah it was trivial to integrate into our existing Nancy services. ELMAH includes a facility to send emails when an error occurs - which ticked the box for being notified of errors automatically.
So now we were being notified of errors via email as they happened. This was better, but it could still be improved:
- It only provided error notifications for our ASP.NET hosted services. The Windows Services were still just logging to local files.
- On a number of occasions we flooded our mail server with ELMAH error logs - not ideal.
- We still didn’t have a solution for searching or filtering through our logs.
Logging Maturity Level 2 - Logging Dashboard
What we really wanted from the beginning was an aggregated dashboard through which we could easily search and filter our logs across all services.
Our system administrator suggested that we should send our log messages to a Syslog server. He set us up with the common trio of Logstash (for collecting messages via Syslog), Elasticsearch (for indexing log messages) and Kibana (for building a logging dashboard).
NLog does not have a syslog target out of the box, but there were a few on GitHub. Unfortunately the one which had the features I wanted was not available through NuGet at the time (this has since been rectified), but the code wasn’t complicated, so I opted just to write our own target. If you’re looking for a solution then I’d recommend NLog.Targets.Syslog.
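To give a feel for why a syslog target is not much code, here is a rough sketch of the core job: turn a log entry into a syslog line and send it over UDP. It is JavaScript for Node.js rather than our actual C# target, it assumes the RFC 5424 message layout, and the level-to-severity mapping, the `local0` facility and the function names are my own assumptions for illustration.

```javascript
const dgram = require("dgram");

// Syslog severities from RFC 5424. Which log level maps to which severity
// is an assumption made for this sketch.
const SEVERITY = { fatal: 2, error: 3, warn: 4, info: 6, debug: 7, trace: 7 };
const FACILITY_LOCAL0 = 16; // assumed facility; any valid facility would do

// Build an RFC 5424 style line:
// <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID STRUCTURED-DATA MSG
function formatSyslog({ level, message, host, app, timestamp = new Date() }) {
  const severity = SEVERITY[level];
  if (severity === undefined) {
    throw new Error(`Unknown level: ${level}`);
  }
  const pri = FACILITY_LOCAL0 * 8 + severity; // PRI = facility * 8 + severity
  // "-" is the nil value for PROCID, MSGID and STRUCTURED-DATA.
  return `<${pri}>1 ${timestamp.toISOString()} ${host} ${app} - - - ${message}`;
}

// Fire-and-forget send over UDP; the socket is closed once the datagram is out.
function sendSyslog(entry, { address, port = 514 }) {
  const socket = dgram.createSocket("udp4");
  const payload = Buffer.from(formatSyslog(entry), "utf8");
  socket.send(payload, port, address, () => socket.close());
}
```

A real target would reuse one socket rather than opening one per message, and would need to decide what to do when the syslog server is unreachable.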
With the NLog syslog target in place, we were able to direct all our log messages to syslog, including messages from our Windows Services. Using Kibana, we could very quickly throw together a dashboard (multiple dashboards in fact) which gave us almost real-time visibility of the state of our services. We put the dashboard on permanent display on a spare desk next to the team.
We then found that because our logs were so much more accessible, we started doing a lot more informational and debug logging. Our dashboards enabled us to filter through log messages from all services and hosts, which greatly enhanced our ability to respond to errors and diagnose them. We were now in a much better position: we could proactively diagnose and fix errors, and had immediate visibility of the health of our services.
However, there was still one thing missing.
Logging Maturity Level 3 - Client Side Logging
In a similar fashion to the thin wrapper we built around NLog, I wrote a small wrapper around TraceKit, which provides the API I was after and (optional) AMD support. The wrapper is called Lawgr and can be found on GitHub, Bower, npm and NuGet.
Lawgr exposes the following API:
By default Lawgr only logs to the browser console, but it can be reconfigured to also send messages to a remote API:
This lets us use logging both for local development and production support/monitoring.
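The general pattern of logging to the console by default, with an optional remote sink switched on by configuration, can be sketched as follows. This is not Lawgr's actual API: `createClientLog`, `configure`, `httpSink` and the entry shape are hypothetical names chosen for illustration, and the URL passed to `httpSink` would be whatever endpoint your own logging API exposes.

```javascript
// Hypothetical client-side logger: writes to the console by default and,
// once configured, also forwards each entry to a remote sink.
function createClientLog({ output = console, send = null } = {}) {
  let remote = send;

  function write(level, message, data) {
    // Always log locally, which is what we want during development.
    if (data === undefined) {
      output[level](message);
    } else {
      output[level](message, data);
    }
    // Forward to the remote sink only if one has been configured.
    if (remote) {
      remote({ level, message, data, timestamp: new Date().toISOString() });
    }
  }

  return {
    debug: (message, data) => write("debug", message, data),
    info: (message, data) => write("info", message, data),
    warn: (message, data) => write("warn", message, data),
    error: (message, data) => write("error", message, data),
    // Reconfigure at runtime, e.g. enable remote logging in production only.
    configure(options) {
      remote = options.send || null;
    },
  };
}

// One possible remote sink: POST each entry as JSON to the given URL.
function httpSink(url) {
  return (entry) =>
    fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(entry),
    }).catch(() => {
      // Never let a logging failure break the page.
    });
}
```

In development you would call `createClientLog()` and get console output only; in production, something like `log.configure({ send: httpSink(url) })` would add the remote copy.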
And that’s where we are today. We went from file-based logging to a centralised logging dashboard encompassing traffic from multiple services, hosts and languages. With our dashboards we can spot errors and warnings at a glance, and can quickly drill down to view the appropriate details.