Software breaks in production. When it breaks, you need all the information about how and why it broke. Collecting this data is often a simple task, but choosing the right tool determines how much you get out of it.
Simple log files
The simplest way to track the behavior of production systems is to log to disk and periodically check the output. From operating systems to common services like Apache, almost all software has a mechanism for logging debug information to disk. Configuring these log files is an important first step, but it comes with its own set of problems. Deciding when to rotate logs (by date, by size, etc.) and what log level to use are common issues. Accessing the logs is also difficult in a system with many servers or in a restricted-access environment.
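As a rough illustration of the rotation and log level decisions, here is a minimal sketch using Python's standard `logging` module. The logger name, size limit, backup count, and file location are all assumptions chosen for the example, not recommendations.

```python
import logging
import logging.handlers
import os
import tempfile


def build_logger(log_path, max_bytes=1_000_000, backup_count=5, level=logging.INFO):
    """Return a logger that writes to log_path and rotates the file by size."""
    logger = logging.getLogger("app")  # hypothetical logger name
    logger.setLevel(level)             # messages below this level are dropped
    logger.handlers.clear()            # avoid duplicate handlers on repeated calls
    handler = logging.handlers.RotatingFileHandler(
        log_path,
        maxBytes=max_bytes,        # roll over once the file would exceed this size
        backupCount=backup_count,  # keep app.log.1 ... app.log.5, delete older files
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


# Write to a temporary directory so the sketch runs anywhere.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

logger = build_logger(log_path)
logger.debug("not written: below the INFO threshold")
logger.info("service started")
logger.error("could not reach database")
```

Rotating by date instead of size would use `logging.handlers.TimedRotatingFileHandler` from the same module. Either way, the files still live on one machine, which is the access problem described above.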
For logging to be really effective, we want the following features:
- Central storage with access controls
- The ability to search, sort, and aggregate logs (run queries over them)
- Easily expandable storage
- Easy rotation of old log data
To support these features, platforms have been developed for processing and aggregating logs in standard formats. I won’t cover the individual features and benefits of each log aggregation platform here (there are too many for one blog post), but popular options include:
- ELK stack – Elasticsearch, Logstash, Kibana
Each of these tools supports a common feature set for querying historical data, finding trends, and building dashboards. They also all lack something important. While log aggregators can give you an accurate picture of the behavior of your application in production, they struggle to provide a view of exceptional conditions. Put another way, log aggregators can’t show you what’s most important among all the log messages right now.
Log aggregators are helpful for answering questions like:
- How many times has a log message like this shown up in the last hour or last day?
- When was the last time something like this occurred?
- How many requests are initiated in a minute? In an hour?
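The three questions above can be sketched as small queries over parsed log lines. This is only an illustration: the line format (`ISO timestamp, level, message`), the sample data, and the function names are assumptions for the example, and a real aggregator would run equivalent queries against indexed storage rather than a Python list.

```python
from collections import Counter
from datetime import datetime, timedelta


def parse_line(line):
    """Split an assumed 'TIMESTAMP LEVEL MESSAGE' line into its parts."""
    timestamp, level, message = line.split(" ", 2)
    return datetime.fromisoformat(timestamp), level, message


def count_since(lines, now, window, needle):
    """How many messages containing needle appeared within the window?"""
    cutoff = now - window
    return sum(
        1
        for ts, _, msg in map(parse_line, lines)
        if ts >= cutoff and needle in msg
    )


def last_seen(lines, needle):
    """When did a message containing needle last occur? None if never."""
    matches = [ts for ts, _, msg in map(parse_line, lines) if needle in msg]
    return max(matches) if matches else None


def requests_per_minute(lines):
    """Count request messages, bucketed by the minute they were logged."""
    counts = Counter()
    for ts, _, msg in map(parse_line, lines):
        if msg.startswith("request "):
            counts[ts.replace(second=0, microsecond=0)] += 1
    return counts


# Made-up sample data for the sketch.
sample = [
    "2024-05-01T11:10:00 ERROR timeout talking to db",
    "2024-05-01T12:29:10 INFO request GET /users",
    "2024-05-01T12:29:40 INFO request GET /orders",
    "2024-05-01T12:30:05 ERROR timeout talking to db",
    "2024-05-01T12:45:00 INFO request POST /orders",
]
now = datetime(2024, 5, 1, 13, 0, 0)

timeouts_last_hour = count_since(sample, now, timedelta(hours=1), "timeout")
last_timeout = last_seen(sample, "timeout")
per_minute = requests_per_minute(sample)
```

Every one of these queries looks backward over stored messages and treats each line as equally important, which is why this kind of tool is good at trends but not at prioritizing what is broken.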
Exception trackers focus on answering questions like:
- How many active issues are in production right now? How many times has that issue been triggered?
- How many users is each issue impacting?
- Given an issue, what is the associated stack trace?
- What is the ticket number (in an issue tracking system such as Jira) associated with this exception?
Unlike log aggregation platforms, exception trackers are focused solely on the problematic parts of your application. They give you a way to triage issues, link them to the other tools in your ecosystem, and give your developers all the information they need to fix the issue in one convenient place. Most exception trackers can also integrate with incident response services (or even ship with their own) to alert your Ops team to issues as soon as they start.
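To make the difference concrete, here is a toy sketch of the core idea behind an exception tracker: group repeated exceptions into one issue, then record how often it fires, which users it affects, and its stack trace. The `Issue` and `IssueTracker` classes and the fingerprinting rule are hypothetical and invented for this example. They do not describe how any particular product groups events.

```python
import hashlib
import traceback
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Issue:
    title: str
    stacktrace: str
    count: int = 0
    users: set = field(default_factory=set)
    ticket: Optional[str] = None  # hypothetical link to an issue tracker ticket


class IssueTracker:
    """Toy in-memory tracker that groups exceptions by a fingerprint."""

    def __init__(self):
        self.issues = {}

    def capture(self, exc, user_id):
        # Assumed grouping rule: same exception type raised through the same
        # chain of functions counts as the same issue (line numbers ignored).
        frames = traceback.extract_tb(exc.__traceback__)
        key = type(exc).__name__ + "|" + "|".join(
            f"{frame.filename}:{frame.name}" for frame in frames
        )
        fingerprint = hashlib.sha256(key.encode()).hexdigest()[:12]

        issue = self.issues.get(fingerprint)
        if issue is None:
            issue = Issue(
                title=f"{type(exc).__name__}: {exc}",
                stacktrace="".join(
                    traceback.format_exception(type(exc), exc, exc.__traceback__)
                ),
            )
            self.issues[fingerprint] = issue
        issue.count += 1
        issue.users.add(user_id)
        return fingerprint


def charge(amount):
    """Stand-in for application code that can fail."""
    if amount <= 0:
        raise ValueError("amount must be positive")


tracker = IssueTracker()
for user_id, amount in [("alice", 0), ("bob", -5), ("alice", -1)]:
    try:
        charge(amount)
    except ValueError as exc:
        tracker.capture(exc, user_id)
```

After the loop, the three failures collapse into a single issue with a count of three, two affected users, and one stored stack trace. That is the triage view a plain log file cannot offer without extra work.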
Within the broad ecosystem of exception trackers, one tool has emerged as a best-of-breed solution: Sentry. Sentry boasts a comprehensive feature set, an intuitive and helpful UI, and integrations with many languages and platforms. It is completely open source and can even be self-hosted.
In the next post, I’ll cover options for deploying Sentry in your organization, so stay tuned.