You may be wondering why you should bother with Kubernetes logging tools.
Kubernetes dominates the container orchestration market and is often used to host microservices. Each instance of a microservice generates large numbers of log events that can quickly become difficult to manage. But worse, when something goes wrong, finding the root cause can be tough due to the complex interactions between services and the near-infinite number of possible failure modes. This potential for trouble has fueled the popularity of log management tools for Kubernetes.
But why do we have so many tools? Is there one perfect tool to cover every need and make monitoring, logging, and root cause analysis as efficient and as quick as possible? As you might have guessed, no.
The majority of Kubernetes log management tools are variations of ELK, do similar things, and have similar limitations. These tools help you access logs and search for information, but the catch is, you need to know what to look for. Most of these tools also require parsing rules and alert rules to work correctly. But I encountered one exception that doesn’t need manually created rules to automatically detect problems.
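To make "parsing rules and alert rules" concrete, here is a minimal sketch of what a manually defined rule pair might look like. The log format, the field names, and the `error_alerts` helper are all hypothetical illustrations and are not taken from any of the tools in this list.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption for this sketch), e.g.:
# "2020-06-01T12:00:01Z ERROR payment-svc timeout calling bank API"
PARSE_RULE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def parse(line):
    """Apply the parsing rule; return a dict of fields, or None if the line does not match."""
    match = PARSE_RULE.match(line)
    return match.groupdict() if match else None


def error_alerts(lines, threshold=2):
    """Alert rule: flag every service that logged at least `threshold` ERROR events."""
    errors = Counter()
    for line in lines:
        event = parse(line)
        if event and event["level"] == "ERROR":
            errors[event["service"]] += 1
    return sorted(service for service, count in errors.items() if count >= threshold)


sample = [
    "2020-06-01T12:00:00Z INFO cart-svc item added",
    "2020-06-01T12:00:01Z ERROR payment-svc timeout calling bank API",
    "2020-06-01T12:00:02Z ERROR payment-svc timeout calling bank API",
    "panic: unexpected format",  # does not match the parsing rule, so it is silently skipped
]

print(error_alerts(sample))
```

Note that the last sample line never reaches the alert rule because the parsing rule does not recognize it. That is the catch described above: rules only find what you already knew to look for.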
Read on for my list of the best logging tools for Kubernetes in 2020.
1. Zebrium
Did you expect another tool to come first? Perhaps Prometheus or ELK? Nope, I put Zebrium in first place because I think this tool has the potential to become the next big thing in Kubernetes log management.
This new startup has recently been placed on both “Gartner’s Top 25 Enterprise Software Startups To Watch In 2020” and “Forbes’ AI 50: America’s Most Promising Artificial Intelligence Companies”.
Talking about success, Zebrium has also recently helped Sweetwater to reduce incident tracking time from 3 hours to just minutes. Zebrium can even uncover hidden issues that haven’t been noticed before. This is an excellent feature as it can help to detect problems before they impact customers.
So what is it that makes Zebrium’s approach stand out from the competition? Well, they use Artificial Intelligence (AI) to find issues as well as to uncover the root cause automatically, while all other tools rely on users adding rules manually. Zebrium can also be used as a standalone log management platform, or it can integrate with the ELK Stack (they call it the ZELK Stack) or other log managers.
This sounds like a dream come true, so I gave it a test on a very simple project. In this test, Zebrium automatically detected a problem where a network call was timing out. I didn’t build any rules for this, nor did I monitor the system manually. Zebrium just picked up the issue through its ML-based algorithms and let me know immediately.
It’s also important to mention that I'm not a professional DevOps engineer, and I haven’t tested Zebrium yet on larger projects.
Pros:
- Easy to start; just copy/paste a customized helm or kubectl command.
- Automatic detection of problems and root cause without needing manual rules.
- Can be used as a standalone log management tool or as an ML Add-on to your existing log management tool such as the ELK Stack.
Cons:
- Not as well-known as its competitors.
- The free plan is limited to 500 MB a day with 3-day retention.
- Supports Kubernetes, Docker, and most common platforms but no native support for Windows yet.
Link: https://www.zebrium.com
2. Sematext
This is a solution for log management and application performance monitoring. Sematext provides full-stack visibility of a system state.
Sematext is not limited to K8s logs; it also does monitoring and alerting for K8s (on metrics and logs). Collected logs are parsed and structured automatically for several known log formats, and users can also provide patterns for custom logs. It also exposes the Elasticsearch API, so you can use any tool that works with Elasticsearch, such as Filebeat and Logstash, with Sematext too. You can use it as a variant of ELK or with the native Sematext ecosystem. The tool helps to create specific rules to monitor specific cases and catch anomalies. Clients can control and monitor all services, thanks to Sematext’s comprehensive real-time dashboard.
Pros:
- Integration with other Sematext Cloud tools like Experience and Infrastructure Monitoring.
- Configurable overage controls cost by stopping logs from being accepted.
- The flexibility of ELK.
Cons:
- Sematext widgets and Kibana cannot be mixed on one dashboard.
- Custom parsing needs to be done in the log shipper; Sematext parses only Syslog and JSON on the server side.
- Weak tracing functionality although they plan to improve it.
Link: https://sematext.com/
3. Loki by Grafana
Third place in the K8s log monitoring tools list is not for ELK, but for Loki. Loki is a multi-tenant and highly-available log aggregation tool inspired by Prometheus. This tool helps to collect logs, but users will need to build manual rules for it. Loki works with Grafana, Prometheus, and Kubernetes. Loki can make your internal processes much more efficient. For example, it saved Paytm Insider 75% on the cost of logging and monitoring. Loki achieves a lot of efficiency because it does not index the contents of your logs but instead only indexes a set of labels for each event stream.
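The label-only indexing idea can be illustrated with a toy sketch. This is not Loki's implementation or API; the `LabelIndexedStore` class and its methods are hypothetical and exist only to show the trade-off: streams are looked up by their labels, and the log content is scanned line by line at query time.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy log store that indexes streams by labels only, never by log content."""

    def __init__(self):
        # Maps a frozen set of (label, value) pairs to the raw lines of that stream.
        self.streams = defaultdict(list)

    def push(self, labels, line):
        """Append a raw line to the stream identified by its label set."""
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=""):
        """Select streams by labels, then brute-force scan their lines for a substring."""
        wanted = set(selector.items())
        results = []
        for stream_labels, lines in self.streams.items():
            if wanted <= stream_labels:  # cheap: only the small label index is consulted
                results.extend(line for line in lines if contains in line)  # content scan
        return results


store = LabelIndexedStore()
store.push({"app": "checkout", "namespace": "prod"}, "payment timeout after 30s")
store.push({"app": "checkout", "namespace": "prod"}, "order 1234 completed")
store.push({"app": "search", "namespace": "prod"}, "query timeout after 5s")

print(store.query({"app": "checkout"}, contains="timeout"))
```

The index stays small because it grows with the number of label combinations rather than with log volume, but every content search has to read through the selected streams.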
Pros:
- Large ecosystem.
- Rich visualization capabilities.
- Efficiency due to not indexing log content.
Cons:
- Not optimized for Kubernetes log management.
- Lots of manual work for building rules.
- Lack of content index potentially limits search performance.
Link: https://grafana.com/oss/loki/
4. The ELK Stack (a.k.a. the Elastic Stack)
Finally, ELK makes the list in fourth place. ELK is perhaps the best-known open-source tool for log management in general. ELK is an acronym for Elasticsearch, Logstash, and Kibana; each component takes care of a different part of the logging process. Elasticsearch is a powerful and scalable search engine, Logstash aggregates and processes logs, and Kibana provides an analysis and visualization interface that helps users make sense of data. Together they provide a comprehensive logging solution for K8s. Note that there are many variants of the ELK stack (like the EFK Stack: Elasticsearch, Fluentd, and Kibana).
ELK is used by many big companies such as Adobe, T-Mobile, and Walmart, so you can be sure of its robustness. In general, this is a reliable and well-proven tool. I put it in fourth place because of its complexity and the significant resources required for it to work.
Pros:
- The tool is well-known and has a huge community.
- Very broad platform support.
- Rich analysis and visualization capabilities in Kibana.
Cons:
- Requires complex parsing for logs and manually defined alert rules.
- Difficult to maintain at scale.
- Lots of tuning, particularly for large environments.
- Heavy resource requirements.
- Some features require a paid license.
Link: https://www.elastic.co/what-is/elk-stack
5. Google Operations (Formerly Stackdriver)
Google Operations, which you might know as Stackdriver, is the native tool for monitoring, troubleshooting, and improving application performance in tech-giant Google’s environment. It collects metrics, logs, and traces across Google Cloud and your applications. Google Operations is an equivalent of CloudWatch on AWS and, as with CloudWatch, it has both Logging and Monitoring solutions.
Cloud Logging is deeply integrated with GKE and is added by default to every GKE cluster you create. Your logs are stored in Logging’s datastore and are indexed for both searches and visualizations. Cloud Logging supports flexible queries (that can be saved), simple field explorers, and histogram visualizations and can be seamlessly integrated with other tools from Google’s infrastructure.
Pros:
- Real-time log management and analysis.
- Built-in observability of metrics at scale.
- Lots of integrations.
Cons:
- Hard to track real delay because the request goes through various levels of the Google Cloud Platform (GCP).
- Suitable only for GCP environments.
- Complicated pricing system. It is difficult to estimate in advance how much something is going to cost.
Link: https://cloud.google.com/products/operations
6. CloudWatch
CloudWatch is an AWS-native offering from Amazon Web Services. It collects both monitoring and operational data from AWS and visualizes it within a single automated dashboard. This allows you to look at and correlate logs and metrics to understand the root cause of issues. Logs can be analyzed with CloudWatch’s own purpose-built query language that supports aggregations, filters, and regex. You can also send logs to Elasticsearch via Lambda.
Overall, CloudWatch is a great choice if you already work with Amazon services. It can also be used in hybrid cloud architectures, where it uses an agent or the API to monitor on-premises resources. CloudWatch is used by plenty of big names like Airbnb, Deliveroo, 9GAG, and others. It can also save companies millions annually thanks to DynamoDB TTL.
Pros:
- Built-for-purpose to monitor AWS resources.
- Has burstable instances metrics (t2 CPU credit balance).
- Detailed monitoring and auto-scaling groups.
Cons:
- It can only be used for AWS services.
- Not a lot of customization options for dashboards.
- Doesn’t support transaction tracing.
Link: https://aws.amazon.com/cloudwatch/
7. Fluentd
Fluentd is a cross-platform open-source data collector offering a unified logging layer (but it is not a standalone log manager). It is quite a popular tool with more than 5,000 users, including Atlassian, Microsoft, and Amazon, and a client list like that suggests a high level of reliability and performance. The unified logging layer also helps you use data more efficiently and iterate on your software more quickly. Fluentd can help you process 120,000 records per second, as it did for LINE.
Pros:
- Large community and plugin ecosystem.
- Unified logging layer.
- Proven reliability and performance.
- Simple start; can be installed in less than 10 minutes.
Cons:
- Difficult to configure.
- Limited support for transforming data.
- Not a complete logging solution.
Link: https://www.fluentd.org/
Conclusion: How to Choose the Right Tool?
Firstly, I should explain why I didn’t include Prometheus on the list as I am sure you expected to see it. The reason is that this article is focused on log monitoring tools while Prometheus deals with metrics and doesn’t support logs.
So, if you are sick of manually hunting through logs for the root cause, or sick of building and managing alert rules, you should try Zebrium with its AI and ML-based algorithms. It will likely save a lot of time and free you from the laborious task of creating lots of rules. It looks like an extremely interesting approach to logging.
But if you’re looking for something more mainstream and know which alert rules to create (or you don’t trust AI), try Loki or Sematext. They’re efficient tools that will suit you if you haven’t used log monitoring tools before, and they will be particularly useful if you already use products from Grafana or Sematext Cloud/Enterprise.
If you use Google’s GCP offerings for your project, a good and fairly obvious choice is Google Operations.
If you have multiple or exotic sources for your logs, try Fluentd with its unified logging layer, but you’ll still need a logging tool. And of course, if you’re an AWS user, CloudWatch will be the natural choice for you.
In any case, I hope you’ve enjoyed the article. If you know of any other great Kubernetes log management tools, share them with me in the comments. I plan to update this article in the future.