Troubleshooting cloud services and infrastructure is an ongoing challenge for organizations of all sizes. As organizations adopt more cloud services and their cloud environments grow more complex, they naturally produce more telemetry data – including application, system and security logs that document all types of events. All cloud services and infrastructure components generate their own, distinct logs.
Troubleshooting a problem, at times, can feel like searching for a needle in a haystack. Sifting through massive amounts of log data can be both unproductive and impractical. Finding the root cause of an issue often requires a lengthy investigation into raw log data, without knowing which specific event triggered the problem in the first place.
Even when real-time cloud observability systems are in place, teams often still have to sift through historical log data to determine what went wrong. What’s more, the data retention windows on systems like security information and event management (SIEM) and other monitoring and observability tools are often less than 30 days; this timespan is mostly adequate for day-to-day operations, but gaining meaningful insight into the source of a persistent or longstanding issue requires months or more of log data.
Applying log analytics can help reduce some of the headaches associated with troubleshooting common cloud infrastructure and services issues. Uncovering these issues faster can help improve incident management KPIs, which include mean time to know (MTTK), mean time to repair (MTTR), and mean time between failures (MTBF), among others.
In this article, we’ll dive into log analytics, why it’s important, types of log data, common cloud infrastructure issues, best practices for cloud troubleshooting, and how to effectively store and query your logs in the event of an issue.
The proliferation of cloud computing services and infrastructure has led to an explosion of log data. This log data is crucial to understanding cloud performance and security issues alike. DevOps teams are responsible for dealing with issues in code and the connection between code and the cloud production environment. Log analytics software solutions are used to collect, aggregate, analyze, and visualize computer-generated log data from sources throughout the IT environment.
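To make the collect-and-aggregate step concrete, here is a minimal Python sketch that tallies JSON-formatted log events by severity. The file path and field names are hypothetical, and a real log analytics platform performs this kind of aggregation continuously, across many sources, and at far greater scale.

```python
import json
from collections import Counter

# Hypothetical example: aggregate JSON-lines application logs by severity.
# The file path and field names are assumptions for illustration only.
LOG_FILE = "app-logs.jsonl"  # hypothetical path

def aggregate_by_severity(path: str) -> Counter:
    counts = Counter()
    with open(path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            counts[event.get("severity", "UNKNOWN")] += 1
    return counts

if __name__ == "__main__":
    for severity, count in aggregate_by_severity(LOG_FILE).most_common():
        print(f"{severity}: {count}")
```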
Key capabilities of log analytics solutions include:
DevOps and SecOps teams use log analytics for forensic analysis, and to monitor cloud environments that support systems and applications. Let’s learn more about the typical types of log data used for troubleshooting in AWS and Google Cloud.
Cloud monitoring services often capture metrics, metadata and events that can help inform DevOps teams about the status of cloud-based applications and infrastructure. Many types of logs are used for troubleshooting cloud services and infrastructure, including:
Within AWS specifically, there are several types of logs DevOps teams typically monitor for log analytics, including:
Beyond AWS, each cloud has its own services for which logs are generated. Applications, containers, microservices, compute nodes, and other components also generate logs, which add to the volume of data through which DevOps teams must sift to find meaningful insights.
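As a deliberately simplified example of working with these AWS logs programmatically, the following Python sketch uses boto3 to pull recent error events from a single CloudWatch Logs group. The log group name and filter pattern are assumptions; production tooling would handle pagination and query many groups at once.

```python
import time
import boto3

# Minimal sketch: fetch recent "ERROR" events from one CloudWatch Logs group.
# The log group name and filter pattern are hypothetical; substitute the
# groups your own services write to. Pagination is omitted for brevity.
logs = boto3.client("logs")

LOG_GROUP = "/aws/lambda/my-service"   # hypothetical log group
LOOKBACK_MS = 15 * 60 * 1000           # last 15 minutes

now_ms = int(time.time() * 1000)
response = logs.filter_log_events(
    logGroupName=LOG_GROUP,
    startTime=now_ms - LOOKBACK_MS,
    endTime=now_ms,
    filterPattern="ERROR",             # CloudWatch Logs filter syntax
)

for event in response.get("events", []):
    print(event["timestamp"], event["message"].strip())
```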
Related Content: How Log Analytics Powers Cloud Operations: Three Best Practices for CloudOps Engineers
While there are thousands of things that could go wrong within a cloud environment, the most typical cloud issues include the following:
Troubleshooting these issues requires different approaches, yet some common best practices can be applied across the board.
Watch: Why and How Log Analytics Makes Cloud Operations Smarter
Once you know there’s a problem, there are important steps you must take to mitigate the risks of an issue persisting. Unlike on-premises IT troubleshooting, troubleshooting cloud infrastructure within the shared responsibility model of public cloud providers requires clearly communicating to your provider what you’ve already done to resolve the issue yourself. Google outlines typical cloud troubleshooting best practices for site reliability engineers (SREs), which include:
While working with a cloud provider on an issue means that you lose some element of control over the situation at hand, it’s critical to maintain a time-stamped record of troubleshooting steps you’ve taken so far, with screenshots and any relevant log snippets or other documentation attached.
Many companies may be looking for low cost, efficient ways to perform log analytics on their long-term data. This data can reveal critical insights on recurring patterns in your cloud infrastructure, which you can leverage to optimize cloud performance, security, and more. Often, detecting ongoing cloud issues requires more than in-the-moment data available via monitoring and observability platforms.
Using a solution like ChaosSearch, you can turn your cloud object storage such as S3 or Google Cloud Storage (GCS) into an analytical data lake, where you can query your log data directly without complex ETL processes or data movement. In addition, you can integrate ChaosSearch with popular cloud services to track trends and diagnose problems.
It’s also very common for organizations to maintain short-term operational data in their cloud monitoring and observability tools, but export longer term data into S3 or GCS for cheaper, longer-term storage. An Elasticsearch service can be used to analyze that data, but even a managed Elasticsearch service experiences performance and reliability issues at scale. ChaosSearch can replace your Elasticsearch service with a more cost-efficient, scalable log analytics platform. With ChaosSearch, customers can perform scalable log analytics on AWS S3 or GCS, using the familiar Elasticsearch API for queries, and Kibana for log analytics and visualizations, while reducing costs and improving analytical capabilities.
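Because queries go through an Elasticsearch-compatible API, existing query DSL skills carry over. The sketch below is a hypothetical example of aggregating error events by service over the past week; the endpoint, index name, authentication, and field names are placeholders for illustration, not actual ChaosSearch defaults.

```python
import requests

# Hypothetical query against an Elasticsearch-compatible search endpoint.
# Endpoint URL, index name, field names, and auth (omitted here) are
# assumptions; adapt them to your own environment.
ENDPOINT = "https://your-search-endpoint.example.com"  # placeholder
INDEX = "cloudtrail-logs"                              # placeholder index/view

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-7d"}}},
                {"match": {"level": "ERROR"}},
            ]
        }
    },
    "aggs": {
        "errors_by_service": {"terms": {"field": "service.keyword", "size": 10}}
    },
}

resp = requests.post(f"{ENDPOINT}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["errors_by_service"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```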
Read: Elasticsearch Replacement for Log Analytics at Scale
Using Kibana and its alerting features, you can monitor your AWS and Google Cloud environments, and other cloud services directly from the logs you’re ingesting in ChaosSearch, setting thresholds and automating alerts and actions.
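For illustration, the following sketch shows the kind of threshold logic such an alert automates: count error-level events in a recent window and flag when the count crosses a limit. The endpoint, index, field names, and threshold value are assumptions for demonstration, and in practice Kibana's alerting rules would run this check on a schedule and trigger actions for you.

```python
import requests

# Hypothetical threshold check against an Elasticsearch-compatible _count API.
# Endpoint, index, field names, and threshold are placeholders; Kibana
# alerting automates this pattern rather than requiring a manual script.
ENDPOINT = "https://your-search-endpoint.example.com"  # placeholder
INDEX = "app-logs"                                     # placeholder
THRESHOLD = 50                                         # errors per 5 minutes

count_query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-5m"}}},
                {"match": {"level": "ERROR"}},
            ]
        }
    }
}

resp = requests.post(f"{ENDPOINT}/{INDEX}/_count", json=count_query, timeout=30)
resp.raise_for_status()
error_count = resp.json()["count"]

if error_count > THRESHOLD:
    print(f"ALERT: {error_count} errors in the last 5 minutes (limit {THRESHOLD})")
else:
    print(f"OK: {error_count} errors in the last 5 minutes")
```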
A centralized log management solution is now a must for any company operating in the cloud. Container services alone create a blizzard of logs. With improved log coverage, it is not only simpler to find that needle in the haystack; you also gain a more complete picture of the health of your cloud, which can help prevent the needle from ending up in the haystack in the first place.