Troubleshooting cloud services and infrastructure is an ongoing challenge for organizations of all sizes. As organizations adopt more cloud services and their cloud environments grow more complex, they naturally produce more telemetry data – including application, system and security logs that document all types of events. All cloud services and infrastructure components generate their own, distinct logs.
Troubleshooting a problem, at times, can feel like searching for a needle in a haystack. Sifting through massive amounts of log data can be both unproductive and impractical. Finding the root cause of an issue often requires a lengthy investigation into raw log data, without knowing which specific event triggered the problem in the first place.
Even when real-time cloud observability systems are in place, teams often still have to sift through historical log data to determine what went wrong. What’s more, the data retention windows on systems like security information and event management (SIEM) and other monitoring and observability tools are often less than 30 days; this timespan is mostly adequate for day-to-day operations, but gaining meaningful insight into the source of a persistent or longstanding issue requires months or more of log data.
Applying log analytics can help reduce some of the headaches associated with troubleshooting common cloud infrastructure and services issues. Uncovering these issues faster can help improve incident management KPIs, which include mean time to know (MTTK), mean time to repair (MTTR), and mean time between failures (MTBF), among others.
In this article, we’ll dive into log analytics, why it’s important, types of log data, common cloud infrastructure issues, best practices for cloud troubleshooting, and how to effectively store and query your logs in the event of an issue.
The proliferation of cloud computing services and infrastructure has led to an explosion of log data. This log data is crucial to understanding cloud performance and security issues alike. DevOps teams are responsible for dealing with issues in code and the connection between code and the cloud production environment. Log analytics software solutions are used to collect, aggregate, analyze, and visualize computer-generated log data from sources throughout the IT environment.
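To make the collect-and-aggregate step concrete, here is a minimal Python sketch that tallies JSON-formatted log events by severity. The file path and field names are hypothetical, and a real log analytics platform performs this kind of aggregation continuously, across many sources, and at far greater scale.

```python
import json
from collections import Counter

# Hypothetical example: aggregate JSON-lines application logs by severity.
# The file path and field names are assumptions for illustration only.
LOG_FILE = "app-logs.jsonl"  # hypothetical path

def aggregate_by_severity(path: str) -> Counter:
    counts = Counter()
    with open(path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            counts[event.get("severity", "UNKNOWN")] += 1
    return counts

if __name__ == "__main__":
    for severity, count in aggregate_by_severity(LOG_FILE).most_common():
        print(f"{severity}: {count}")
```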
Key capabilities of log analytics solutions include:
DevOps and SecOps teams use log analytics for forensic analysis, and to monitor cloud environments that support systems and applications. Let’s learn more about the typical types of log data used for troubleshooting in AWS and Google Cloud.
Cloud monitoring services often capture metrics, metadata and events that can help inform DevOps teams about the status of cloud-based applications and infrastructure. Many types of logs are used for troubleshooting cloud services and infrastructure, including:
Within AWS specifically, there are several types of logs DevOps teams typically monitor for log analytics, including:
Beyond AWS, each cloud has its own services for which logs are generated. Applications, containers, microservices, compute nodes, and other components also generate logs, which add to the volume of data through which DevOps teams must sift to find meaningful insights.
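As a deliberately simplified example of working with these AWS logs programmatically, the following Python sketch uses boto3 to pull recent error events from a single CloudWatch Logs group. The log group name and filter pattern are assumptions; production tooling would handle pagination and query many groups at once.

```python
import time
import boto3

# Minimal sketch: fetch recent "ERROR" events from one CloudWatch Logs group.
# The log group name and filter pattern are hypothetical; substitute the
# groups your own services write to. Pagination is omitted for brevity.
logs = boto3.client("logs")

LOG_GROUP = "/aws/lambda/my-service"   # hypothetical log group
LOOKBACK_MS = 15 * 60 * 1000           # last 15 minutes

now_ms = int(time.time() * 1000)
response = logs.filter_log_events(
    logGroupName=LOG_GROUP,
    startTime=now_ms - LOOKBACK_MS,
    endTime=now_ms,
    filterPattern="ERROR",             # CloudWatch Logs filter syntax
)

for event in response.get("events", []):
    print(event["timestamp"], event["message"].strip())
```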
Related Content: How Log Analytics Powers Cloud Operations: Three Best Practices for CloudOps Engineers
While there are thousands of things that could go wrong within a cloud environment, the most typical cloud issues include the following:
Troubleshooting these issues requires different approaches, yet some common best practices can be applied across the board.
Watch: Why and How Log Analytics Makes Cloud Operations Smarter
Once you know there’s a problem, there are important steps you must take to mitigate the risks of an issue persisting. Unlike on-premises IT troubleshooting, troubleshooting cloud infrastructure within the shared responsibility model of public cloud providers requires clearly communicating to your provider what you’ve already done to resolve the issue yourself. Google outlines typical cloud troubleshooting best practices for site reliability engineers (SREs), which include:
While working with a cloud provider on an issue means that you lose some element of control over the situation at hand, it’s critical to maintain a time-stamped record of troubleshooting steps you’ve taken so far, with screenshots and any relevant log snippets or other documentation attached.
Many companies may be looking for low cost, efficient ways to perform log analytics on their long-term data. This data can reveal critical insights on recurring patterns in your cloud infrastructure, which you can leverage to optimize cloud performance, security, and more. Often, detecting ongoing cloud issues requires more than in-the-moment data available via monitoring and observability platforms.
Using a solution like ChaosSearch, you can turn your cloud object storage such as S3 or Google Cloud Storage (GCS) into an analytical data lake, where you can query your log data directly without complex ETL processes or data movement. In addition, you can integrate ChaosSearch with popular cloud services to track trends and diagnose problems.
It’s also very common for organizations to maintain short-term operational data in their cloud monitoring and observability tools, but export longer term data into S3 or GCS for cheaper, longer-term storage. An Elasticsearch service can be used to analyze that data, but even a managed Elasticsearch service experiences performance and reliability issues at scale. ChaosSearch can replace your Elasticsearch service with a more cost-efficient, scalable log analytics platform. With ChaosSearch, customers can perform scalable log analytics on AWS S3 or GCS, using the familiar Elasticsearch API for queries, and Kibana for log analytics and visualizations, while reducing costs and improving analytical capabilities.
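Because queries go through an Elasticsearch-compatible API, existing query DSL skills carry over. The sketch below is a hypothetical example of aggregating error events by service over the past week; the endpoint, index name, authentication, and field names are placeholders for illustration, not actual ChaosSearch defaults.

```python
import requests

# Hypothetical query against an Elasticsearch-compatible search endpoint.
# Endpoint URL, index name, field names, and auth (omitted here) are
# assumptions; adapt them to your own environment.
ENDPOINT = "https://your-search-endpoint.example.com"  # placeholder
INDEX = "cloudtrail-logs"                              # placeholder index/view

query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-7d"}}},
                {"match": {"level": "ERROR"}},
            ]
        }
    },
    "aggs": {
        "errors_by_service": {"terms": {"field": "service.keyword", "size": 10}}
    },
}

resp = requests.post(f"{ENDPOINT}/{INDEX}/_search", json=query, timeout=30)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["errors_by_service"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```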
Read: Elasticsearch Replacement for Log Analytics at Scale
Using Kibana and its alerting features, you can monitor your AWS and Google Cloud environments, and other cloud services directly from the logs you’re ingesting in ChaosSearch, setting thresholds and automating alerts and actions.
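For illustration, the following sketch shows the kind of threshold logic such an alert automates: count error-level events in a recent window and flag when the count crosses a limit. The endpoint, index, field names, and threshold value are assumptions for demonstration, and in practice Kibana's alerting rules would run this check on a schedule and trigger actions for you.

```python
import requests

# Hypothetical threshold check against an Elasticsearch-compatible _count API.
# Endpoint, index, field names, and threshold are placeholders; Kibana
# alerting automates this pattern rather than requiring a manual script.
ENDPOINT = "https://your-search-endpoint.example.com"  # placeholder
INDEX = "app-logs"                                     # placeholder
THRESHOLD = 50                                         # errors per 5 minutes

count_query = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-5m"}}},
                {"match": {"level": "ERROR"}},
            ]
        }
    }
}

resp = requests.post(f"{ENDPOINT}/{INDEX}/_count", json=count_query, timeout=30)
resp.raise_for_status()
error_count = resp.json()["count"]

if error_count > THRESHOLD:
    print(f"ALERT: {error_count} errors in the last 5 minutes (limit {THRESHOLD})")
else:
    print(f"OK: {error_count} errors in the last 5 minutes")
```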
A centralized log management solution is now a must for any company operating in the cloud. Container services alone create a blizzard of logs. With improved log coverage, it is not only simpler to find that needle in the haystack; you also gain a more complete picture of the health of your cloud, which can help prevent the needle from ending up in the haystack in the first place.