If you’re an Amazon Web Services (AWS) user, you’re probably familiar with some of Amazon’s native services available for logging and monitoring, such as CloudWatch and CloudTrail. With that said, log management can get complicated quickly, especially if you’re dealing with a high volume of logs from AWS Lambda functions or a multi-cloud/hybrid cloud environment.
In this blog, we’ll introduce the basics of AWS logging along with five tips and best practices for optimizing your log management in AWS.
AWS logging is the practice of aggregating, normalizing, indexing, and analyzing log data from cloud-based applications on AWS.
Log data consists of machine-generated records of events, user activity, and transactions that take place within an application or cloud service. Common types of log data include:
Each time an event or transaction takes place on a logging-enabled system, the newly generated log data is written into a log file that establishes a chronological record of everything that happened within the application or service.
Log files from a variety of sources can be aggregated and analyzed to support a variety of different use cases:
Getting logging right starts with how well you know the strengths and limitations of native AWS monitoring tools. Many times, you can build a cloud-native observability stack that complements these tools in areas where they become difficult to navigate. Here are five AWS logging best practices you can follow for analyzing logs more effectively.
CloudWatch is Amazon's primary monitoring and logging service built into the AWS cloud. You can move logs to CloudWatch logs from all of your AWS services and workloads – from CloudTrail and EKS, to Route53, to individual EC2 instances running the CloudWatch agent, to Lambda functions. CloudWatch, through CloudWatch Logs Insights, provides basic search and analytics capabilities, such as visualizations, to help you interpret log and metrics data. And it lets you configure alarms, which can alert you to anomalies or sudden changes in workload performance patterns.
While AWS CloudWatch works well for basic monitoring and alerts, on its own, it may not be the best solution for log data at scale. Limitations include user interface and scalability issues can hold users back from utilizing logs in CloudWatch for troubleshooting use cases.
AWS CloudWatch collects and visualizes telemetry from cloud resources and applications in real-time.
CloudTrail is different from CloudWatch because it keeps a record of every API call within your AWS account (vs. the application- and service-level logs and metrics in CloudWatch). CloudTrail gives you deeper visibility into the activity within your account, helping you see who did what and when. You can use the CloudTrail logs not only to track the security of the user access but also for operational troubleshooting.
Historically, the native way provided by AWS to analyze CloudTrail data was a simple UI that lets you run some basic searches on events over the last 90 days worth of data. However, early in 2022, AWS delivered CloudTrail Data Lake, an additional service to store and enable querying of CloudTrail data using the familiar SQL query language.
Amazon CloudTrail collects records of user activity and API calls to AWS services.
Even with both of these tools in play, you may still face some of these common AWS monitoring challenges.
Watch this quick demo to see how ChaosSearch can help:
Read: Going Beyond CloudWatch: 5 Steps to Better Log Analytics & Analysis
The first step to effective log analysis is setting up your AWS data lake the right way. Configure it to ingest and store raw data in its source format – before any cleaning, processing, or data transformation takes place.
Storing data in its raw format gives you the opportunity to interrogate your data and generate new use cases for enterprise data. The scalability and cost-effectiveness of Amazon S3 data storage means that you can retain your data in the cloud for long periods of time, and query it months, or even years down the road.
When your data lake stores logs in its raw format, it also means that nothing is lost. As a result, your AWS data lake becomes the single source of truth for all the raw data you ingest.
Architectural diagram for a foundational enterprise data lake. AWS customers can deploy and manage the solution themselves, or with support from AWS managed services.
One thing to note: Amazon S3 offers different classes of cloud storage, each one cost-optimized for a specific access frequency or use case. Generally speaking, Amazon S3 Standard is a solid option for your data ingest bucket, where you’ll be sending raw structured and unstructured data from both your cloud and on-prem applications.
Data retention policies establish how long your log data will be retained in the index before it is automatically expired and deleted. These policies also determine how much historic log data will be available for analysis at any given time. So, a data retention period of 90 days means that developers and security teams will have access to a rolling 90-day window of indexed log data for analysis.
Shorter data retention windows result in lower data storage costs, but the trade-off is that you quickly lose access to analyze older log data that could support long-term log analytics use cases ranging from advanced persistent threat detection to incident investigation, long-term user trends, root cause analysis, among others. On the other hand, retaining log data for longer provides deeper access to retrospective log data and better support for long-term log analytics use cases. If you’re using tools like CloudWatch or the ELK stack – costs and performance can degrade as log data volume increases. So many people delete data they otherwise need for analysis.
That doesn’t have to be the case. Gaining total observability of your network security posture and application performance requires a centralized log management solution that enforces a small storage footprint for your logs and supports security operations, DevOps, and CloudOps use cases.
Many companies address their observability needs of full, centralized visibility into metrics, logs and traces by buying a holistic application performance management (APM) solution from one vendor, such as DataDog.
However, if you do that, you may find that while centralizing all telemetry in a single platform works well at first, it creates significant challenges at scale. The underlying data technologies for monitoring, troubleshooting and trend analysis/reporting are fundamentally different, often leading to ballooning costs, reduced data retention, increased operational burden and limited ability to answer relevant analytics questions.
To avoid these challenges and reduce DataDog costs, a better approach may be to build a best-of-breed observability solution made up of open source tools and open APIs. The Cloud Native Computing Foundation (CNCF) provides a good basis from which to start with its OpenTelemetry project. This open-source project is a collection of tools, APIs, and SDKs you can use to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your applications’ performance and behavior.
So, instead of relying on a single tool to explore all telemetry, a cloud-native observability stack might look something like this.
Source: Building a Cost-Effective Full Observability Solution Around Open APIs and CNCF Projects
As mentioned above, many companies have a multi-cloud or hybrid cloud strategy that makes log analysis in CloudWatch difficult to manage. Even if all of your workloads run in AWS, you may not be able to collect and analyze as much data from them as you would prefer, since CloudWatch only supports specific predefined log and metrics types.
To extend your log analytics strategy beyond CloudWatch, deploy a logging solution like ChaosSearch that gives more control over the log and metrics data you expose, as well as how you interact with it. There are typically two ways to do this:
To sum up, it’s tempting to go for the default AWS tools like CloudWatch and CloudTrail to manage your log data, but they may not be enough – particularly if you have a complex serverless or microservices-based architecture, or if you are operating in a multi-cloud or hybrid cloud environment.
There are many reasons why this is true – from scalability and usability issues, to cost. To make the most of your logs and reduce AWS log costs, ideally you should centralize them (along with log files from other sources) using AWS S3 as a data lake. From there, you can rely on a best-of-breed observability approach to understand all of your telemetry – including logs, metrics, and traces.