How to Quickly Analyze CloudFront Cloud Logs in Amazon S3
Content delivery networks (CDNs) such as Amazon CloudFront generate a flood of log files. With customers spread around the globe, it's important to keep your website's application assets as close to users as possible.
Amazon makes it easy to enable logging for a specific CloudFront distribution and will automatically send the logs to an Amazon S3 bucket of your choosing. Unfortunately, to get any value out of those log files, you typically need to ingest them into a separate database, such as OpenSearch or Amazon Redshift. Maybe you are trying to track and analyze bandwidth per distribution. Or perhaps you are trying to identify bot traffic by analyzing your top user agent strings per endpoint.
Regardless of the goal, it can be difficult to extract the detailed information you need from your logs without building complex data pipelines or moving data out of S3. Let's look at the most common use cases for CloudFront logging, the key challenges with analyzing this data, and a simplified approach to CloudFront troubleshooting and analysis in S3.
Why Analyzing CloudFront Logs is Important
CloudFront logging and monitoring is important because it can give you visibility into your network traffic and content delivery performance. You can use CloudFront log files to understand which regions or parts of your website might need extra bandwidth or services. In addition, you can troubleshoot issues with your website by looking at historical logs for anomalies. For example, you can detect suspicious network activity or network security issues such as DDoS attacks.
CloudFront has four different types of log files, each useful for different purposes. Let’s review each of them, and how they can be used together for more effective log analysis.
CloudFront access logs, or CloudFront standard logs
First, CloudFront access logs, otherwise known as CloudFront standard logs, are the lowest-cost option for analyzing what’s historically happened in terms of user requests for your content.
The downside of CloudFront standard logs is that delivery is best-effort: log entries can occasionally be delayed or duplicated, Amazon does not guarantee the order in which requests appear, and files can take time to arrive in your bucket. These CloudFront access logs are delivered as tab-delimited text files (a W3C-style format with `#Version:` and `#Fields:` header lines), which can be difficult for many systems to parse for analysis.
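As a concrete illustration, here is a minimal Python sketch of parsing that format, using the `#Fields:` directive at the top of each file to name the columns. The sample log lines are hypothetical, but the field names follow CloudFront's standard log schema.

```python
# CloudFront standard logs are W3C-style: comment lines ("#Version:"
# and "#Fields:") followed by tab-separated records.
SAMPLE = """#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs-uri-stem sc-status cs(User-Agent)
2024-05-01\t00:00:12\tIAD89-C1\t12345\t203.0.113.10\tGET\t/index.html\t200\tMozilla/5.0
2024-05-01\t00:00:13\tIAD89-C1\t98765\t198.51.100.7\tGET\t/app.js\t304\tcurl/8.4.0
"""

def parse_standard_log(text):
    """Parse a CloudFront standard log into a list of field-name -> value dicts."""
    fields = []
    records = []
    for line in text.splitlines():
        if line.startswith("#Fields:"):
            # The directive names the columns for all following records.
            fields = line[len("#Fields:"):].split()
        elif line.startswith("#") or not line.strip():
            continue  # skip other directives and blank lines
        else:
            records.append(dict(zip(fields, line.split("\t"))))
    return records

records = parse_standard_log(SAMPLE)
print(records[0]["cs-uri-stem"], records[0]["sc-status"])  # → /index.html 200
```

In practice the files arrive gzip-compressed in S3, so you would wrap this with `gzip.open` and an S3 download; the parsing itself stays the same.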
CloudFront real-time logs
CloudFront can also generate real-time logs, which give you information about user requests to a distribution within seconds, in chronological order. Unlike standard logs, real-time logs let you define a sampling rate (the percentage of requests for which you want log records) and choose exactly which fields to receive.
Real-time log records are delivered to an Amazon Kinesis data stream, where you can do real-time analysis, or forwarded to Amazon Kinesis Data Firehose, which can batch, transform (for example, into JSON), and deliver them to other destinations. The downside of CloudFront real-time logs is that they are more costly to use than standard logs.
Read: An Overview of Streaming Analytics in AWS for Logging Applications.
CloudFront edge function logging
With CloudFront, you can also collect and analyze logs from edge functions (CloudFront Functions and Lambda@Edge) running in edge locations worldwide. Edge locations are smaller data centers placed to get content even closer to the user and further reduce latency. Edge function logging can help you make your analysis even more granular across these sub-regions.
Administrative logging via CloudTrail
Finally, you can analyze the administrative logs generated when account owners and administrators make changes to CloudFront itself. For example, an administrator may change content, update configuration settings, increase capacity, and more. Administrative actions are recorded by AWS CloudTrail and can be analyzed to understand how administrators are using your system at any given time. For example, if a malicious user gains access to administrative credentials, you may see suspicious activity reflected in these logs.
Challenges of Analyzing CloudFront Cloud Logs
CloudFront log files can be hard to analyze and may require significant data transformation and movement before they yield useful insights. As mentioned above, CloudFront standard logs are delivered in a tab-delimited text format, and this data typically must be moved and transformed into JSON before it can be used with typical analytics tools such as Grafana, Elasticsearch, or OpenSearch.
Even Amazon's own documentation recommends, for deep insights into CloudFront logs, sending your logs from S3 -> Lambda -> Kinesis Firehose -> S3 -> Partitioning Lambda -> Athena.
This solution, even though it avoids running a database environment yourself, is still complex and expensive to maintain. Amazon Athena charges $5 per TB scanned, which can quickly grow cost-prohibitive when you want insights over months or years. A company that generates 1 TB of CloudFront logs per day would pay $150 for a single query that scans one month of its log data.
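The arithmetic behind that figure is straightforward, assuming an unpartitioned query that scans the full month of data:

```python
ATHENA_PRICE_PER_TB = 5.00   # USD per TB scanned (Athena's published rate)
logs_per_day_tb = 1.0        # 1 TB of CloudFront logs generated per day
days = 30                    # one month of retained logs

data_scanned_tb = logs_per_day_tb * days              # 30 TB scanned per query
cost_per_query = data_scanned_tb * ATHENA_PRICE_PER_TB
print(cost_per_query)  # → 150.0
```

Partitioning the data (the "Partitioning Lambda" step above) can reduce the scanned volume per query, but only for queries that filter on the partition keys; broad historical questions still pay for the full scan.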
Another major limitation of Amazon Athena is the inability to run text search queries across your entire dataset. With Athena you are limited to SQL-style queries, even as APIs like the Elasticsearch API have become the de facto standard for log search and analytics.
With various log files in different formats, it can also be extremely difficult to analyze and overlay different CloudFront log data types to gain deeper insights into network and administrative activity. For example, you may want to overlay administrative logs with Edge function logs to understand activity in a specific region and update administrative settings accordingly.
Finally, one of the biggest logging mistakes companies make is not retaining their logs long enough to troubleshoot effectively and generate useful insights. While pairing CloudFront real-time logs with a streaming data service like Amazon Kinesis Data Firehose is useful for understanding activity in the moment, it doesn't help with troubleshooting or investigating deeper issues that develop over time.
Gaining Insights from Cloud Logging without Data Movement
Cloud logging is inherently challenging due to the sheer volume of logs and the variety of log file formats. These issues add to the overall management complexity of tools like CloudFront and lead to longer time to insight. However, a solution like ChaosSearch can help you keep up with logging best practices without the complexity of ETL pipelines.
With ChaosSearch, you can keep all of your CloudFront logs (regardless of format) in Amazon S3 and analyze them on demand without costly data transformation. You can even overlay different CloudFront log types (e.g. real-time logs and standard logs, or administrative logs and Edge function logs) to do deeper analysis. For example, you can identify response times from the Edge for both cache hits and misses, customer usage patterns, and even malicious user traffic by analyzing source IP addresses and user agent tracking.
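Once log records are queryable, analyses like these reduce to simple aggregations. As a hedged illustration (the field names follow CloudFront's standard log schema; the records themselves are hypothetical), here is how a cache hit ratio and top user agents might be computed:

```python
from collections import Counter

# Hypothetical parsed log records, keyed by CloudFront standard-log field names.
records = [
    {"x-edge-result-type": "Hit",  "cs(User-Agent)": "Mozilla/5.0", "c-ip": "203.0.113.10"},
    {"x-edge-result-type": "Miss", "cs(User-Agent)": "curl/8.4.0",  "c-ip": "198.51.100.7"},
    {"x-edge-result-type": "Hit",  "cs(User-Agent)": "Mozilla/5.0", "c-ip": "203.0.113.10"},
    {"x-edge-result-type": "Hit",  "cs(User-Agent)": "BadBot/1.0",  "c-ip": "192.0.2.99"},
]

# Cache hit ratio: count edge results served from cache vs. everything else.
hits = sum(r["x-edge-result-type"] in ("Hit", "RefreshHit") for r in records)
hit_ratio = hits / len(records)

# Top user agents, e.g. to spot bot traffic by agent string.
top_agents = Counter(r["cs(User-Agent)"] for r in records).most_common(2)

print(hit_ratio)   # → 0.75
print(top_agents)  # "Mozilla/5.0" is the most common agent here
```

The same counting logic extends naturally to grouping by `c-ip` for malicious-traffic analysis or by edge location for per-region response-time breakdowns.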
ChaosSearch lets you use familiar tools for analysis, such as OpenSearch Dashboards, Grafana, and more. You can query your whole dataset at any time, without worrying about partitioning.
Going from raw data within an Amazon S3 bucket to deep insights can happen within minutes. No need to spend time cobbling together multiple different database solutions in order to get answers to your questions. Simply point ChaosSearch at your Amazon S3 buckets and index your data without ever having to move your data out of Amazon S3.
Reach out today and learn more about how you can get quick answers to all your CloudFront questions without ever having to move your data out of your Amazon S3 buckets.
Check how ChaosSearch can analyze CloudFront logs directly from Amazon S3.