
CloudFront Logs in Amazon S3, Quicker than Ever

We have been seeing more and more companies come to CHAOSSEARCH for help dealing with the flood of logs generated by content delivery services such as Amazon CloudFront. In today’s world, where your customers can be anywhere on the globe, it’s important to make sure that your website’s application assets are as close to those users as possible.

Amazon makes it incredibly easy to enable logging for your specific distribution — and will automatically send your logs to an S3 bucket of your choosing. Unfortunately, in order to get any value out of your logs, you would need to ingest them into a separate database, like Elasticsearch or Redshift. Maybe you are trying to track and analyze your bandwidth per distribution. Or perhaps you are trying to identify bot traffic by analyzing your top user agent strings per endpoint.

It’s no wonder that users end up feeling like they are struggling to get anything useful out of their CloudFront logs. Even Amazon’s own documentation recommends getting deep insights into CloudFront logs by sending them from S3 -> Lambda -> Kinesis Firehose -> S3 -> Partitioning Lambda -> Athena.


This solution, even though it avoids running a database environment yourself, is still extremely complex and expensive to maintain. Amazon Athena charges $5 per TB scanned, which can easily grow cost prohibitive when you want insights spanning months or years. A company generating 1 TB of CloudFront logs per day accumulates roughly 30 TB per month, so a single query across one month of log data costs about 30 TB x $5 = $150, and that is per QUERY. Another huge limitation of Amazon Athena is the inability to run full-text search queries across all of your data. With Athena you are limited to SQL-style queries, even as the Elasticsearch API has grown to become the de facto standard for log search and analytics.


The power of the CHAOSSEARCH platform is that it simplifies search on the data already in your Amazon S3 buckets. We can remove a significant portion of the complexity for customers who are looking to get quick, usable insights into their CloudFront log data. In this post, I’ll cover how to set up CloudFront logging, how to create an AWS Lambda function that processes the log data into JSON, and how to start getting quick answers to your questions. All within minutes.

CloudFront Log Process

At a high level we’ll go through the following steps:

  1. Create a bucket for raw CloudFront logs
  2. Create a bucket for Lambda processed logs
  3. Create our Lambda and update IAM permissions
  4. Enable CloudFront logging on our distribution
  5. Index the processed logs with CHAOSSEARCH
  6. Start getting answers to our questions

The first step is to create a couple of buckets within your AWS account. I’m creating two buckets: one for my raw CloudFront logs and one for my post-processed log data.
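
If you prefer to script this step, here is a minimal sketch using the aws-sdk-s3 gem; the bucket names and region are assumptions that match the IAM policy later in this post:

require 'aws-sdk-s3'

# Create one bucket for raw CloudFront logs and one for the processed output.
# Bucket names are globally unique, so substitute your own.
s3 = Aws::S3::Client.new(region: 'us-east-1')
%w[pete-cloudfront-logs pete-cloudfront-processed].each do |bucket_name|
  s3.create_bucket(bucket: bucket_name)
end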


Next, I’m going to create a new Lambda function and have it create a new IAM role with basic Lambda permissions.
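
If you’d rather script that step too, a rough equivalent with the aws-sdk-lambda gem might look like the following; the function name, region, runtime, and role ARN are placeholders (the console flow above creates the role for you):

require 'aws-sdk-lambda'

lambda_client = Aws::Lambda::Client.new(region: 'us-east-1')

# The zip should contain lambda_function.rb with the handler shown below.
lambda_client.create_function(
  function_name: 'cloudfront-log-processor',                              # placeholder name
  runtime: 'ruby2.7',                                                     # pick a Ruby runtime your account supports
  role: 'arn:aws:iam::123456789012:role/cloudfront-log-processor-role',   # placeholder role ARN
  handler: 'lambda_function.lambda_handler',
  code: { zip_file: File.binread('function.zip') }
)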


You can use our example code for creating your Lambda function, available on our GitHub, or see the example below:

DISCLAIMER: This is a “product manager” level of code: it’ll get you where you gotta go, but it may not work for your specific use case. It should give you an idea of just how simple it is to process these logs into an easy-to-parse format like JSON.

require 'zlib'
require 'stringio'
require 'time'
require 'date'
require 'json'
require 'aws-sdk-s3'

# Field names for the standard (tab-separated) CloudFront access log format.
LOGLINE_SCHEMA = [
  'date', 'time', 'edge_location', 'sc_bytes', 'c_ip', 'cs_method', 'cs_host',
  'cs_uri_stem', 'sc_status', 'cs_referer', 'cs_user_agent', 'cs_uri_query',
  'cs_cookie', 'edge_result_type', 'edge_request_id', 'host_header',
  'cs_protocol', 'cs_bytes', 'time_taken', 'forwarded_for', 'ssl_protocol',
  'ssl_cipher', 'edge_response_result_type', 'cs_protocol_version',
  'fle_status', 'fle_encrypted_fields'
].freeze

# Gzip a string in memory so the processed log can be written back to S3 compressed.
def gzip(data)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio)
  gz.write(data)
  gz.close
  sio.string
end

def lambda_handler(event:, context:)
  # Pull the bucket and key of the newly written CloudFront log out of the S3 event.
  record = event['Records'].first
  filename = record['s3']['object']['key']
  source_bucket = record['s3']['bucket']['name']

  destination_bucket = ENV['DEST_BUCKET']
  aws_region = ENV['AWS_REGION']

  # CloudFront log keys look like DISTRIBUTIONID.YYYY-MM-DD-HH.unique-id.gz, so the
  # second dot-separated field gives us the log date to use as the output prefix.
  filedate = Date.parse(filename.split('.')[1]).to_s

  s3 = Aws::S3::Resource.new(region: aws_region)
  source_file = s3.bucket(source_bucket).object(filename)

  # Decompress the gzipped log and split it into individual lines.
  data = Zlib::GzipReader.new(source_file.get.body).read.split("\n")

  logfile = []

  data.each do |line|
    next if line.start_with?('#') # skip the CloudFront header/comment lines

    logline = {}
    line.split("\t").each_with_index do |line_value, idx|
      logline[LOGLINE_SCHEMA[idx]] = line_value
    end
    # Combine the separate date and time fields into a single ISO 8601 timestamp.
    logline['timestamp'] = Time.parse("#{logline['date']} #{logline['time']} UTC").iso8601
    logfile << logline.to_json
  end

  processed_filename = "#{File.basename(filename, '.gz')}-processed.json.gz"
  obj = s3.bucket(destination_bucket).object([filedate, processed_filename].join('/'))

  begin
    response = obj.put(body: gzip(logfile.join("\n")))
    puts "Successfully wrote #{processed_filename} with etag #{response.etag}"
  rescue Aws::S3::Errors::ServiceError => e
    puts e.message
  end
end

Now that my Lambda function is created, I simply need to set the environment variable for my destination bucket — and make sure that my Lambda is notified when new log files are written into my CloudFront log bucket.
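
Both pieces can also be scripted; here is a hedged sketch with the AWS SDK, where the function name, account ID, and region are placeholders carried over from the earlier examples:

require 'aws-sdk-lambda'
require 'aws-sdk-s3'

region = 'us-east-1'
lambda_client = Aws::Lambda::Client.new(region: region)

# Tell the Lambda where to write its processed output.
lambda_client.update_function_configuration(
  function_name: 'cloudfront-log-processor',
  environment: { variables: { 'DEST_BUCKET' => 'pete-cloudfront-processed' } }
)

# S3 will refuse the notification below unless it is allowed to invoke the function.
lambda_client.add_permission(
  function_name: 'cloudfront-log-processor',
  statement_id: 'allow-s3-invoke',
  action: 'lambda:InvokeFunction',
  principal: 's3.amazonaws.com',
  source_arn: 'arn:aws:s3:::pete-cloudfront-logs'
)

# Invoke the Lambda whenever CloudFront drops a new log file into the raw bucket.
Aws::S3::Client.new(region: region).put_bucket_notification_configuration(
  bucket: 'pete-cloudfront-logs',
  notification_configuration: {
    lambda_function_configurations: [
      {
        lambda_function_arn: 'arn:aws:lambda:us-east-1:123456789012:function:cloudfront-log-processor',
        events: ['s3:ObjectCreated:*']
      }
    ]
  }
)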


After that new IAM role is created for my Lambda, I’ll just want to add a new policy to the IAM permissions to make sure my Lambda has the ability to both read from and write to my two S3 buckets.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:Get*"
            ],
            "Resource": [
                "arn:aws:s3:::pete-cloudfront-logs",
                "arn:aws:s3:::pete-cloudfront-logs/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:Put*"
            ],
            "Resource": [
                "arn:aws:s3:::pete-cloudfront-processed",
                "arn:aws:s3:::pete-cloudfront-processed/*"
            ]
        }
    ]
}

Now that everything is set up, we can turn on CloudFront logging for the distribution and have those logs start flowing.
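
The console is the easiest place to flip this switch, but for completeness, here is a rough sketch of doing it with the aws-sdk-cloudfront gem; the distribution ID is a placeholder, and the get/update round trip with the ETag is how the CloudFront API expects configuration changes to be made:

require 'aws-sdk-cloudfront'

cloudfront = Aws::CloudFront::Client.new(region: 'us-east-1')
distribution_id = 'EDFDVBD6EXAMPLE'   # placeholder distribution ID

# Fetch the current distribution config along with its ETag.
resp = cloudfront.get_distribution_config(id: distribution_id)
config = resp.distribution_config.to_h

# Point standard logging at the raw log bucket created earlier.
config[:logging] = {
  enabled: true,
  include_cookies: false,
  bucket: 'pete-cloudfront-logs.s3.amazonaws.com',
  prefix: ''
}

cloudfront.update_distribution(
  id: distribution_id,
  distribution_config: config,
  if_match: resp.etag
)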


Assuming everything has been set up correctly, you should start to see logging events landing in your source bucket within about 5-10 minutes. Depending on the size of the logs, the Lambda function should process them and drop them into the destination bucket within seconds.

Raw logs land in my source bucket, and the processed files appear in my destination bucket shortly after (while not necessary for CHAOSSEARCH, I’ve had my Lambda drop the files into a prefix per day).

Since CHAOSSEARCH never needs to return to the source data, we can now enable an Amazon S3 lifecycle rule to purge files from our source bucket once they are more than a few days old. If I need to keep the data for compliance reasons, I can always set up a lifecycle transition (or S3 Intelligent-Tiering) to send those logs off to Glacier.
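
A minimal sketch of such a lifecycle rule with the aws-sdk-s3 gem might look like this (the bucket name and the seven-day window are assumptions):

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')

# Expire raw CloudFront logs a few days after the Lambda has processed them.
s3.put_bucket_lifecycle_configuration(
  bucket: 'pete-cloudfront-logs',
  lifecycle_configuration: {
    rules: [
      {
        id: 'expire-raw-cloudfront-logs',
        status: 'Enabled',
        filter: { prefix: '' },       # apply to every object in the bucket
        expiration: { days: 7 }       # purge raw logs after seven days
      }
    ]
  }
)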


Now I can navigate over to the CHAOSSEARCH platform and create an object group of all my processed CloudFront logs.


And since I have my Lambda function set up to continually drop new log files into my destination S3 bucket, I can enable CHAOSSEARCH to continually process data by leveraging SQS notifications for each PUT event in S3. This ensures that CHAOSSEARCH processes your log data in real time, making it available for search as it lands in your S3 buckets.
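
The CHAOSSEARCH side of this is configured in the platform, but the S3 side is a standard bucket notification. A sketch of pointing the processed bucket’s PUT events at an SQS queue (the queue ARN is a placeholder for whatever queue you register with CHAOSSEARCH) might look like this:

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')

# Send an SQS message for every processed log file the Lambda writes,
# so new data can be indexed as soon as it lands.
s3.put_bucket_notification_configuration(
  bucket: 'pete-cloudfront-processed',
  notification_configuration: {
    queue_configurations: [
      {
        queue_arn: 'arn:aws:sqs:us-east-1:123456789012:chaossearch-cloudfront-notifications',  # placeholder
        events: ['s3:ObjectCreated:Put']
      }
    ]
  }
)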


One of the best features of the CHAOSSEARCH platform is automated schema discovery of your log data, which means you don’t need to spend any time creating database schemas or index mappings. During the indexing process, we automatically identify strings, numbers, time values, and so on. And since CHAOSSEARCH leverages a Schema-on-Read architecture, you can adjust your schema at any time without EVER having to reindex your data.


Now we can dive into the fully integrated Kibana interface and get deep insights into our CloudFront log data. We can examine response times from the edge for both cache hits and misses, understand our customer usage patterns, and even identify potentially malicious traffic by analyzing source IP addresses and user agent strings.
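
As one concrete example of the kind of question you can now ask, here is a sketch of an Elasticsearch-style aggregation for the bot-hunting use case mentioned earlier: the top user agents hitting a single endpoint, using the field names emitted by the Lambda above. The endpoint URL is purely a placeholder for wherever your Elasticsearch-compatible search API lives, and the exact field naming will depend on how the index maps string fields:

require 'net/http'
require 'json'
require 'uri'

# Top 10 user agents hitting a single URI, as an Elasticsearch-style aggregation.
query = {
  size: 0,
  query: { term: { 'cs_uri_stem' => '/index.html' } },
  aggs: {
    top_user_agents: {
      terms: { field: 'cs_user_agent', size: 10 }
    }
  }
}

uri = URI('https://search.example.com/cloudfront-logs/_search')   # placeholder endpoint
request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
request.body = query.to_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts JSON.pretty_generate(JSON.parse(response.body))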


Going from raw data in an Amazon S3 bucket to deep insights can now happen within minutes. There’s no need to spend time cobbling together multiple database solutions to get answers to your questions. Simply point CHAOSSEARCH at your Amazon S3 buckets and index your data without ever having to move it out of Amazon S3.

Reach out today and learn more about how you can get quick answers to all your CloudFront questions without ever having to move your data out of your Amazon S3 buckets.

About the Author, Pete Cheslock

Pete Cheslock was the VP of Product for ChaosSearch, where he was brought on as one of the founding executives. In his role, Pete helped to define the go-to-market strategy and refine product direction for the initial ChaosSearch launch. To see what Pete’s up to now, connect with him on LinkedIn.