
CloudFront Logs in Amazon S3, Quicker than Ever

We have been seeing more and more companies come to CHAOSSEARCH for help dealing with the flood of logs generated by content delivery services such as Amazon CloudFront. In today’s world, where your customers can be anywhere on the globe, it’s important to make sure that your website’s application assets are as close to those users as possible.

Amazon makes it incredibly easy to enable logging for your specific distribution — and will automatically send your logs to an S3 bucket of your choosing. Unfortunately, in order to get any value out of your logs, you would need to ingest them into a separate database, like Elasticsearch or Redshift. Maybe you are trying to track and analyze your bandwidth per distribution. Or perhaps you are trying to identify bot traffic by analyzing your top user agent strings per endpoint.

It’s no wonder that users end up feeling like they are struggling to get anything useful out of their CloudFront logs. Even Amazon’s own documentation recommends getting deep insights into CloudFront logs by sending them from S3 -> Lambda -> Kinesis Firehose -> S3 -> Partitioning Lambda -> Athena.


This solution, even though it avoids running a database environment yourself, is still extremely complex and expensive to maintain. Amazon Athena charges $5 per TB scanned, which can easily grow cost prohibitive when you want insights spanning months or years. A company generating 1 TB of CloudFront logs per day accumulates roughly 30 TB per month, so a single query across one month of log data costs about 30 TB x $5 = $150, and that is per QUERY. Another huge limitation of Amazon Athena is the inability to run full-text search queries across all of your data. With Athena you are limited to SQL-style queries, even as the Elasticsearch API has grown to become the de facto standard for log search and analytics.


The power of the CHAOSSEARCH platform is that it simplifies search on the data already in your Amazon S3 buckets. We can remove a significant portion of the complexity for customers who are looking to get quick, usable insights into their CloudFront log data. In this post, I’ll cover how to set up CloudFront logging, how to create an AWS Lambda function that processes the log data into JSON, and how to start getting quick answers to your questions. All within minutes.

CloudFront Log Process

At a high level we’ll go through the following steps:

  1. Create a bucket for raw CloudFront logs
  2. Create a bucket for Lambda processed logs
  3. Create our Lambda and update IAM permissions
  4. Enable CloudFront logging on our distribution
  5. Index the processed logs with CHAOSSEARCH
  6. Start getting answers to our questions

The first step is to create a couple of buckets within your AWS account. I’m creating two buckets: one for my raw CloudFront logs and one for my post-processed log data.
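
If you prefer to script this step, here is a minimal sketch using the aws-sdk-s3 gem; the bucket names and region are assumptions that match the IAM policy later in this post:

require 'aws-sdk-s3'

# Create one bucket for raw CloudFront logs and one for the processed output.
# Bucket names are globally unique, so substitute your own.
s3 = Aws::S3::Client.new(region: 'us-east-1')
%w[pete-cloudfront-logs pete-cloudfront-processed].each do |bucket_name|
  s3.create_bucket(bucket: bucket_name)
end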


Next, I’m going to create a new Lambda function and have it create a new IAM role with basic Lambda permissions.
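
If you’d rather script that step too, a rough equivalent with the aws-sdk-lambda gem might look like the following; the function name, region, runtime, and role ARN are placeholders (the console flow above creates the role for you):

require 'aws-sdk-lambda'

lambda_client = Aws::Lambda::Client.new(region: 'us-east-1')

# The zip should contain lambda_function.rb with the handler shown below.
lambda_client.create_function(
  function_name: 'cloudfront-log-processor',                              # placeholder name
  runtime: 'ruby2.7',                                                     # pick a Ruby runtime your account supports
  role: 'arn:aws:iam::123456789012:role/cloudfront-log-processor-role',   # placeholder role ARN
  handler: 'lambda_function.lambda_handler',
  code: { zip_file: File.binread('function.zip') }
)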


You can use our example code for creating your Lambda function, available on our GitHub, or see the example below:

DISCLAIMER: This is a “product manager” level of code: it’ll get you where you gotta go, but it may not work for your specific use case. It should give you an idea of just how simple it is to process these logs into an easy-to-parse format like JSON.

require 'zlib'
require 'stringio'
require 'time'
require 'date'
require 'json'
require 'aws-sdk-s3'

# Field names for the standard (tab-separated) CloudFront access log format.
LOGLINE_SCHEMA = [
  'date', 'time', 'edge_location', 'sc_bytes', 'c_ip', 'cs_method', 'cs_host',
  'cs_uri_stem', 'sc_status', 'cs_referer', 'cs_user_agent', 'cs_uri_query',
  'cs_cookie', 'edge_result_type', 'edge_request_id', 'host_header',
  'cs_protocol', 'cs_bytes', 'time_taken', 'forwarded_for', 'ssl_protocol',
  'ssl_cipher', 'edge_response_result_type', 'cs_protocol_version',
  'fle_status', 'fle_encrypted_fields'
].freeze

# Gzip a string in memory so the processed log can be written back to S3 compressed.
def gzip(data)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio)
  gz.write(data)
  gz.close
  sio.string
end

def lambda_handler(event:, context:)
  # Pull the bucket and key of the newly written CloudFront log out of the S3 event.
  record = event['Records'].first
  filename = record['s3']['object']['key']
  source_bucket = record['s3']['bucket']['name']

  destination_bucket = ENV['DEST_BUCKET']
  aws_region = ENV['AWS_REGION']

  # CloudFront log keys look like DISTRIBUTIONID.YYYY-MM-DD-HH.unique-id.gz, so the
  # second dot-separated field gives us the log date to use as the output prefix.
  filedate = Date.parse(filename.split('.')[1]).to_s

  s3 = Aws::S3::Resource.new(region: aws_region)
  source_file = s3.bucket(source_bucket).object(filename)

  # Decompress the gzipped log and split it into individual lines.
  data = Zlib::GzipReader.new(source_file.get.body).read.split("\n")

  logfile = []

  data.each do |line|
    next if line.start_with?('#') # skip the CloudFront header/comment lines

    logline = {}
    line.split("\t").each_with_index do |line_value, idx|
      logline[LOGLINE_SCHEMA[idx]] = line_value
    end
    # Combine the separate date and time fields into a single ISO 8601 timestamp.
    logline['timestamp'] = Time.parse("#{logline['date']} #{logline['time']} UTC").iso8601
    logfile << logline.to_json
  end

  processed_filename = "#{File.basename(filename, '.gz')}-processed.json.gz"
  obj = s3.bucket(destination_bucket).object([filedate, processed_filename].join('/'))

  begin
    response = obj.put(body: gzip(logfile.join("\n")))
    puts "Successfully wrote #{processed_filename} with etag #{response.etag}"
  rescue Aws::S3::Errors::ServiceError => e
    puts e.message
  end
end

Now that my Lambda function is created, I simply need to set the environment variable for my destination bucket — and make sure that my Lambda is notified when new log files are written into my CloudFront log bucket.
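
Both pieces can also be scripted; here is a hedged sketch with the AWS SDK, where the function name, account ID, and region are placeholders carried over from the earlier examples:

require 'aws-sdk-lambda'
require 'aws-sdk-s3'

region = 'us-east-1'
lambda_client = Aws::Lambda::Client.new(region: region)

# Tell the Lambda where to write its processed output.
lambda_client.update_function_configuration(
  function_name: 'cloudfront-log-processor',
  environment: { variables: { 'DEST_BUCKET' => 'pete-cloudfront-processed' } }
)

# S3 will refuse the notification below unless it is allowed to invoke the function.
lambda_client.add_permission(
  function_name: 'cloudfront-log-processor',
  statement_id: 'allow-s3-invoke',
  action: 'lambda:InvokeFunction',
  principal: 's3.amazonaws.com',
  source_arn: 'arn:aws:s3:::pete-cloudfront-logs'
)

# Invoke the Lambda whenever CloudFront drops a new log file into the raw bucket.
Aws::S3::Client.new(region: region).put_bucket_notification_configuration(
  bucket: 'pete-cloudfront-logs',
  notification_configuration: {
    lambda_function_configurations: [
      {
        lambda_function_arn: 'arn:aws:lambda:us-east-1:123456789012:function:cloudfront-log-processor',
        events: ['s3:ObjectCreated:*']
      }
    ]
  }
)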


After that new IAM role is created for my Lambda, I’ll just want to add a new policy to the IAM permissions to make sure my Lambda has the ability to both read from and write to my two S3 buckets.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:Get*"
            ],
            "Resource": [
                "arn:aws:s3:::pete-cloudfront-logs",
                "arn:aws:s3:::pete-cloudfront-logs/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:Put*"
            ],
            "Resource": [
                "arn:aws:s3:::pete-cloudfront-processed",
                "arn:aws:s3:::pete-cloudfront-processed/*"
            ]
        }
    ]
}

Now that everything is set up, we can turn on CloudFront logging for the distribution and have those logs start flowing.
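
The console is the easiest place to flip this switch, but for completeness, here is a rough sketch of doing it with the aws-sdk-cloudfront gem; the distribution ID is a placeholder, and the get/update round trip with the ETag is how the CloudFront API expects configuration changes to be made:

require 'aws-sdk-cloudfront'

cloudfront = Aws::CloudFront::Client.new(region: 'us-east-1')
distribution_id = 'EDFDVBD6EXAMPLE'   # placeholder distribution ID

# Fetch the current distribution config along with its ETag.
resp = cloudfront.get_distribution_config(id: distribution_id)
config = resp.distribution_config.to_h

# Point standard logging at the raw log bucket created earlier.
config[:logging] = {
  enabled: true,
  include_cookies: false,
  bucket: 'pete-cloudfront-logs.s3.amazonaws.com',
  prefix: ''
}

cloudfront.update_distribution(
  id: distribution_id,
  distribution_config: config,
  if_match: resp.etag
)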


Assuming everything has been set up correctly, you should start to see logging events landing in your source bucket within about 5-10 minutes. Depending on the size of the logs, the Lambda function should process them and drop them into the destination bucket within seconds.

Raw logs land in my source bucket, and the processed files appear in my destination bucket shortly after (while not necessary for CHAOSSEARCH, I’ve had my Lambda drop the files into a prefix per day).

Since CHAOSSEARCH never needs to return to the source data, we can now enable an Amazon S3 lifecycle rule to purge files from our source bucket once they are more than a few days old. If I need to keep the data for compliance reasons, I can always set up a lifecycle transition (or S3 Intelligent-Tiering) to send those logs off to Glacier.
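
A minimal sketch of such a lifecycle rule with the aws-sdk-s3 gem might look like this (the bucket name and the seven-day window are assumptions):

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')

# Expire raw CloudFront logs a few days after the Lambda has processed them.
s3.put_bucket_lifecycle_configuration(
  bucket: 'pete-cloudfront-logs',
  lifecycle_configuration: {
    rules: [
      {
        id: 'expire-raw-cloudfront-logs',
        status: 'Enabled',
        filter: { prefix: '' },       # apply to every object in the bucket
        expiration: { days: 7 }       # purge raw logs after seven days
      }
    ]
  }
)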


Now I can navigate over to the CHAOSSEARCH platform and create an object group of all my processed CloudFront logs.


And since I have my Lambda function set up to continually drop new log files into my destination S3 bucket, I can enable CHAOSSEARCH to continually process data by leveraging SQS notifications for each PUT event in S3. This ensures that CHAOSSEARCH processes your log data in real time, making it available for search as it lands in your S3 buckets.
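
The CHAOSSEARCH side of this is configured in the platform, but the S3 side is a standard bucket notification. A sketch of pointing the processed bucket’s PUT events at an SQS queue (the queue ARN is a placeholder for whatever queue you register with CHAOSSEARCH) might look like this:

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')

# Send an SQS message for every processed log file the Lambda writes,
# so new data can be indexed as soon as it lands.
s3.put_bucket_notification_configuration(
  bucket: 'pete-cloudfront-processed',
  notification_configuration: {
    queue_configurations: [
      {
        queue_arn: 'arn:aws:sqs:us-east-1:123456789012:chaossearch-cloudfront-notifications',  # placeholder
        events: ['s3:ObjectCreated:Put']
      }
    ]
  }
)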


One of the best features of the CHAOSSEARCH platform is automated schema discovery of your log data, which means you don’t need to spend any time creating database schemas or index mappings. During the indexing process, we automatically identify strings, numbers, time values, and so on. And since CHAOSSEARCH leverages a Schema-on-Read architecture, you can adjust your schema at any time without EVER having to reindex your data.


Now we can dive into the fully integrated Kibana interface and get deep insights into our CloudFront log data. We can examine response times from the edge for both cache hits and misses, understand our customer usage patterns, and even identify potentially malicious traffic by analyzing source IP addresses and user agent strings.
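
As one concrete example of the kind of question you can now ask, here is a sketch of an Elasticsearch-style aggregation for the bot-hunting use case mentioned earlier: the top user agents hitting a single endpoint, using the field names emitted by the Lambda above. The endpoint URL is purely a placeholder for wherever your Elasticsearch-compatible search API lives, and the exact field naming will depend on how the index maps string fields:

require 'net/http'
require 'json'
require 'uri'

# Top 10 user agents hitting a single URI, as an Elasticsearch-style aggregation.
query = {
  size: 0,
  query: { term: { 'cs_uri_stem' => '/index.html' } },
  aggs: {
    top_user_agents: {
      terms: { field: 'cs_user_agent', size: 10 }
    }
  }
}

uri = URI('https://search.example.com/cloudfront-logs/_search')   # placeholder endpoint
request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
request.body = query.to_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts JSON.pretty_generate(JSON.parse(response.body))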


Going from raw data in an Amazon S3 bucket to deep insights can now happen within minutes. There’s no need to spend time cobbling together multiple database solutions to get answers to your questions. Simply point CHAOSSEARCH at your Amazon S3 buckets and index your data without ever having to move it out of Amazon S3.

Reach out today and learn more about how you can get quick answers to all your CloudFront questions without ever having to move your data out of your Amazon S3 buckets.

About the Author, Pete Cheslock

Pete Cheslock was the VP of Product for ChaosSearch, where he was brought on as one of the founding executives. In his role, Pete helped to define the go-to-market strategy and refine product direction for the initial ChaosSearch launch. To see what Pete’s up to now, connect with him on LinkedIn.