CHAOSSEARCH
Start Free Trial Request Demo

Get to the Bucket – Part 3

Intro

I don’t want to manage an Elasticsearch cluster. It was one of the main reasons I joined CHAOSSEARCH. To be rid of the accidental complexity of ES, and help others do the same. But my job is as an SRE, and logs will be created, and need to be searched, and they need to be stored somewhere. So let’s pick apart getting CloudWatch logs into S3 so it can be indexed by CHAOSSEARCH, and try to avoid resorting to a fat client like Logstash, Fluentd, or Cloudwatchbeat.

Part three in a multi part series.

Find Part 1 and Part 2 here.

Part Three: The Lambda-ing

In part two I talked about how we avoid running servers as much as possible. In lieu of a Nagios / Sensu / Monitoring SaaS, we use Lambdas fired from CloudWatch logs events to help monitor our environment. But how do we get the log data from our Lambdas into S3?

We asked our good friend Tom Mclaughlin over at ServerlessOps what the right way to deal with Lambda logs are, and he pointed at these (in my humble opinion) heavyweight serverless libraries. The engineer side of my brain is sure he’s right, but “ain’t nobody got time for” learning a new library, so we defaulted to some janky S3 PUTs, some CloudWatch logs, and some Python.

What we started with

We found these cute little “save JSON object to S3” snippets over on Stack Overflow. It does zero error checking, but it sure is compact isn’t it?

s3 = boto3.resource("s3").Bucket(bucket_name)
json.load_s3 = lambda f: json.load(s3.Object(key=f).get()["Body"])
json.dump_s3 = lambda obj, f: s3.Object(key=f).put(Body=json.dumps(obj)))

If we get the IAM rules for the Lambda S3 access correct, we can create a log object during execution and write it to S3 at the end of execution. We tricked ourselves into thinking that some clever error catching would leave us with a stable log stream. We weren’t completely wrong; we had trouble breaking the log stream, but a whole host of things could happen that would leave us blind, so we abandoned this approach.

Abandoning is kind of a lie — we didn’t totally abandon S3 for our Lambdas. Several of our monitoring Lambdas write to S3 for long term trending analysis, which seems to work wonderfully, and lets us avoid looking at the logs directly unless something drastically bad happens with the monitoring code. We could have saved stats in Dynamo, but we are an S3 full text search product, so dumping everything into S3 makes sense for us.

CloudWatch, you’re wonderful, but you broke my heart

Lambdas dump all their text output into a CloudWatch stream. Reading CloudWatch logs is the way I was introduced to debugging Lambdas. It works, but it’s not exactly user-friendly. Log streams can be parallelized, with multiple streams receiving events at the same time. There is minimal structure to the logs, and it can be tricky to find what you are looking for. Searching across streams is basically a non-starter. If it was your only tool, you’d make it work, but it’s not pleasant.

CloudWatch logs are ugly!

Let’s pick apart a CloudWatch log for a second. This is a run of one of the monitoring Lambdas.

12:17:12 START RequestId: 9357f08c-8dfe-4652-b5c9-d1cbf8e2b07d Version: $LATEST
12:17:15 Starting
12:17:15 (2019-03-01_121715) OK: _Daily_2019-03-01_ - Hits: 500
12:17:15 END RequestId: 9357f08c-8dfe-4652-b5c9-d1cbf8e2b07d
12:17:15 REPORT RequestId: 9357f08c-8dfe-4652-b5c9-d1cbf8e2b07d Duration: 3152.19 ms    Billed Duration: 3200 ms Memory Size: 128 MB    Max Memory Used: 55 MB

Look at this hot garbage! Don’t get me wrong, there is a ton of good information in here, but unless you have some spectacular regex skills, you’re going to have trouble getting all the rows associated with a single request. Did you notice the lack of a date in the timestamp? Is that AM or PM? UTC or local? It’s hard to work with, but it’s data, and data is valuable, so we’re going to try to make use of it.

PROTIP:

The context object passed into each executed Lambda includes the aws_request_id. You may find it helpful to prefix all lines of output of your Lambda with this request id. In a better world, the Lambda’s output would coalesce into a single structured object for easy searching. In fact, several applications, like Cloudwatchbeat and Logstash can do this for you, but CloudWatch doesn’t do it natively, Logstash is fat, and being able to grep on a request id when you’ve downloaded the logs is really nice.

05:05:40 START RequestId: 2d4fd0ec-cae3-4318-b4b0-a705e3abd828 Version: $LATEST
05:05:40 2d4fd0ec-cae3-4318-b4b0-a705e3abd828: Starting
05:05:40 2d4fd0ec-cae3-4318-b4b0-a705e3abd828: OK: _Daily_2019-03-02_ - Hits: 500
05:05:48 END RequestId: 2d4fd0ec-cae3-4318-b4b0-a705e3abd828
05:05:48 REPORT RequestId: 2d4fd0ec-cae3-4318-b4b0-a705e3abd828 Duration: 3834.61 ms    Billed Duration: 3400 ms Memory Size: 128 MB    Max Memory Used: 58 MB

Full text search is fast with CHAOSSEARCH. Even with wildcards as a prefix and suffix, it doesn’t really matter where in the message the request id is. message:”*2d4fd0ec-cae3-4318-b4b0-a705e3abd828*” would be pretty brutal on ES, but is totally reasonable for CHAOSSEARCH (or if you’re grepping a local file).

  • No hate mail — protip was used ironically.

So let’s ship them…

We talked about the format of the logs, and how to add some data to the logs to make them easier to work with. How do we get it out of that rotten CloudWatch logs interface?

How about a script to do a one-off export of all the CloudWatch logs in a region to a bucket? I’m going to leave the bucket creation and permissions documentation to the AWS Docs on CloudWatch s3ExportTasks. It’s worth noting you can only have one export in progress at a time, so the code has to wait for the previous export to finish before it starts the next export. This code exports the data to a bucket called cs-test-cloudwatch. Modify that to suit your needs.

The code:

  • List all the logGroups in your account
  • Create an export task for the events in the previous hour. If you ran this script at 9:15 AM, it would export the logs from 8AM to 9AM
  • Waits for the export task to finish
  • Reports on what data it shipped
#!/usr/bin/env python3

import boto3
import time
from datetime import datetime

region_name = "us-east-1"
destination = "cs-test-cloudwatch"
shipped = []

def is_task_done(context, task_id):
    log_client = boto3.client("logs", region_name=region_name)

    task_list_return = log_client.describe_export_tasks(taskId=task_id)

    if len(task_list_return['exportTasks']) == 0:
        print("Didn't find any tasks running")
        return True

    for task in task_list_return['exportTasks']:
        print("{} is {}".format(task['taskId'], task['status']['code']))

        if task['taskId'] == task_id and task['status']['code'] == 'COMPLETED':
            return True

time_now = datetime.now()
last_hour_start = int(time_now.replace(second=0, microsecond=0, minute=0, hour=time_now.hour - 1).timestamp() * 1000)
last_hour_end = int(time_now.replace(second=0, microsecond=0, minute=0).timestamp() * 1000)

print("Now {}\n\tFrom\t{}\n\tTo\t{}".format(time_now, last_hour_start, last_hour_end))

log_client = boto3.client("logs", region_name=region_name)
log_groups = log_client.describe_log_groups()

for log_group_obj in log_groups['logGroups']:
    log_group_name = log_group_obj['logGroupName']
    shipped.append(log_group_name)

    print("log_group: {}".format(log_group_name))
    task_name =  "export_{}".format(log_group_name.split('/')[-1])

    response = log_client.create_export_task(
        taskName=task_name,
        logGroupName=log_group_name,
        fromTime=last_hour_start,
        to=last_hour_end,
        destination=destination,
        destinationPrefix="{}".format(log_group_name.split('/')[-1])
    )

    task_id = response['taskId']
    print("Created TaskID: {}".format(task_id))

    print("Sleeping for a second to make sure the task id is registered")
    time.sleep( 1 )

    while True:`       if is_task_done(context, task_id):
            print("Task done: {} - {}".format(task_id, task_name))
            break

        print("Sleeping on task {} - {}".format(task_id, task_name))
        time.sleep( 1 )

print("Successful Run - shipped {}".format(shipped))

Next steps

How long did that script take for you to run? For my use case, there are only a few hundred Lambda invocations per hour. The export takes ~30 seconds. That is plenty short enough to run in a Lambda. It would probably stay in the free tier. But we have some limitations to fix.

Known bugs

  • Code doesn’t work right if run in the first hour of the day UTC
  • Code will error out if an existing task is running when it starts
  • It’s possible for export tasks to hang, this code has no timeouts / cancels for stuck exports

Indexing the data in CHAOSSEARCH

I’ve been avoiding talking about the core CHAOSSEARCH product. Most of the logs in this series have already been in JSON, which is easy to index by default. However, this CloudWatch data comes in such a terrible log format that we should show you the tricks to index it into something sane.

A snippet of the logs shows that it includes a much more reasonable timestamp than you get in the CloudWatch logs dashboard, but alternating blank lines for no reason. It also drops files in S3 named aws-logs-write-test.

We’re going to use a regex for the key name to only select *.gz files. We then use a regex on the log format to ignore the whitespace garbage, and to split the lines into timestamp and message field.

  • Select Filter: .*gz
  • Data format: Log
  • Compression: gzip
  • Regex: (?<timestamp>.*?) (?<message>[^\\n]*)[\\n]*\\n

That’s it. Nothing too fancy, but super effective.

When we took this screenshot, we hadn’t taken our advice to prefix the RequestID in each log line yet, but this was still a good start.

Wrapping up

This is the temporary end of the 3 part series on shipping data to S3. It was not exhaustive, but it should give you some ideas on getting data to big cheap storage. Rumor has it, someone’s working on a clever way to ship CloudWatch thru Kinesis before it hits S3, maybe that’s what you can read about next.

What’s your favorite way to ship files to S3? Hit me up: @platformpatrick.