ChaosSearch Blog - Tips for Wrestling Your Data Chaos

ChaosSearch Data Refinery: transform without reindexing

Written by Pete Cheslock | Aug 6, 2019

Traditional databases have a problem at ingest time. They are schema-on-write: data must fit a predefined schema as it is ingested into the database. That means you need to dig into your data ahead of time to understand what is there, and then process it to fit the defined schema. This data cleaning and processing can be time-consuming and costly, both in the engineering time to build it and in the compute it consumes.

For log data produced by their own software, many companies can adjust the code to clean the data and prepare it for indexing. But what about data coming from sources that you can't control? Various Amazon cloud services can generate vast amounts of log and event data and send these logs to your Amazon S3 buckets, but you can't modify the data before it lands in S3. You'd have to build Lambda functions or other post-processing tasks to parse and convert the data into an appropriate format after it's been written to your bucket.

The ultimate goal of CHAOSSEARCH is to help our customers get quick insights into their data without ever having to move their data out of Amazon S3. That's why I'm incredibly excited to announce the CHAOSSEARCH Data Refinery today. Now you can transform your already indexed data, creating new searchable fields and adjusting the objects' schema on the fly. You no longer need to extract and transform data (ETL) just to get better insights into your logs. Instead, you can apply these virtual transformations as different data views.

With CHAOSSEARCH, we are able to quickly create materialized views that effectively and accurately parse our various log formats.

Transeo

Legacy search technologies like Elasticsearch operate on a schema-on-write approach, and to change the schema for any data field, you would need to reindex that data. Add a new field or change a field type? Change your mapping, reindex your data. Find a mistake in your mapping? Change your mapping, reindex your data. Maybe not a big deal if you only have a few hundred GB of data total, but cost- and time-prohibitive when dealing with tens or hundreds of TBs of data.

With this new release of CHAOSSEARCH, not only can you transform a field into multiple new fields to query and aggregate, but you can also adjust the schema, changing fields from strings to integers so that you can search on ranges. An entirely new view into your data, created within seconds, all without ever having to spend time and money reindexing your data.
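To see why the string-to-integer change matters for range searches, here is a minimal Python sketch. It is not CHAOSSEARCH code, and the byte counts are made up for illustration. It only shows that comparing numbers stored as strings is alphabetical, so a "range" over them returns the wrong rows.

```python
# Hypothetical sent_bytes values as they look when inferred as strings.
raw_sent_bytes = ["80", "443", "8080", "9"]

# String comparison is lexicographic, so every value appears to fall
# between "100" and "9000", including 80 and 9.
as_strings = [v for v in raw_sent_bytes if "100" <= v <= "9000"]

# Once the values are integers, the same range behaves numerically.
as_ints = [v for v in map(int, raw_sent_bytes) if 100 <= v <= 9000]

print(as_strings)  # ['80', '443', '8080', '9']
print(as_ints)     # [443, 8080]
```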

One example we see from customers is with Amazon ELB logs. These logs have a unique format.

timestamp elb client:port backend:port request_processing_time backend_processing_time response_processing_time elb_status_code backend_status_code received_bytes sent_bytes "request" "user_agent" ssl_cipher ssl_protocol

The annoying part of these logs is that both the client and the backend IP addresses include the port within the same string. If you had already indexed this data, or didn't want to spend time parsing it in advance, you'd be unable to analyze it quickly.
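For a concrete picture of the format, the sketch below splits one log line into named fields with Python's standard shlex module. The sample line is made up for illustration, and the field names are simply copied from the format line above. This is the kind of parsing you would otherwise do by hand, not what CHAOSSEARCH does internally.

```python
import shlex

# Field names copied from the format line above.
FIELDS = [
    "timestamp", "elb", "client:port", "backend:port",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol",
]

# Made-up sample line, for illustration only.
SAMPLE = (
    '2019-08-06T12:00:00.000000Z my-elb 235.237.171.163:54321 10.0.0.12:80 '
    '0.000073 0.001048 0.000057 200 200 0 29 '
    '"GET http://example.com:80/ HTTP/1.1" "curl/7.38.0" - -'
)

def parse_elb_line(line):
    # shlex.split keeps the quoted request and user agent as single tokens.
    return dict(zip(FIELDS, shlex.split(line)))

record = parse_elb_line(SAMPLE)
print(record["client:port"])  # the IP and port are still one string
```

Even after this split, the client address and port arrive fused in a single value, which is the problem the rest of this post deals with.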

In this example, you can see the initial indexing process identified both the client and backend IP as strings. CHAOSSEARCH automatically infers a schema for your data to speed up your time to insights.

When I go into the CHAOSSEARCH embedded Kibana interface and search for an IP address...

client_ip : "235.237.171.163"

You can see no results:

We'd have to know the port as well to find it, OR we can add a wildcard to the end of the query to find it.

However, this isn't a great user experience. It also makes it nearly impossible to see the most and least frequent IP addresses, because the port number is part of every value being aggregated.

With the new CHAOSSEARCH Data Refinery, we can transform this field in seconds using a regular expression, and split this into a string field and a number field.

Now we have 2 new fields:

client_ip_address
client_port
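The transformation itself is configured in the Data Refinery, but the idea is easy to sketch in plain Python. The regular expression below is an assumption about what such a pattern could look like, not the one used in the product. It splits a combined value into a string field and an integer field named after the two new fields above.

```python
import re

# Assumed pattern: an IPv4 dotted quad, a colon, then a numeric port.
CLIENT_RE = re.compile(
    r"^(?P<client_ip_address>\d{1,3}(?:\.\d{1,3}){3}):(?P<client_port>\d+)$"
)

def split_client(value):
    """Split 'ip:port' into a string field and an integer field."""
    match = CLIENT_RE.match(value)
    if match is None:
        return None  # value does not look like ip:port
    return {
        "client_ip_address": match.group("client_ip_address"),
        "client_port": int(match.group("client_port")),
    }

print(split_client("235.237.171.163:54321"))
```

Casting the port to an integer is what makes range searches on it possible.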

Back in Kibana, I can immediately see a new view created that includes the new fields with the updated schema.

Now I can run a search JUST for the IP address I'm looking for — no wildcards required.

Most importantly, I can run aggregations on the new data fields that I've created and get quick answers to my questions, all without having to write code to reformat my data, and all without having to spend time and money reindexing my data. I can also use these virtual transformations to mask out PII or other identifiable customer data before I run various reports.
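As a rough illustration of both points, the Python sketch below counts requests per client with and without the port attached, then masks the address before reporting. The sample values and the mask_ip helper are hypothetical. Zeroing the last octet is just one possible masking choice, not a CHAOSSEARCH feature.

```python
from collections import Counter

# Made-up client values as they appear in the raw logs.
raw_clients = [
    "235.237.171.163:54321",
    "235.237.171.163:54322",
    "10.1.2.3:40000",
]

# With the port attached, every value is unique, so counts are useless.
by_raw = Counter(raw_clients)

# With the port split off, requests from the same IP group together.
by_ip = Counter(value.rsplit(":", 1)[0] for value in raw_clients)

def mask_ip(ip):
    # Hypothetical masking rule: zero out the last octet.
    return ".".join(ip.split(".")[:3] + ["0"])

print(by_raw.most_common(1)[0][1])  # 1
print(by_ip.most_common(1))         # [('235.237.171.163', 2)]
print(mask_ip("235.237.171.163"))   # 235.237.171.0
```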

CHAOSSEARCH is helping customers get immediate insights into their data without having to do anything in advance. Dump your data into Amazon S3, index it one time with CHAOSSEARCH, and transform and modify it in endless ways. Store everything. Ask anything.

Reach out for a trial and try this new feature today!