Solution Brief

Rein in the Growing Cost of Elasticsearch

The Challenge: Storing months and years of log data in Elasticsearch is cost-prohibitive

Long-term retention of structured log and event data was not one of the original design goals of Lucene, the storage technology underlying Elasticsearch. Its early use cases centered on searching unstructured data, such as email and files, to find specific documents that match a query. With unstructured data, the Lucene index typically ends up much smaller than the source data. With log and event data, however, Lucene indexes can be larger than the source data they index. The flood of log and event data forces operators to keep adding Elasticsearch servers, and expensive disk on each one, in order to scale.

Running Elasticsearch on Amazon EBS can be challenging to scale

Elasticsearch documentation still recommends avoiding cheaper storage options such as Amazon EBS, and suggests EBS only for small clusters (a couple of servers). Instead, the official documentation recommends Amazon instance storage (such as local NVMe disk). This gives you millisecond-speed answers to your questions, but at an incredible cost. Operators who follow this recommendation must keep provisioning additional servers to keep up with the cluster's storage requirements. The alternative is to spend the time designing a Hot/Warm Elasticsearch cluster and creating lifecycle tasks that migrate indexes to the warm servers.
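To give a sense of what those lifecycle tasks involve, here is a minimal sketch of a hot/warm policy written as a Python dictionary. It assumes Elasticsearch index lifecycle management (ILM) is available and that warm servers are tagged with a custom node attribute. The attribute name `box_type`, the function name, and every threshold below are illustrative assumptions, not values from this brief.

```python
import json


def hot_warm_policy(rollover_age="7d", rollover_size="50gb",
                    warm_after="7d", delete_after="90d",
                    warm_attr="box_type"):
    """Build an ILM-style policy body (all defaults are hypothetical)."""
    return {
        "policy": {
            "phases": {
                # Hot phase: roll over to a new index by age or size.
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": rollover_age,
                            "max_size": rollover_size,
                        }
                    }
                },
                # Warm phase: require shards to live on nodes tagged as warm.
                "warm": {
                    "min_age": warm_after,
                    "actions": {
                        "allocate": {"require": {warm_attr: "warm"}}
                    },
                },
                # Delete phase: drop the index once it is too old to keep.
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


policy = hot_warm_policy()
print(json.dumps(policy, indent=2))
```

A body like this would typically be sent to the cluster's ILM policy endpoint. Even in this small form, the operator still has to choose the thresholds, tag the warm servers, and size both tiers.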

To keep up with the flood of log and event data, operators must keep adding Elasticsearch servers and expensive disk to each system.

Offloading Indexes to S3 can slow down your time to answers

To avoid building a complicated Hot/Warm Elasticsearch cluster, many operators leverage the built-in AWS plugins for Elasticsearch to snapshot their indexes directly to Amazon S3 and age old data out of the cluster. This approach keeps the data available in your own Amazon S3 account and ensures the hot Elasticsearch cluster does not get overloaded with the massive volume of log and event data.

However, this approach hides some complexity. Indexes sent to Amazon S3 must first be restored to a running, hot Elasticsearch cluster before you can query them. If your existing Elasticsearch cluster does not have enough free disk space for the restore, you need to provision additional servers to complete it. This is a time-consuming and expensive process that drastically increases the time required to gain value from your data.
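The disk-space problem can be made concrete with some rough arithmetic. The sketch below estimates how many extra servers a restore would need. It assumes the restored indexes take about as much disk as the snapshot, that each replica adds a full copy, and that only 85% of each new server's disk is usable (mirroring Elasticsearch's default low disk watermark). The function name and all sizes are hypothetical.

```python
import math


def extra_nodes_for_restore(snapshot_gb, free_gb, node_disk_gb,
                            replicas=1, usable_fraction=0.85):
    """Estimate how many servers must be added before a restore fits.

    snapshot_gb:     size of the indexes to restore (primary shards only)
    free_gb:         usable free disk already available across the cluster
    node_disk_gb:    raw disk on each server you could add
    replicas:        replica copies kept for each restored index
    usable_fraction: share of a new server's disk you are willing to fill
    """
    required_gb = snapshot_gb * (1 + replicas)
    shortfall_gb = max(0.0, required_gb - free_gb)
    usable_per_node_gb = node_disk_gb * usable_fraction
    return math.ceil(shortfall_gb / usable_per_node_gb)


# Restoring 2 TB of indexes with one replica into a cluster that has
# 500 GB free, using servers with 1 TB of disk each.
print(extra_nodes_for_restore(2000, 500, 1000))  # -> 5
```

Under these assumptions, answering a single question about old data means provisioning five servers first, then waiting for the restore to finish.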

A New Strategy with CHAOSSEARCH

Leverage the low cost of Amazon S3 for long-term log retention

A key benefit of the CHAOSSEARCH platform is the ability to read data sources on Amazon S3 and write the index in a highly compressed format to your S3 bucket. Once your data is indexed, CHAOSSEARCH has no need to return to the source data for analysis. Therefore, you can set up an Amazon S3 lifecycle policy to migrate your source data to Amazon Glacier for even more significant savings. This functionality enables you to take advantage of the low cost of Amazon S3, and maintain control of your data at the same time.
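As an illustration of such a lifecycle policy, the sketch below builds an S3 lifecycle configuration that transitions objects under a given prefix to the Glacier storage class after a set number of days. The prefix `raw-logs/`, the 30-day threshold, the rule ID, and the function names are assumptions made for the example. Applying the rule relies on the third-party `boto3` library and valid AWS credentials.

```python
def glacier_lifecycle(prefix, days, rule_id="archive-source-logs"):
    """Build a lifecycle configuration that archives objects under `prefix`."""
    return {
        "Rules": [
            {
                "ID": rule_id,
                # Only objects whose key starts with this prefix are affected.
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Send the configuration to S3 (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the builder runs without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=config,
    )


# Hypothetical values: archive raw source logs 30 days after creation.
config = glacier_lifecycle("raw-logs/", 30)
print(config)
```

Scope the prefix to the source data only. If the indexes live in the same bucket, they should fall outside the rule so they stay in standard S3 storage and remain queryable.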

Separate storage from compute and scale to petabytes

Don’t depend on expensive EBS, or even more costly NVMe instance store volumes, to scale your log and event data clusters. The CHAOSSEARCH platform uses your Amazon S3 as the only data store for log and event data. We’ve separated the storage of log data on Amazon S3 from the compute on Amazon EC2, which means we can independently scale the speed and latency of both data indexing and data query, no matter how large your data set is. CHAOSSEARCH makes it easy to scale your search and visualization analysis to petabytes of data.

Don’t wait for answers – store everything and ask anything

Use your Amazon S3 account to store and index all your log and event data without ever having to move the data or process it into a separate database. CHAOSSEARCH stores all of the indexes in your Amazon S3 account in a highly compressed state that remains fully searchable and queryable. Leave all your data fully indexed within your Amazon S3 bucket, and get quick answers to your questions. Don’t spend time and energy building Elasticsearch clusters just to restore your Lucene indexes before you can ask questions. Just leverage the CHAOSSEARCH platform to search, query, and visualize your data, all instantly, all without ever having to move your data.