ChaosSearch Blog - Tips for Wrestling Your Data Chaos

What is ChaosSearch?

Written by Thomas Hazel | Jul 31, 2018

A Completely New Approach

ChaosSearch is a fully managed service for searching, finding, introspecting, and interrogating your historical log and event data. At a high level, we’ve extended the Elasticsearch API and Kibana onto Amazon S3 and Google Cloud Storage (GCS), cost-effectively opening up access to search, query, and visualize terabytes of log and event data.

All you need to do is gather data from any source (your clients, such as mobile, IoT, browsers, and applications; vendored software; or your own code) and dump it directly into your S3 or GCS account. ChaosSearch is a new approach with innovative technology that manages the tsunami of log and event data being generated today and unlocks the value of its analysis.

First and foremost, ChaosSearch is a search and analytics service for detecting and inspecting historical log and event data over days, weeks, months, and years, directly in your own cloud object storage. This service is fundamentally unlike anything you have seen before, and it has several important advantages and differences...

“By 2020, the digital universe will reach 44 zettabytes” — IDC

There is nothing available on the market quite like ChaosSearch. Here is how our approach, feature set, and value compare to traditional logging solutions.

 

An Explosion in Log and Event Data

The move to the cloud is happening at warp speed. Whether you’re building a cloud-native application on AWS or you’re a software-driven business moving applications to the cloud, you’re probably seeing an explosion in log and event data. SaaS businesses generate hundreds of gigabytes of log and event data daily, with many generating terabytes. But if you’re running a logging solution based on Elasticsearch, cost and data access can be major issues. Most businesses archive or delete log and event data after a few days or weeks because it’s cost-prohibitive to retain with existing Elasticsearch-based log management solutions.

ChaosSearch has extended the Elasticsearch API and Kibana to run directly on S3 and GCS, allowing you to separate hot and warm data stores and reduce the cost and complexity of your long-term, warm Elasticsearch stack. It’s a completely new approach that cost-effectively opens up access to a long tail of log and event data.

 

Based on Cloud Object Storage

The growth of object storage has surprised everyone, and object storage is now the preferred platform for collecting, analyzing, and archiving today’s growing mountain of diverse and disjointed data. With its simple, elastic, and cost-efficient characteristics, object storage is fostering new trends in data lake architectures and business intelligence. In a nutshell, cloud object storage provides:

  • Limitless scalability — proven scale-out architecture

  • Efficient — high data durability protects data

  • Software-based — deploy on standard servers or cloud-based resources

  • Metadata tagging — Embedded infrastructure for categorizing information

The first step in scaling historical log and event analytics is deciding where and how such data will be stored for analysis. Businesses spend a lot of time and effort figuring out how to reduce the cost of today’s logging solutions, mostly by reducing the amount of data indexed and stored. Most businesses delete data after 7 days, or simply archive it into S3/GCS for future analytics. The problem is that you then need to wake the data up and move it back into other systems for further analysis.

At ChaosSearch we wanted to address this cost, complexity, and scale problem, so we created an architecture that decouples storage from compute, and we turned S3/GCS into a warm, searchable analytics data lake. Cloud object storage is transformed from a basic dumping ground for data into a live, searchable log management solution. By leveraging S3/GCS cost economics and ease of use, ChaosSearch turns “your” cloud object storage into an analytics data lake that can cost-effectively provide “live” access to terabytes and petabytes of data over months and years. No data movement. No data duplication. No transformation.

 

A New Indexing Paradigm

Data size equals data cost. The foundation for ChaosSearch is the Chaos Index®, a new, patented indexing technology. The Chaos Index features the UltraHot® universal data format, which reduces the size of information beyond what has ever been achieved while still fully indexing it. The premise behind the Chaos Index is to store and compress data to its theoretical minimum. Data indexed by Chaos Index is compressed by up to 20X — and sometimes more. And it supports both text search and relational queries on the same data footprint.

Chaos Index is a representation of dimensional locality via infinite matrices. One can view such matrices as a locality of reference, where symbols and references are optimized for both compression and cataloging. This representation allows for greater compression ratios than standard tools using the same compression algorithms. For example, Chaos Index using the Burrows-Wheeler algorithm achieves better compression than the bzip2 tool itself, which is built on the same algorithm. However, unlike compression tools, Chaos Index also supports relational and text analytic queries, simultaneously, via its unique symbol matrix representation.
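
For readers unfamiliar with the Burrows-Wheeler algorithm mentioned above, here is a minimal sketch of the textbook transform in Python. It is not the Chaos Index implementation, which the source does not describe; the function names and the sentinel character are illustrative assumptions. The transform reorders characters so that similar contexts cluster together, which is what makes the output easier to compress, and it is fully reversible.

```python
def bwt(text: str, sentinel: str = "\x00") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel  # the sentinel marks the end of the original string
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "\x00") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original string plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


line = "GET /api/items 200 GET /api/items 200 GET /api/items 500"
assert inverse_bwt(bwt(line)) == line  # lossless round trip
```

This naive version sorts whole rotations and is only suitable for short strings; production tools use suffix-array techniques to do the same thing efficiently.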

 

A New Way to Prepare Data for Analysis

The Chaos Refinery® models and transforms data virtually into different shapes, even if the source is irregular or malformed. The advantages of such a locality of reference design are instant schema manipulation, aggregation, and correlation without physical ETL. Also, unlike traditional ETL, Chaos Refinery transforms include live ordering and filtering of a data source. In other words, result sets can be live (and not just static): as the data source changes, so can the virtual target. This is particularly relevant to real-time or streaming use cases, like log and event data. With ChaosSearch’s dimensional structure, parallel algorithms can be applied for discovery, refining, and querying, particularly at distributed scale.
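
The difference between a static ETL result and a live virtual target can be illustrated with a small Python sketch. This is only an analogy for the general idea of a virtual view, not how the Chaos Refinery works internally; the `VirtualView` class, the record fields, and the sample events are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, List

Record = Dict[str, Any]


@dataclass
class VirtualView:
    """A filter and ordering evaluated at read time, with no copy of the source."""
    source: List[Record]
    predicate: Callable[[Record], bool]
    sort_key: Callable[[Record], Any]

    def __iter__(self) -> Iterator[Record]:
        # Filtering and ordering happen on every read, so the result always
        # reflects the current contents of the source.
        return iter(sorted(filter(self.predicate, self.source), key=self.sort_key))


events: List[Record] = [
    {"ts": 3, "status": 500},
    {"ts": 1, "status": 200},
]
errors = VirtualView(events, lambda r: r["status"] >= 500, lambda r: r["ts"])

first = list(errors)                      # one matching record so far
events.append({"ts": 2, "status": 503})   # the source changes...
second = list(errors)                     # ...and the view reflects it
```

A physical ETL job would have produced `first` once and left it stale; the view is recomputed on demand, so no second copy of the data is ever written.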

In summary, using Amazon as an example, data can be stored in S3 at its true minimum size and not be ETL-ed back to Amazon EMR to then be duplicated into Amazon Redshift or Amazon OpenSearch for final analysis.

ChaosSearch scales the data — not the infrastructure.

 

A New Distributed Architecture

Based on the capabilities of the Chaos Index technology, ChaosSearch was built from the ground up to be elastic and massively scalable. In the era of cloud and big data, scale must be a base axiom. Given the ability to discover, refine, and query in a concurrent and distributed fashion, the Chaos Fabric® has enabled us to implement a telecom-inspired distributed architecture: an always-on, highly available, message-passing system that decouples storage from compute.

The team has spent years designing and building data and telecom networks, and knew that the time, effort, and cost required to create a proprietary framework didn’t make sense. As a result, we decided to select a time-proven telecom philosophy and corresponding framework, based on a powerful open-source foundation. Accordingly, we chose Akka, an open-source, Erlang-inspired framework based on the Scala programming language (think Erlang on a Java Virtual Machine).
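
The message-passing style that Akka borrows from Erlang can be sketched in a few lines. Akka itself is a Scala/Java library; the Python below is only an illustration of the actor idea (private state, a mailbox, one message handled at a time) and does not use or mimic Akka's API. The `Actor` class and `count_errors` helper are hypothetical names.

```python
import queue
import threading
from typing import Any, Callable, List


class Actor:
    """A tiny actor: a mailbox drained by a single dedicated thread."""

    _STOP = object()  # private marker used to shut the actor down

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._mailbox: "queue.Queue[Any]" = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message: Any) -> None:
        """Fire-and-forget send; the caller never touches the actor's state."""
        self._mailbox.put(message)

    def stop(self) -> None:
        """Ask the actor to finish queued messages, then wait for it to exit."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self._handler(message)  # messages are processed one at a time


def count_errors(lines: List[str]) -> int:
    """Count log lines containing a 500 status using a single actor."""
    counts = {"errors": 0}

    def handle(line: str) -> None:
        if " 500 " in line:
            counts["errors"] += 1

    actor = Actor(handle)
    for line in lines:
        actor.tell(line)
    actor.stop()
    return counts["errors"]
```

Because only the actor's own thread mutates its state, no locks are needed around `counts`; senders and receivers are coupled only by the messages they exchange.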

The ChaosSearch service is focused on S3 and GCS. However, architecturally this service is cloud-agnostic. From our use of the Scala/Akka framework to our use of Docker Swarm, the ChaosSearch architecture is distributed and portable. In particular, Docker Swarm is a major part of our service capabilities, and we use every aspect of it. The following are the key aspects of our use of Swarm:

Portability / Discovery / Security / Scalability / Scheduling / Performance

When pairing all this with our Angular2 user interface, the ChaosSearch service is a stack of components perfectly distributed and uniquely integrated.

 

High Performance and Low Cost

A significant challenge for businesses today is the inability to retain and analyze, at low cost, the deluge of log and event data generated by all sorts of applications and devices. At ChaosSearch, we have designed and built, from the ground up, a service that addresses the cost-prohibitive nature of storing and analyzing log and event data over months and years, without sacrificing speed or performance.

Our focus from the start has been to build a high-performance distributed architecture that can take advantage of modern cloud computing design patterns, most notably scaling in and out with different-sized workloads. However, our underlying data fabric and unique data representation are not a brute-force solution; we don’t just throw horsepower behind it. ChaosSearch was designed from the ground up to be fast and compact, not just for one use case, but for several.

In order to maintain this performance through product iterations, we routinely benchmark against other technologies. ChaosSearch has been architected to be:

  • As fast as an RDBMS for data ingestion
  • On par with an RDBMS for relational operations (order, join, group by)
  • As fast as Elasticsearch for text search
  • As small on disk as a column store
  • Lower cost of ownership than everything else

And we have priced the solution at a fraction of the cost of traditional logging, reducing the cost to store, search, query, and visualize historical log data by 80%.

 

Extending the Elasticsearch API and Kibana onto S3

ChaosSearch is a platform that provides a “high-definition” lens on top of your S3 and GCS infrastructure. With ChaosSearch you can discover your log data, organize it, catalog it, and quickly index it. We support the Elasticsearch API and have built Kibana directly into the console. We provide the Chaos Refinery, where you can create new dataset groupings that are universally and automatically indexed so that they can be sorted, aggregated, and/or correlated (think relational join) and then published as new analytic indices, accessible from the Elasticsearch API and the Kibana interface. With cost-effective access to months and years of log and event data, you can now ask questions like:

  • How often do we get internal server errors in proportion to successful requests?
  • Has this system exception occurred since we implemented a fix last month?
  • During this time period last year, was this endpoint ever accessed?
  • Which requests are causing us high rates of errors?
  • Have our error rates improved significantly over the last year?
  • Have our response times improved since updating our systems 3 months ago?
  • What country outside of the US has been accessing our site with high frequency over the past year?
  • How many purchases per minute were made on Cyber Monday of last year?
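
As an illustration, the first question above (internal server errors in proportion to successful requests) might be expressed as a single Elasticsearch-style request body. This is a sketch only: the field names `status` and `@timestamp`, the one-year window, and the helper names are assumptions about how the log data is indexed, and the source does not state which aggregations ChaosSearch supports.

```python
from typing import Any, Dict


def status_ratio_query(since: str = "now-1y") -> Dict[str, Any]:
    """Build a request body that counts 5xx and 2xx responses in one pass."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {"range": {"@timestamp": {"gte": since}}},
        "aggs": {
            "by_class": {
                "filters": {
                    "filters": {
                        "errors": {"range": {"status": {"gte": 500, "lte": 599}}},
                        "success": {"range": {"status": {"gte": 200, "lte": 299}}},
                    }
                }
            }
        },
    }


def error_ratio(response: Dict[str, Any]) -> float:
    """Compute errors per successful request from a keyed filters-aggregation response."""
    buckets = response["aggregations"]["by_class"]["buckets"]
    errors = buckets["errors"]["doc_count"]
    success = buckets["success"]["doc_count"]
    if success == 0:
        return float("inf") if errors else 0.0
    return errors / success
```

The body returned by `status_ratio_query()` would be sent to the `_search` endpoint of the relevant index with whatever Elasticsearch client you already use, and `error_ratio` applied to the parsed JSON response.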

Choosing a solution that is responsible for data retention and analysis is a monumental decision. The reason is simple: vendor lock-in. In other words, once data is in a proprietary solution, it can be expensive to move it into a new one. Typically the need to move is caused by scale problems, missing features, or just the sheer cost. At ChaosSearch we do not want to lock you in and we don’t want to store your data. We only require read-only access to “your” S3 or GCS data, and we’re easy to turn off. With a simple change of an IAM role, we no longer have rights to “your” data. We also wanted to provide a “standard” API and interface so that customers don’t have to change how they manage and analyze their data. As a result, we expose the S3 API and GCS REST APIs for storage and the Elasticsearch API for analytics. Each is the “de facto” standard today in its respective area. With ChaosSearch, we unlock the value of “your” log and event data, without locking “you” in.
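
To make the read-only access model concrete, here is a sketch of what a minimal read-only IAM policy for a single S3 bucket could look like, built as a Python dictionary. The exact permissions ChaosSearch asks for are not given in this post, so treat the action list, the helper name, and the bucket name as assumptions; `s3:ListBucket` and `s3:GetObject` are standard S3 actions.

```python
import json
from typing import Any, Dict


def read_only_bucket_policy(bucket: str) -> Dict[str, Any]:
    """Build an IAM policy document that allows listing and reading one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# "example-log-bucket" is a placeholder name.
policy_json = json.dumps(read_only_bucket_policy("example-log-bucket"), indent=2)
```

Because the policy grants no write or delete actions, detaching it from the role (or deleting the role) is all it takes to revoke access.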

So that’s it. That’s who we are and that’s what we’re doing at ChaosSearch...