What is CHAOSSEARCH?

CHAOSSEARCH is a service on AWS for searching, introspecting, and interrogating your historical log and event data. At a high level, we have extended the Elasticsearch API and Kibana onto Amazon S3, cost-effectively opening up access to search, query, and visualize terabytes of log and event data.


All you need to do is gather data from any source – your clients (mobile, IoT, browsers, applications), vendored software, or your own code – and dump it directly into your S3 account. CHAOSSEARCH is a new approach with innovative technology that both manages the tsunami of log and event data being generated today and unlocks the value of analyzing it. First and foremost, CHAOSSEARCH is a search and analytics service for detecting and inspecting historical log and event data over days, weeks, months, and years, directly in your own S3 object storage infrastructure. This service is fundamentally unlike anything you have seen before – there are several important advantages and differences…
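
For example, landing logs in S3 requires nothing more than a standard S3 upload. Here is a minimal sketch using boto3, the AWS SDK for Python; the bucket, key, and file names below are hypothetical placeholders:

    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")

    # Any S3 location you own will do; CHAOSSEARCH later reads it in place.
    # The bucket name, key, and local path here are illustrative only.
    s3.upload_file(
        Filename="/var/log/app/app.log",
        Bucket="my-company-logs",
        Key="app/2019/01/15/app.log",
    )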


“By 2020 the digital universe will reach 44 zettabytes” – IDC

A Completely New Approach

There is nothing available on the market quite like CHAOSSEARCH. Here is how our approach, feature set, and value compare to traditional logging solutions.

An Explosion in Log and Event Data

The move to the cloud is happening at warp speed. Whether you’re building a cloud-native application on AWS or you’re a software-driven business moving applications to the cloud, you’re probably seeing an explosion in log and event data. SaaS businesses generate hundreds of gigabytes of log and event data daily, with many generating terabytes. But if you’re running a logging solution based on Elasticsearch, cost and data access can be a major issue. Most businesses archive or delete log and event data after a few days or weeks because it’s too cost-prohibitive to retain with existing Elasticsearch-based log management solutions like the ELK stack. CHAOSSEARCH has extended the Elasticsearch API to run directly on S3, allowing you to separate hot and warm data stores and reduce the cost and complexity of your long-term, warm ELK stack – a completely new approach that cost-effectively opens up access to the long tail of your log and event data.

Based on S3 Object Storage

The growth in object storage, particularly Amazon S3, has surprised everyone; it is now the preferred platform for collecting, analyzing, and archiving today’s growing mountain of diverse and disjointed data. With its simple, elastic, and cost-efficient characteristics, object storage is fostering new trends in data lake architectures and business intelligence. In a nutshell, Amazon S3 provides:


  • Limitless scalability – proven scale-out architecture
  • Durability – high data durability protects your data
  • Software-based – deploy on standard servers or cloud-based resources
  • Metadata tagging – embedded infrastructure for categorizing information


The first step in scaling historical log and event analytics is deciding where and how such data will be stored so it can be analyzed. Businesses spend a lot of time and effort figuring out how to reduce the cost of today’s logging solutions – mostly by reducing the amount of data indexed and stored. What most businesses do is delete data after seven days, or simply archive it into S3 for future analytics. The problem is that you then need to wake the data up and move it back into other systems for further analysis. At CHAOSSEARCH we wanted to address this cost, complexity, and scale problem – so we created an architecture that decouples storage from compute, and we turned S3 into a warm, searchable Elastic cluster. S3 is transformed from a basic dumping ground for data into a live, searchable log management solution. By leveraging S3’s cost economics (e.g. roughly $25 per TB per month) and its ease of use (store anything at any scale), CHAOSSEARCH turns “your” S3 account into a log management and analytics platform that can cost-effectively provide “live” access to terabytes of data over months and years. No data movement. No data duplication. No transformation.
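
To make that cost point concrete, here is a back-of-the-envelope sketch using the ~$25 per TB per month figure above; the daily log volume is a made-up example, not a benchmark:

    # Back-of-the-envelope S3 retention cost at ~$25/TB/month (figure from above).
    S3_USD_PER_TB_MONTH = 25.0
    daily_volume_tb = 0.5        # hypothetical: 500 GB of logs per day
    retention_days = 365         # keep a full year live and searchable

    stored_tb = daily_volume_tb * retention_days      # 182.5 TB after one year
    monthly_cost = stored_tb * S3_USD_PER_TB_MONTH    # ~$4,562 per month
    print(f"{stored_tb:.1f} TB retained -> ${monthly_cost:,.0f}/month in S3")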

A New Indexing Paradigm

Data size equals data cost. CHAOSSEARCH is based on a new, patent-pending database file format called “Data Edge”. The premise behind Data Edge is to store and compress data to its theoretical minimum. Data indexed by Data Edge is compressed by up to 5X – sometimes even more. And it supports both text search and relational queries on the same data footprint.


Data Edge is a representation of dimensional locality via infinite matrices. One can view such matrices as a locality of reference, where symbols and references are optimized for both compression and cataloging. This representation allows for greater compression ratios than standard tools achieve using the same compression algorithms. For example, Data Edge’s use of the Burrows-Wheeler transform results in better compression than the actual bzip2 tool. Unlike compression tools, however, Data Edge allows for both relational and text analytic queries via its unique symbol matrix representation – and does so simultaneously.
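
As an aside, the Burrows-Wheeler transform mentioned above is easy to sketch. The following is a minimal, naive Python illustration of the transform itself (not CHAOSSEARCH’s implementation), showing how it groups like symbols together so that downstream compression works better:

    def bwt(s: str, sentinel: str = "\x00") -> str:
        """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
        s += sentinel  # unique end-of-string marker makes the transform invertible
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rotation[-1] for rotation in rotations)

    # Like symbols cluster together, which run-length/entropy coders exploit:
    print(repr(bwt("banana")))  # 'annb\x00aa' -- the a's end up adjacent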


Data Edge is able to model and virtually transform data into different shapes, even if the source is irregular or malformed. The advantages of such a locality-of-reference design are instant schema manipulation, aggregation, and correlation without physical ETL. Also, unlike traditional ETL, Data Edge transforms include live ordering and filtering of a data source. In other words, result sets can be live (and not just static); as the data source changes, so can this virtual target. This is particularly relevant to real-time or streaming use cases, like log and event data. And because of Data Edge’s dimensional structure, parallel algorithms can be applied for discovery, refining, and querying, particularly at distributed scale.
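
The “live virtual target” idea can be pictured with a toy sketch – our illustration of the concept, not Data Edge’s internals. A view stores only a filter and an ordering, and re-evaluates them against the source on every read, so it always reflects the current source:

    class VirtualView:
        """Toy 'live' virtual target: holds a predicate and an ordering,
        and re-evaluates them against the source each time it is read."""
        def __init__(self, source, predicate, key):
            self.source, self.predicate, self.key = source, predicate, key

        def rows(self):
            return sorted((r for r in self.source if self.predicate(r)), key=self.key)

    events = [{"level": "ERROR", "ts": 3}, {"level": "INFO", "ts": 1}]
    errors = VirtualView(events, lambda r: r["level"] == "ERROR", key=lambda r: r["ts"])
    print(len(errors.rows()))   # 1
    events.append({"level": "ERROR", "ts": 2})
    print(len(errors.rows()))   # 2 -- the view tracks the changed source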


In summary, data can be stored in Amazon S3 at its true minimum size, rather than being ETL-ed back through Amazon EMR only to be duplicated into Amazon Redshift or Amazon Elasticsearch Service for final analysis.


Data Edge scales the data – not the infrastructure

Based on a Distributed Architecture

Building on the capabilities of the Data Edge technology, CHAOSSEARCH was built from the ground up to be elastic and serverless. In the era of cloud and big data, scale is a base axiom. The ability Data Edge gives us to discover, refine, and query in a concurrent and distributed fashion enabled us to adopt a telecom-inspired distributed architecture – an always-on, highly available, message-passing system that decouples storage from compute.


The team has spent years designing and building data and telecom networks, and knew the time, effort, and cost required to create a proprietary framework didn’t make sense. As a result, we decided to adopt a time-proven telecom philosophy and a corresponding framework built on a powerful open-source foundation. Accordingly, we chose Akka, an open-source, Erlang-inspired framework based on the Scala programming language (think Erlang on a Java Virtual Machine).


The chaossearch.io service is focused on S3 and the AWS cloud; architecturally, however, the service is cloud-agnostic. From our use of the Scala/Akka framework to our use of Docker Swarm, the CHAOSSEARCH architecture is distributed and portable. Docker Swarm in particular is a major aspect of our service capabilities – we utilize it across the board, most notably for:


  • Portability
  • Discovery
  • Security
  • Scalability
  • Scheduling
  • Performance


When pairing all of this with our Angular2 user interface, the CHAOSSEARCH service is a stack of components, perfectly distributed and uniquely integrated.

High-Performance and Low Cost

A significant challenge to businesses today is the inability to retain and analyze, at low cost, the deluge of log and event data generated by all sorts of applications and devices. At CHAOSSEARCH, we have designed and built – from the ground up – a service that addresses the cost-prohibitive nature of storing and analyzing log and event data over months and years, without sacrificing speed or performance.

Our focus from the start has been to build a high-performance distributed architecture that can take advantage of modern cloud computing design patterns, most notably scaling in/out with different-sized workloads. However, our underlying data fabric and unique data representation, Data Edge, is not a brute-force solution – we don’t just throw horsepower behind it. Data Edge was designed from the ground up to be fast and compact for not just one use case, but several.


To maintain this performance through product iterations, we routinely benchmark against other technologies. CHAOSSEARCH has been architected to be:


  • As fast as an RDBMS for data ingestion
  • On par with an RDBMS for relational operations (order, join, group by)
  • As fast as Elasticsearch for text search
  • As small on disk as a column store
  • Lower in total cost of ownership than everything else


And we have priced the solution at a fraction of the cost of traditional logging solutions – reducing the cost to store, search, query, and visualize historical log data by 90%.

Extending the Elasticsearch API and Kibana onto S3

CHAOSSEARCH is a platform that provides a “high-definition” lens on top of your S3 infrastructure. With CHAOSSEARCH you can discover your log data, organize it, catalog it, and quickly index it. We’ve extended the S3 REST API and the Elasticsearch API onto S3, and built Kibana directly into the console. We provide a “data refinery” where you can create new dataset groupings that can be universally and automatically indexed such that they can be sorted, aggregated, and/or correlated (think relational join), then published as new analytic indices – accessible from the Elasticsearch API and the Kibana interface. With cost-effective access to months and years of log and event data, you can now ask questions like the following (see the query sketch after this list):


  • How often do we get internal server errors in proportion to successful requests?
  • Has this system exception occurred since we implemented a fix last month?
  • During this time period last year, was this endpoint ever accessed?
  • Which requests are causing us high rates of errors?
  • Have our error rates improved significantly over the last year?
  • Have our response times improved since updating our systems 3 months ago?
  • What country outside of the US has been accessing our site with high frequency over the past year?
  • How many purchases per minute were made on Cyber Monday of last year?
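
As an illustration, the first question above maps naturally onto a standard Elasticsearch-style query. The sketch below uses the Python requests library against a hypothetical endpoint; the index name and the `status` field are placeholder assumptions, not CHAOSSEARCH specifics:

    import requests  # assumes the `requests` package is installed

    # Hypothetical Elasticsearch-compatible endpoint and index name.
    ENDPOINT = "https://search.example.com"
    INDEX = "weblogs"

    # Ratio of 5xx responses to 2xx responses over the past year, computed
    # with two standard Elasticsearch filter aggregations on a `status` field.
    query = {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-1y"}}},
        "aggs": {
            "errors":    {"filter": {"range": {"status": {"gte": 500, "lt": 600}}}},
            "successes": {"filter": {"range": {"status": {"gte": 200, "lt": 300}}}},
        },
    }

    resp = requests.post(f"{ENDPOINT}/{INDEX}/_search", json=query).json()
    errors = resp["aggregations"]["errors"]["doc_count"]
    successes = resp["aggregations"]["successes"]["doc_count"]
    print(f"internal server errors per successful request: {errors / successes:.6f}")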


Choosing a solution to be responsible for data retention and analysis is a monumental decision. The reasoning is simple: vendor lock-in. Once data is in a proprietary solution, it can be expensive to move it into a new one – typically because of scale problems, missing features, or just the sheer cost of it. At CHAOSSEARCH we do not want to lock you in, and we don’t want to store your data. We only require read-only access to “your” S3 data – and we’re easy to turn off. With a simple change of an IAM role, we no longer have rights to “your” data. We also wanted to provide a “standard” API and interface so that customers don’t have to change how they manage and analyze their data. As a result, we export an S3 API for storage and the Elasticsearch API for analytics – each the de facto standard in its respective area. With CHAOSSEARCH, we unlock the value of “your” log and event data, without locking “you” in.
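
For instance, the read-only grant (and its revocation) can be expressed as an ordinary inline IAM role policy. Here is a minimal boto3 sketch; the role and bucket names are hypothetical:

    import json
    import boto3  # AWS SDK for Python

    iam = boto3.client("iam")

    # Read-only S3 actions scoped to one (hypothetical) log bucket.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-company-logs",
                "arn:aws:s3:::my-company-logs/*",
            ],
        }],
    }

    iam.put_role_policy(
        RoleName="ChaosSearchReadOnly",      # hypothetical role name
        PolicyName="s3-read-only",
        PolicyDocument=json.dumps(policy),
    )

    # Turning access off is one call: delete the inline policy.
    # iam.delete_role_policy(RoleName="ChaosSearchReadOnly", PolicyName="s3-read-only")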


So that’s it. That’s who we are and that’s what we’re doing at CHAOSSEARCH…