CHAOSSEARCH is a service for searching, querying, introspecting and interrogating your historical log and event data. All you need to do is gather data from any source (your devices such as IoT, mobile and browsers; vendor software such as services and applications; or your own code) and dump it directly into your S3 account. CHAOSSEARCH is a new approach with innovative technology that both manages the tsunami of log and event data being generated today and unlocks the value of analyzing it. First and foremost, CHAOSSEARCH is an analytic service for detecting and inspecting historical log and event data over days, weeks, months and years, directly in your own S3 object storage infrastructure. This service is fundamentally unlike anything you have seen before, and it offers several important advantages and differences.
By 2020 the digital universe will reach 44 zettabytes — IDC
A New Approach
There is nothing available on the market quite like CHAOSSEARCH. Our service is at the forefront of storage and analytic convergence. Here is how our approach, feature set, and value compare to traditional logging solutions.
The Storage Difference
The growth in object storage, particularly Amazon S3, has surprised everyone, and object storage is now the preferred platform for collecting, storing and archiving today’s growing mountain of diverse and disjointed data. With its simple, elastic, and cost-efficient characteristics, object storage is fostering new trends in data lake architectures and business intelligence. In a nutshell, Amazon S3 provides:
- Limitless scalability – proven scale-out architecture
- High Availability – high data durability and protection
- Service-based – service oriented cloud-based storage
- Meta tagging – infrastructure for categorizing information
S3 is the logical place to analyze historical log and event data. — Thomas Hazel , CTO and Founder
The first step in scaling historical log and event analytics is deciding “where” and how such data will be stored for analysis. Businesses spend a lot of time and effort figuring out how to reduce the cost of today’s logging solutions, mostly by reducing the amount of data indexed and stored. Most businesses either delete data after 7 days or archive it in S3 for future analytics. The problem with archiving is that you need to wake the data up and move it back into other systems whenever you want to look at it. At CHAOSSEARCH we wanted to address this cost, complexity and scale problem, so we created an architecture that decouples storage from compute, and we virtually turned S3 into a warm, searchable Elastic cluster. S3 is transformed from a basic dumping ground for data into a live, searchable log management solution. By leveraging S3 cost economics (e.g. $25 for a TB) and its ease of use (i.e. store anything at any scale), CHAOSSEARCH turns “your” S3 account into a log management and analytic platform that can cost-effectively provide “live” access to terabytes of data over months and years. No data movement. No data duplication. No transformation.
A New Indexing Paradigm
Data size equals data cost. CHAOSSEARCH is based on a new, patent-pending database file format called “Data Edge”. The premise behind Data Edge is to compress/index data to its theoretical minimum. Data indexed by Data Edge is compressed by up to 5X – and sometimes more. And it supports both text search and relational queries on the same data footprint.
Data Edge is a representation of dimensional locality. One can view such dimensionality as a locality of reference, where data and references are optimized for both compression and indexing. This representation allows for greater compression ratios than standard tools using the same compression algorithms. For instance, Data Edge using the Burrows–Wheeler algorithm results in better compression than the actual bzip2 tool. But unlike compression tools, Data Edge supports both text search and relational queries via its unique representation, and does so simultaneously. At a 5X ratio, 1TB indexed by Data Edge compresses to 200GB, and the compression is lossless: data indexed by Data Edge can be perfectly reconstructed from this data format.
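To make the Burrows–Wheeler reference concrete, here is a minimal sketch of the textbook transform and its inverse. It is not the Data Edge format, whose internals are not described here; the function names and the sentinel character are assumptions chosen for illustration. It shows the two properties the paragraph relies on: the transform tends to group similar characters together, which helps later compression stages, and it is fully reversible.

```python
def bwt(text: str, sentinel: str = "\x00") -> str:
    """Textbook Burrows-Wheeler transform (naive, for illustration only)."""
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    # Sort every rotation of the string, then keep the last column.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "\x00") -> str:
    """Rebuild the original text from the transform (naive inverse)."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        # Prepend the transformed column to each row and re-sort.
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original string.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    line = "error error error warning error"
    encoded = bwt(line)
    print(repr(encoded))
    print(inverse_bwt(encoded) == line)
```

The naive version sorts full rotations, so it is only suitable for short strings; production compressors use suffix arrays or similar structures to do the same work efficiently.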
Data Edge is able to model and transform data virtually into different shapes, even if the source is irregular or malformed. The advantages of such a locality of reference design are its instant schema manipulation, aggregation, and correlation without physical ETL. And, unlike traditional ETL, Data Edge transforms include live ordering/filtering of a data source. In other words, result sets can be live (and not just static); as the data source changes so can this virtual target. This is particularly relevant to real-time or streaming use cases, like log and event data. Because of Data Edge’s dimensional structure, parallel algorithms can be applied for discovery, refining, and querying, particularly at distributed scale.
In summary, data can be stored in Amazon S3 at its true minimum size, without being ETL-ed back into Amazon EMR and then duplicated into Amazon Redshift or Amazon Elasticsearch for final analysis.
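As a rough illustration of “data size equals data cost”, the arithmetic can be sketched using only the figures quoted in this article: $25 for a TB of S3 storage and compression of up to 5X. The function names are hypothetical, the billing period is not specified here, and real S3 pricing varies by region and storage class.

```python
def stored_size_tb(raw_tb: float, compression_ratio: float) -> float:
    """Size on disk after compression, in TB."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return raw_tb / compression_ratio


def storage_cost(raw_tb: float, price_per_tb: float,
                 compression_ratio: float = 1.0) -> float:
    """Storage cost for raw_tb of source data at a given price per TB."""
    return stored_size_tb(raw_tb, compression_ratio) * price_per_tb


if __name__ == "__main__":
    # Figures taken from the article: $25 per TB, up to 5X compression.
    print(f"uncompressed: ${storage_cost(1, 25):.2f}")
    print(f"at 5X:        ${storage_cost(1, 25, 5):.2f}")
```

With these inputs, 1TB of raw data costs $25.00 to store uncompressed and $5.00 once reduced to 200GB.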
Data Edge scales your data, not your infrastructure. — Thomas Hazel , CTO and Founder
A Distributed Architecture
Based on the capabilities of the Data Edge technology, CHAOSSEARCH was built from the ground up to be elastic and serverless. In the era of cloud and big data, scale is an essential base axiom. Because Data Edge makes it possible to discover, refine, and query in a concurrent and distributed fashion, we were able to adopt a telecom-inspired distributed architecture: an always-on, highly available, message-passing system that decouples storage from compute.
The team has spent years designing and building data and telecom networks, and knew the time, effort and cost required to create a proprietary framework. Doing it yourself just didn’t make sense. As a result, we decided to adopt a time-proven telecom philosophy and a corresponding framework built on a powerful open-source foundation. Accordingly, we chose Akka, an open-source, Erlang-inspired framework based on the Scala programming language (think Erlang on a Java Virtual Machine).
This philosophy derives from design principles that have been used in the telecom industry for years, summarized in the PhD thesis of Erlang co-creator Joe Armstrong:
- Everything is a process
- Error handling is non-local
- Processes share no resources
- Processes have unique names
- Processes are strongly isolated
- Processes do what they are supposed to do or fail
- Process creation and destruction is a lightweight operation
- Message passing is the only way for processes to interact
- If you know the name of a process, you can send to it
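The principles above can be sketched in a few lines. This is a toy illustration in Python using threads and queues, not Akka and not the CHAOSSEARCH implementation; the `Registry` class, the `counter` behavior and the message shapes are all hypothetical. Each process owns a private mailbox and private state, is addressed by a unique name, interacts only through messages, and on failure stops and reports to a separately named supervisor.

```python
import queue
import threading

STOP = object()  # sentinel message asking a process to shut down


class Registry:
    """Maps unique process names to mailboxes (hypothetical helper)."""

    def __init__(self):
        self._mailboxes = {}
        self._lock = threading.Lock()

    def register(self, name):
        """Create a mailbox under a unique name and return it."""
        with self._lock:
            if name in self._mailboxes:
                raise ValueError(f"process name already taken: {name}")
            mailbox = queue.Queue()
            self._mailboxes[name] = mailbox
            return mailbox

    def send(self, name, message):
        """If you know the name of a process, you can send to it."""
        with self._lock:
            mailbox = self._mailboxes[name]
        mailbox.put(message)

    def spawn(self, name, behavior, supervisor=None):
        """Start a process that handles one message at a time."""
        mailbox = self.register(name)
        thread = threading.Thread(
            target=self._run, args=(name, mailbox, behavior, supervisor),
            daemon=True)
        thread.start()
        return thread

    def _run(self, name, mailbox, behavior, supervisor):
        state = None  # private to this process, never shared
        try:
            while True:
                message = mailbox.get()
                if message is STOP:
                    return
                state = behavior(state, message, self)
        except Exception as exc:
            # Do the job or fail: report to the supervisor (non-local).
            if supervisor is not None:
                self.send(supervisor, ("failed", name, repr(exc)))
        finally:
            with self._lock:
                self._mailboxes.pop(name, None)


def counter(state, message, registry):
    """Example behavior: keeps a running total in private state."""
    count = state or 0
    kind, payload = message
    if kind == "add":
        return count + payload
    if kind == "get":
        registry.send(payload, ("count", count))  # payload = reply-to name
        return count
    raise ValueError(f"unknown message: {kind}")


if __name__ == "__main__":
    registry = Registry()
    inbox = registry.register("main")
    registry.spawn("counter", counter, supervisor="main")
    registry.send("counter", ("add", 2))
    registry.send("counter", ("add", 3))
    registry.send("counter", ("get", "main"))
    print(inbox.get(timeout=2))
```

Operating-system threads are far heavier than Erlang processes or Akka actors, so this sketch illustrates the messaging discipline only, not the lightweight process creation that those runtimes provide.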
Joe Armstrong remarked in an interview several years ago: “If Java is ‘write once, run anywhere’, then Erlang is ‘write once, run forever’.” The CHAOSSEARCH Team agrees!
Today, the chaossearch.io service is focused on S3 and the AWS cloud. However, architecturally this service is cloud agnostic. From our use of the Scala/Akka framework to our use of Docker Swarm, the CHAOSSEARCH architecture is distributed and portable. In particular, Docker Swarm is a major aspect of our service capabilities. There is no aspect of Swarm that we don’t use.
High Performance at Low Cost
A significant challenge to businesses today is the inability to retain and analyze the deluge of log and event data generated by all sorts of applications and devices – at low cost. At CHAOSSEARCH, we have designed and built – from the ground up – a service that addresses the cost prohibitive nature of storing and analyzing log and event data over months and years, without sacrificing speed or performance.
Our focus from the start has been to build a high-performance distributed architecture that can take advantage of modern cloud computing design patterns, most notably scaling in and out with different sized workloads. However, our underlying data fabric and unique data representation, called Data Edge, is not a brute force solution: we don’t just throw horsepower behind it. Data Edge was designed from the ground up to be fast and compact, not just for one use case, but for several.
In order to maintain this performance through product iterations, we routinely benchmark against other technologies. CHAOSSEARCH has been architected to be priced at a fraction of the cost of traditional logging solutions – reducing the cost to store, search, query and visualize historical log data by 90%.
CHAOSSEARCH is solving critical and complex problems without sacrificing quality and cohesiveness, all while keep the cost extremely competitive — Mo Wajdan,
Lead Software Architect, Jitterbit
Turns S3 into an Elastic cluster
CHAOSSEARCH is a platform that provides a “high-definition” lens on top of your S3 infrastructure. With CHAOSSEARCH you can discover your log and event data (think cataloging) and organize it (think object grouping) so that it can ultimately be indexed. We’ve extended S3 with an Elasticsearch API and built Kibana directly into the console. We provide a “data refinery” where you can create new datasets by aggregating and/or correlating these groupings (think relational join). These datasets are automatically published as new analytic indices, accessible from an Elasticsearch API and Kibana interface. With cost-effective access to months and years of log and event data you can now ask questions like:
- How often did server errors occur in proportion to successful requests?
- Has this exception occurred since the fix was released last month?
- During this time last year, was this endpoint ever accessed?
- What requests are causing us high rates of errors?
- Have our error rates improved significantly over the last year?
- Have our response times improved since the last fix/upgrade?
- Which countries have been accessing our site over the past year?
- How many purchases were made on Cyber Monday last year?
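As an illustration, the first question above could be phrased as a standard Elasticsearch query body. This is a sketch under assumptions: the field names `status` and `@timestamp` are hypothetical and depend on how your logs are indexed, and the article does not state which parts of the Elasticsearch query DSL the service supports. The code only builds the request and reads a response of the usual shape; it does not call any endpoint.

```python
def error_ratio_query(status_field="status", time_field="@timestamp"):
    """Build a query counting 5xx and 2xx responses over the past year."""
    return {
        "size": 0,  # only the aggregation counts are needed, not the hits
        "query": {"range": {time_field: {"gte": "now-1y", "lte": "now"}}},
        "aggs": {
            "by_outcome": {
                "filters": {
                    "filters": {
                        "server_errors": {
                            "range": {status_field: {"gte": 500, "lte": 599}}
                        },
                        "successes": {
                            "range": {status_field: {"gte": 200, "lte": 299}}
                        },
                    }
                }
            }
        },
    }


def error_ratio(response):
    """Server errors per successful request, from a filters-agg response."""
    buckets = response["aggregations"]["by_outcome"]["buckets"]
    errors = buckets["server_errors"]["doc_count"]
    successes = buckets["successes"]["doc_count"]
    if successes == 0:
        return None  # ratio is undefined without successful requests
    return errors / successes


if __name__ == "__main__":
    # Made-up response, shaped like Elasticsearch's keyed filters output.
    fake_response = {
        "aggregations": {
            "by_outcome": {
                "buckets": {
                    "server_errors": {"doc_count": 25},
                    "successes": {"doc_count": 1000},
                }
            }
        }
    }
    print(error_ratio(fake_response))
```

The same body could be sent through any Elasticsearch client or built interactively in Kibana; the counts in the sample response are invented purely to show the calculation.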
All on Your S3 infrastructure
Choosing a solution that is responsible for data retention and analysis is a monumental decision. The reason is simple: vendor lock-in. In other words, once data is in a proprietary solution, it can be expensive to move it into a new solution. Typically the need to move is caused by scale problems, missing features, or just the sheer cost of the solution. At CHAOSSEARCH we do not want to lock you in and we don’t want to store your data. We only require read-only access to “your” S3 data. And we’re easy to turn off. With a simple change of an IAM role, we no longer have rights to “your” data. We also wanted to provide a “standard” API and interface so that customers don’t have to change how they manage and analyze their data. As a result, we expose an S3 API for storage and an Elasticsearch API for analytics. Each is now the “de facto” standard in its respective area. With CHAOSSEARCH, we unlock the value of “your” log and event data, without locking “you” in.
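For readers unfamiliar with IAM, read-only access to a bucket might look roughly like the policy document built below. This is a generic sketch, not the policy CHAOSSEARCH actually asks for; the bucket name is hypothetical, and the exact permissions and role setup should come from the service’s own onboarding instructions.

```python
import json


def read_only_policy(bucket: str) -> dict:
    """Build an IAM policy document granting read-only access to a bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "my-log-bucket" is a placeholder name used only for illustration.
    print(json.dumps(read_only_policy("my-log-bucket"), indent=2))
```

Because the policy grants no write or delete actions, detaching it from the role (or deleting the role) is enough to end access, which is the “easy to turn off” property described above.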