Back in 2009, I joined a startup that wanted to solve a problem around email archiving. At the time, Amazon Web Services had only been publicly available for a few years, with SQS, EC2, and S3 among its very early services. This tiny company wanted to use the economics of the cloud to store data for long periods of time. The ability to scale compute and storage as the company grew was a huge reason it became one of the earliest and largest users of S3 and EBS.
But we struggled with how to make this data (mostly email and document data) available to customers in an easy way. We tried MANY different data stores until, in 2010, we came across a new one that had figured out how to scale Lucene indexes: Elasticsearch. We very quickly became a huge early user of Elasticsearch, at one point employing many of the top ES code committers.
Fast forward eight years, and Elasticsearch is now one of the most popular open source databases in use. Its rise was driven by two other popular open source projects, Kibana and Logstash, which made it incredibly simple to consume and process machine-generated logs and provided an easy-to-use API endpoint for querying that data. As someone who’s scaled Elasticsearch clusters to petabyte scale with hundreds of billions of documents, it’s been one of the most amazing tools to have at my disposal. The problem is that you are left with dozens, or hundreds, of Elasticsearch servers. They were designed for fast ingestion and querying of data, yet keeping data for any reasonable length of time becomes too expensive. Because of that cost, many companies (including ones I’ve worked for in the past) end up simply throwing the data away. The data becomes a liability to the business from a cost standpoint. If you can’t store data cost effectively AND query it, why even bother keeping it around?
That’s why I was so excited when I met with the CHAOSSEARCH folks and had a chance to see what they had built. This team figured out a way to finally decouple storage and compute when it comes to managing and querying huge datasets. They can process your data as it sits in your S3 bucket, making it both compressed (for cost and performance) and queryable through familiar Elasticsearch APIs. When I learned that their new storage format was lossless as well, I knew I needed to get much more involved in the company. There was finally a way to get value out of all that data sitting in S3, and to make it cheap for companies to SAVE the data they had been throwing away.
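To make that concrete, here is a minimal sketch of what a query against that kind of Elasticsearch-compatible endpoint could look like. Everything here is illustrative: the endpoint URL, index name, and field names are hypothetical placeholders, not CHAOSSEARCH's actual API surface. The query body itself is ordinary Elasticsearch Query DSL; the point is that only the endpoint changes.

```python
# A minimal sketch of querying data through an Elasticsearch-compatible
# endpoint. The URL, index name, and field names are hypothetical
# placeholders, NOT CHAOSSEARCH's actual API; the body is standard
# Elasticsearch Query DSL.
import json

import requests

# Hypothetical Elasticsearch-compatible search endpoint.
SEARCH_URL = "https://search.example.com/logs/_search"

query = {
    "query": {
        "bool": {
            "must": [{"match": {"level": "ERROR"}}],
            # Look back 90 days -- far longer than most clusters retain.
            "filter": [{"range": {"@timestamp": {"gte": "now-90d", "lte": "now"}}}],
        }
    },
    "size": 10,
}

resp = requests.post(SEARCH_URL, json=query)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(json.dumps(hit["_source"]))
```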
Now imagine being able to query datasets spanning 90, 180, or even 365 days without running ANY Elasticsearch servers. And since the new storage format is lossless, customers can shove their raw data into Glacier (or just delete it) and save even more money and headaches.
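The "shove raw data into Glacier" half of that story is plain S3. Here is a minimal sketch using boto3 to add a lifecycle rule that transitions raw objects to Glacier and eventually expires them; the bucket name, prefix, and day counts are hypothetical assumptions you would tune to your own retention policy.

```python
# A minimal sketch (boto3) of archiving raw data to Glacier with a
# standard S3 lifecycle rule. The bucket name, prefix, and day counts
# are hypothetical assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-raw-logs",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                # Once the data is indexed and queryable, the raw copy
                # only needs to exist as a cheap, lossless backup.
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                # Or drop the raw copy entirely after a year.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```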
I am thrilled to join CHAOSSEARCH as VP, Product and am excited to build out the Product organization and help customers solve their data problems. As a technical operator for nearly 20 years, I understand how hard a problem it is to make the long tail of data available in a cost-effective way. I don’t think companies should have to choose between saving money and keeping data that could have value to the business. For the first time in forever, companies can save MORE data, run FEWER servers, and spend LESS money. I can’t wait to work with companies and help them finally realize the value of their data, in ways they never thought possible.
Over the next few months, I’m going to be looking for companies to work with us as design partners. If you are a company running on AWS, using Elasticsearch, and wishing you could store more data and run fewer servers, please reach out to me (my Twitter DMs are open). I would love to chat about your use cases and how we can help you do more with your data.