ChaosSearch Blog - Tips for Wrestling Your Data Chaos

ELK Stack Costs Add Up: Here’s How to Switch

Written by David Bunting | Feb 1, 2024

Calculating the true cost of an Elasticsearch or ELK (Elasticsearch, Logstash, Kibana) stack environment can be difficult. Many teams start out thinking that free, open-source tools will save them money, when in reality Elasticsearch costs are often much higher than anticipated.

That’s because centralized logging in ELK Stack isn’t efficient. An important consideration you may overlook is how long you want to retain log data. The underlying infrastructure of Elasticsearch is complex and costly, and those costs only compound as you scale and store more data. Let’s explore some of the common use cases, why ELK for log analysis isn’t necessarily effective, and some common ELK alternatives.

 

 

Why is a Log Management Platform Important?

If you’re reading this post, chances are you already understand the importance of log management systems. Collecting log data provides the insights an organization needs to run more effectively, and more securely. The sum of an organization’s log data provides the details of the entire IT environment in real-time, or at any point in time in history. Server logs often contain details on machine and network traffic, user access, changes to applications and services, and countless other pointers used to monitor the health and security status of the IT landscape.

A log analytics solution allows users to extract intelligence from the data by running simple searches and complex queries, conducting trend analyses, and building data visualizations.

Key functions, including cybersecurity, infrastructure monitoring, customer support, and business intelligence, rely on log data analytics in their day-to-day operations.

The underlying architectural limits of the ELK stack make scaling cumbersome and costly, and constrain the IT team’s ability to provide access to all the log data that the various groups within their organization require, which hampers many of the use cases above. Organizations need to ensure that their log management systems provide access to all relevant log data, including historical data. This data retention requirement is where many companies encounter issues today.

 

 

Why the ELK Stack is Not Free: Calculating Costs

Contrary to popular belief, there’s no such thing as getting the ELK Stack for free. Because ELK is open source, getting started is simple and relatively inexpensive. However, as ELK stack environments scale up to incorporate more data sources, and customers look to retain data beyond a few days, the overall deployment quickly becomes complex, causing costs to rise rapidly.

At the root of the problem is the distributed architecture, which requires data to be partitioned and stored across numerous shards, with separate servers deployed so that each one is responsible for its portion of the data. While it may be easy to deploy and begin using the ELK stack with a low initial investment, most organizations quickly face ELK stack cluster sprawl, in which they are managing, and paying for, significant compute and storage resources.

Organizations must consider all of the costs and calculate the overall total cost of ownership of their ELK environment, including compute and storage, operations, and support. When the full costs are bundled and assessed, the results are often surprising. For example, the TCO of a relatively small environment can easily exceed $2 million over a three-year time period.

 

Compare & Save

Guaranteed 50% savings for spend above $20k/month or 1 TB/day

Cost per year at an average daily ingest of 1,000 GB/day:

| Solution | Data retention | Cost per year |
| --- | --- | --- |
| ChaosSearch | Unlimited | $144K ($0.30/GB + tenant cost) |
| Datadog | 30 days | $900K |
| CloudWatch | 30 days | $346K |
| Splunk | 30 days | $715K |
| ELK Stack | 30 days | $495K |

The ChaosSearch price is based on an ingest-based model with an annual contract paid upfront and a single tenant in us-east-1. Competitors’ prices are based on the companies’ websites. Consumption-based pricing is also available.
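The ChaosSearch figure can be roughly sanity-checked from the quoted ingest rate. This sketch assumes the $0.30/GB rate applies to every GB ingested over a 365-day year. The tenant cost is not stated in this post, so it is left as a parameter and only the implied remainder is printed. The function name is illustrative.

```python
def annual_ingest_cost(daily_ingest_gb, rate_per_gb=0.30, tenant_cost=0.0, days=365):
    """Annual cost of an ingest-priced service (assumed flat per-GB rate)."""
    return daily_ingest_gb * days * rate_per_gb + tenant_cost


# 1,000 GB/day at the quoted $0.30/GB, before any tenant cost
ingest_only = annual_ingest_cost(1000)
print(f"Ingest charges only: ${ingest_only:,.0f}")

# Whatever is left relative to the quoted $144K would be the tenant cost
# (an inference from the published figures, not a stated price).
print(f"Implied remainder vs. $144K: ${144_000 - ingest_only:,.0f}")
```

Swapping in your own daily ingest gives a quick first estimate before requesting a quote.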

 

Small ELK Stack Environment Overview:

  • 500 GB of daily log data ingest in year one
  • 75% annual data growth rate
  • 60 day active data retention
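
To see why growth and retention compound, the three assumptions above can be projected forward. This is only a rough sketch: it assumes growth is applied as a single step at the end of each year, that 1 TB = 1,000 GB, and it ignores replicas and index overhead, none of which this post specifies. The function name is illustrative.

```python
def project_active_data(daily_gb, growth_rate, retention_days, years=3):
    """Return (year, daily ingest in GB, active data in TB) for each year."""
    rows = []
    for year in range(1, years + 1):
        active_tb = daily_gb * retention_days / 1000  # assumes 1 TB = 1,000 GB
        rows.append((year, daily_gb, active_tb))
        daily_gb *= 1 + growth_rate  # step growth applied at year end
    return rows


for year, daily, active in project_active_data(500, 0.75, 60):
    print(f"Year {year}: {daily:,.1f} GB/day ingest, {active:,.1f} TB active")
```

Under these assumptions, the data held in active storage roughly triples between year one and year three, from 30 TB to about 92 TB.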

 

3 Year ELK Stack TCO

| | Year 1 | Year 2 | Year 3 | Total 3 Year TCO |
| --- | --- | --- | --- | --- |
| AWS Compute & Storage | $361,404 | $627,095 | $1,064,394 | $2,052,894 |
| Operations Staffing | $21,750 | $27,750 | $44,813 | $94,313 |
| Elastic Software Support | $44,000 | $68,000 | $108,000 | $220,000 |
| TOTAL | $427,154 | $722,845 | $1,217,207 | $2,367,206 |
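
The roll-up in the table is straightforward to reproduce, which helps when substituting your own figures. A minimal sketch using the numbers above; the dictionary layout and function names are illustrative only.

```python
# Yearly costs for the small environment, taken from the table above.
SMALL_ENV_COSTS = {
    "AWS Compute & Storage": [361_404, 627_095, 1_064_394],
    "Operations Staffing": [21_750, 27_750, 44_813],
    "Elastic Software Support": [44_000, 68_000, 108_000],
}


def yearly_totals(costs):
    """Sum every cost category for each year."""
    return [sum(year) for year in zip(*costs.values())]


def three_year_tco(costs):
    """Total cost of ownership across all years and categories."""
    return sum(yearly_totals(costs))


print(yearly_totals(SMALL_ENV_COSTS))
print(f"${three_year_tco(SMALL_ENV_COSTS):,}")
```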

 

From this, it’s easy to understand how customers with larger environments face exorbitant annual costs, with TCOs in the tens of millions of dollars. Indeed, the paper shows that a customer with 20 TB of log data ingested per day will face a 3-year TCO of over $65 million!

 

 Large ELK Stack Environment Overview:

  • 20 TB of daily log data ingest in year one
  • 35% annual data growth rate
  • 60 day active data retention

 

3 Year ELK Stack TCO

| | Year 1 | Year 2 | Year 3 | Total 3 Year TCO |
| --- | --- | --- | --- | --- |
| AWS Compute & Storage | $14,057,743 | $18,824,305 | $25,296,333 | $58,178,381 |
| Operations Staffing | $528,750 | $707,000 | $952,700 | $2,188,450 |
| Elastic Software Support | $1,264,000 | $1,700,000 | $2,292,000 | $5,256,000 |
| TOTAL | $15,850,493 | $21,231,305 | $28,541,033 | $65,622,831 |
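
One way to compare the two environments is to normalize year-one cost by the volume ingested. This is a rough sketch, assuming ingest stays flat at the stated daily rate for all 365 days of year one and that 1 TB = 1,000 GB (neither is stated in this post). The function name is illustrative.

```python
def cost_per_gb_ingested(annual_cost, daily_ingest_gb, days=365):
    """Annual cost divided by total GB ingested over the year."""
    return annual_cost / (daily_ingest_gb * days)


# Year-one totals from the two TCO tables
small = cost_per_gb_ingested(427_154, 500)         # 500 GB/day
large = cost_per_gb_ingested(15_850_493, 20_000)   # 20 TB/day

print(f"Small environment: ${small:.2f} per GB ingested")
print(f"Large environment: ${large:.2f} per GB ingested")
```

On these assumptions both environments come out above $2 per GB ingested in year one, so scale by itself lowers the unit cost only modestly.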

 

Once the cost of your ELK stack environment is well understood, the next question is “now what?”

 

Motivations to Switch from ELK

As ELK customers know, the TCO problem is more than simply a budgeting issue; it has a ripple effect. While it might seem easiest to “throw money at the problem” by continually expanding the budget to cover the increasingly complex infrastructure and the personnel needed to manage it, customers will inevitably face painful tradeoffs. In practice, the team responsible for the ELK environment will need to curtail growth by limiting the amount of data ingested per day and shortening the data retention period. Because the centralized log management system provides the “single source of truth” for a given IT environment, decisions to limit the data captured and available for analysis create insight gaps.

Given how vital access to log messages is for all of the use cases that depend on it, these gaps could have devastating effects. Whether investigating a data breach or conducting trend analysis to determine infrastructure requirements, gaps in the data can lead to faulty analyses. And the problem with log data is that it’s hard to know what you’ll need until you need it, so when making tradeoff decisions, the team managing the ELK environment is flying blind.

Moreover, as ELK environments grow in size and complexity, they become unstable. When pushed beyond the designed architectural scalability limits, ELK deployments frequently experience outages, which can have a severe negative impact on the operations that rely on them.

 

 

ELK Alternatives For Centralized Log Management

Here are some options users typically consider for centralized log management and observability.

 

OpenSearch

When Elasticsearch changed its licensing model, many users looked to OpenSearch as an open source, community-oriented alternative. Even so, the underlying infrastructure for OpenSearch is similar to ELK. That means users face the same challenges with OpenSearch as with Elasticsearch, including the management complexity of the ELK Stack and escalating data retention costs.

 

Datadog

Many teams consider Datadog for full-stack observability, and think about log management under the same umbrella. While Datadog is great for observability, application and infrastructure monitoring, and real-time alerting, its costs, like those of Elasticsearch, start mounting when it comes to retaining log data beyond just a few days or weeks.

Read more: 3 Straightforward Pros and Cons of Datadog

 

CloudWatch

Many Amazon customers consider CloudWatch a viable alternative to ELK for log analytics and troubleshooting, since it is well-integrated with other AWS services. However, in a cloud-native environment, the sheer volume of data may cause it to become unwieldy. It lacks the data integration depth and correlation features necessary to recognize very complex patterns or perform root-cause analysis across multiple large data sources.

Read more: Going Beyond CloudWatch: 5 Steps to Better Log Analytics

 

Splunk

For security use cases, many teams use Splunk for log analytics. These teams may be looking to troubleshoot advanced persistent threats that linger in their network and slowly escalate privileges, often going undetected. Since these threats are long-term in nature, they require long-term data retention to investigate. Like Datadog, Splunk is better for real-time alerting and observability, and data retention costs can add up.

 

Exploring a serverless, stateless ELK alternative

Since many of the issues described above have to do with management complexity and retention costs, cloud-native companies may be better served with a true serverless architecture. While Elasticsearch has taken some steps toward rearchitecting its underlying infrastructure, many of the management complexity issues still remain.

The ChaosSearch Cloud Data Platform is the first serverless ELK alternative that solves the underlying architectural problems inherent in traditional systems, thereby enabling both massive scalability and extraordinary TCO savings.

ChaosSearch takes a fundamentally different approach to search and analytics and therefore represents a new generation of log management platforms.

Whereas ELK and the other traditional log management platforms are “closed” systems, in which data is transformed during the ingest process, and stored within an internal database with its own data format, ChaosSearch simply connects to and indexes data that is already stored by the customer, in the customer’s existing cloud data storage. With read-only access to this customer data in the cloud, ChaosSearch builds a separate index without manipulating or taking custody of the underlying original data.

Although that difference might sound simple, it makes all the difference in the world. On ingest, ChaosSearch introduces no bottlenecks — the data can stream directly into a customer’s cloud storage in its native format. And because it avoids the burden of “data custody”, ChaosSearch has no internal database size constraint. ChaosSearch simply leverages the performance, scale, and economics of the public cloud. This is the key that allows ChaosSearch to deliver unlimited scalability, industry-leading resiliency, and massive time and cost savings.

 

 

Cost Savings Explained

ChaosSearch delivers massive TCO savings in two primary ways. First, its simplified architecture dramatically reduces infrastructure requirements. With ChaosSearch, customers need only pay for their cloud object storage environment, and can eliminate all spending on the compute and block storage infrastructure associated with the ELK stack environment. Second, ChaosSearch is delivered as a managed service, with a single monthly fee based on the daily ingest rate. This SaaS approach reduces the customer operations personnel required to run the environment to a fraction of one full-time employee (FTE).

 

Cost Savings Quantified

The best way to understand the cost advantages of ChaosSearch is to conduct an apples-to-apples comparison, demonstrating the TCO of ChaosSearch vs. ELK for various customer scenarios.

 

 

The summary table below shows the 3-year savings that ChaosSearch delivers over the ELK stack for each scenario.

 

| | Scenario 1 | Scenario 2 | Scenario 3 |
| --- | --- | --- | --- |
| ELK Stack 3-Year TCO | $2,367,206 | $18,841,145 | $25,296,333 |
| ChaosSearch 3-Year TCO | $528,750 | $707,000 | $952,700 |
| ChaosSearch Cost Savings vs ELK | $1,264,000 | $1,700,000 | $2,292,000 |
| % Saved vs ELK | 64% | 63% | 63% |

 

Assessing Costs Switching from the ELK Stack

No business case is complete if it ignores switching costs.

As every IT veteran knows, IT systems are often “sticky”. Once an IT solution is deployed and used in production for daily operations, it can be very difficult to switch over to a new solution, even if the team assesses a new alternative to be superior.

Switching to a new solution can mean system unavailability, and any new technology has the potential to introduce new risks. In today’s cloud-enabled environment, IT systems are expected to be available 24/7, and disruptive “rip and replace” projects are simply not viable.

Thus, any consideration to switch from one platform to another must have a strong value proposition in which the case to switch is overwhelming. And, importantly, the team must be confident that the migration path from the legacy system to the new one is smooth, and can be done with minimal disruption.

ChaosSearch fits the bill, as it not only delivers massive TCO savings, but also makes the move from ELK to ChaosSearch seamless. Because ChaosSearch natively includes Kibana (the ELK stack visualization tool) and supports the Elasticsearch API, customers can move their operations over to ChaosSearch quickly and easily. And ChaosSearch’s SQL API means customers can plug in other visualization tools (e.g. Looker, Tableau) as well. In addition, customers can search in natural language, using generative AI.

Simply put, customers can maintain the same dashboards, visualizations and pre-staged queries that they use in their ELK environment, simply by exporting them from ELK and importing them into the ChaosSearch deployment.
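
As an illustration of what carrying a query over might look like, the sketch below builds a standard Elasticsearch query DSL body for recent error logs. The field names (`service`, `level`, `@timestamp`) and the function name are hypothetical, and which parts of the Elasticsearch API ChaosSearch accepts is not detailed in this post, so treat it as a starting point to test rather than a guaranteed-compatible request.

```python
import json
from datetime import datetime, timedelta, timezone


def build_error_query(service, hours, now=None):
    """Build a query DSL body for error logs from one service in a time window."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(hours=hours)
    return {
        "size": 50,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # filter clauses match exactly and do not affect scoring
                "filter": [
                    {"term": {"service": service}},  # hypothetical field
                    {"term": {"level": "error"}},    # hypothetical field
                    {"range": {"@timestamp": {
                        "gte": start.isoformat(),
                        "lt": now.isoformat(),
                    }}},
                ]
            }
        },
    }


body = build_error_query("checkout", hours=24)
print(json.dumps(body, indent=2))
```

The resulting body is what a client would send to an index's `_search` endpoint. The host and index name depend on your deployment.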

Customers typically deploy and run ChaosSearch in parallel to their existing ELK stack environment, allowing for an orderly process of testing and moving workloads to ChaosSearch over time. The duration of the transition is based on the number and variety of the workloads to be migrated. While some customers pursue rapid transitions, moving all workloads to ChaosSearch within a two-week period, the average transition time we see is 30-60 days. The cost of this migration period can be quantified – it is simply the cost of maintaining the existing system as adoption ramps up with ChaosSearch. These costs should be added to the business case, as part of the Year One costs of the ChaosSearch TCO. This can be considered the initial investment required that allows you to realize the overall TCO savings of the three-year period.
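
Following that description, the transition cost can be estimated by prorating the legacy system's annual cost over the overlap period. This is a sketch under the simplifying assumption that the ELK environment costs the same amount every day of the year. The function name and the example inputs (the small environment's year-one total and a 45-day overlap) are chosen for illustration.

```python
def transition_cost(annual_legacy_cost, overlap_days, days_in_year=365):
    """Cost of keeping the legacy system running during the overlap period."""
    return annual_legacy_cost * overlap_days / days_in_year


# Small environment year-one ELK cost, with a 45-day parallel run
cost = transition_cost(427_154, 45)
print(f"45-day overlap: ${cost:,.0f}")
```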

 

 

The Business Case for Switching from ELK to ChaosSearch

Unlike many IT projects, in which the benefits of the new solution can be difficult to quantify, the business case for ChaosSearch is built on clear-cut, easy-to-quantify deltas between the cost of ChaosSearch and the equivalent ELK stack environment. Calculating the difference in TCOs shows the total cost savings over the three-year period, and allows you to demonstrate the three-year rate of return. The chart below shows both the ROI and rate of return for the Medium Size customer scenario:

 

Medium Size Customer Scenario

| Item | Amount |
| --- | --- |
| 3-Year ELK Stack TCO | $18,800,000 |
| 3-Year ChaosSearch TCO | $6,900,000 |
| 45 Day Transition Cost | $498,000 |
| 3-Year ROI | $11,402,000 |
| 3-Year Rate of Return | 165% |
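
The figures in the chart can be reproduced with simple arithmetic. The post does not spell out its formulas, so the definitions below are assumptions chosen because they match the published numbers: ROI is taken as the ELK TCO minus the ChaosSearch TCO and the transition cost, and rate of return as ROI divided by the ChaosSearch TCO. The function name is illustrative.

```python
def business_case(elk_tco, new_tco, transition_cost):
    """Summarize a switch using the assumed definitions described above."""
    savings = elk_tco - new_tco
    roi = savings - transition_cost
    return {
        "savings": savings,
        "pct_saved": savings / elk_tco,   # share of the ELK TCO avoided
        "roi": roi,                       # savings net of the transition cost
        "rate_of_return": roi / new_tco,  # assumed denominator: new platform TCO
    }


case = business_case(18_800_000, 6_900_000, 498_000)
print(f"3-year ROI: ${case['roi']:,}")
print(f"3-year rate of return: {case['rate_of_return']:.0%}")
print(f"Saved vs ELK: {case['pct_saved']:.0%}")
```

Replacing the three inputs with your own TCO estimates and overlap cost gives the same summary for your environment.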

 

The paper and the accompanying TCO tool allow you to construct a similar business case for your specific scenario.

 

Search, Analyze, and Visualize Better

Customers managing medium-to-large sized ELK environments today face ongoing cost increases, while struggling to maintain uptime in an overly complex environment. Meanwhile, they must continually make tradeoffs that limit access to data for the groups that rely on log analytics in their daily operations. ChaosSearch provides the ideal replacement for the ELK stack, given that it delivers massive reductions in cost and complexity, solves the scalability problem, and enables a seamless transition. Quantifying the benefits of ChaosSearch allows you to build a rock-solid business case for making the switch.

If you’re managing an ELK environment today, or are involved in log analytics that rely on the underlying ELK stack, now is a good time to consider a change. The ChaosSearch team can assist you in using the TCO tool to develop a customized business case for your environment. We’d like to hear from you about your plans and any unique challenges you are facing.