Performance Comparison Series Part 1: Elastic vs. Redshift vs. CHAOSSEARCH
Along with cost and feature set, performance is a key decision metric for prospective users choosing between log analytics solutions. Those who have used Elasticsearch want to keep its snappy response times for text search requests. Others coming from relational databases or data warehouses want to know how relational-style queries perform in a non-relational system. We’ll get into those comparisons in later articles. For now, we will focus on something important to everyone: ingestion speed and size on disk. Ingestion speed determines how long it takes to migrate to a new solution, as well as how much that solution will have to scale for incoming data. Size on disk translates directly to cost and retention period.
Our focus from the start has been building a high-performance distributed architecture that can take advantage of modern cloud computing design patterns, most notably scaling in/out with different-sized workloads. However, our underlying data fabric, built around a unique data representation called Data Edge, is not a brute-force solution – we don’t just throw horsepower behind it. Data Edge was designed from the ground up to be fast and compact, not just for one use case, but for several. In order to maintain this performance through product iterations, we routinely benchmark against other technologies.
Our goal is to be:
- As fast as an RDBMS for data ingestion
- On par with an RDBMS for relational operations (order by, join, group by)
- As fast as Elasticsearch for text search
- As small on disk as a column store
- Lower cost of ownership than everything else
Of course, these are just general guidelines. Data formats, underlying hardware, software versions, and lots of other considerations come into play. In this series of blog posts, we’ll look at different aspects of performance in comparison to other technology. We’ll be as transparent as possible and show how each test was run. We will also note when a comparison is not apples-to-apples, as sometimes this isn’t possible or is technically difficult to achieve. This first post will focus on data ingestion and data size.
Data ingestion is the process of receiving information and organizing it. Typically in a storage system or analytics platform, this involves indexing the data and writing it out in a special format. In our case this is the Data Edge format. Data Edge’s unique properties make it perfect for both relational queries and text search. It’s also very small; Data Edge can represent an original object at about the size of a compressed version of the same object. For many common data sources, we often see a 5x reduction in size. This is in contrast to most other solutions on the market, which typically involve a larger tradeoff in time, space, and cost.
Here we will compare both the ingestion rate and size on disk to Amazon Redshift and Elasticsearch, two databases built for very different use cases.
Amazon Redshift
Amazon Redshift is AWS’s high-performance data warehouse offering. It is typically used for online analytical processing (OLAP). Organizations that need to run large, complex queries will often pull historical data into a Redshift cluster for analysis rather than attempt such resource-intensive operations on a production online transaction processing (OLTP) database such as Aurora. Redshift’s column-oriented architecture makes it ideal for these types of analytics. It also allows a high rate of compression, because like data (per column) is stored in the same blocks. Thus, a common use case with Redshift is a large data ingestion followed by some intensive analytic queries. As a user of Redshift, you need to be aware of what types of queries you are going to run: the system lets you specify different types of keys (distribution and sort keys) that affect how the data is laid out. This is unlike a row store such as MySQL, where you can simply add secondary indexes and generally see increased query performance (at a cost to ingestion and data size).
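For example, a hypothetical table definition with a distribution key and a sort key might look like the sketch below (the table and column names here are illustrative and are not the schema used later in our test):

CREATE TABLE requests (
    request_time TIMESTAMP,
    elb_name     VARCHAR(100),
    status_code  VARCHAR(3)
)
DISTKEY (elb_name)       -- rows with the same ELB name are placed on the same node slice
SORTKEY (request_time);  -- blocks are ordered by time, which speeds up time-range scans

Picking these keys to match your expected queries is part of table design in Redshift.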
Elasticsearch
Elasticsearch was built for fast, full text search. It is used to implement search engines, and more recently, log analysis platforms (although this was not its original intended purpose). It uses what’s called an Inverted Index to represent documents. This inverted index makes text search very fast. The tradeoff is with data size and ingestion — Elasticsearch is known for slow ingestion rates and a very large data footprint. Fully indexing a document can sometimes result in a storage representation that is five times the size of the original.
The Data Set
When benchmarking we use data that is as realistic as possible for a large percentage of organizations. One data source that many customers are interested in is Elastic Load Balancer (ELB) logs. ELB is Amazon’s fully managed load balancing service, and its logs contain information about requests and responses sent to and from a website or web application. We’ve created a script that lets us generate large amounts of fake ELB logs for testing purposes. For this test we created 35 files of ELB logs in CSV format, each containing 1 million rows, which equates to about 10GB of data. While Data Edge can recognize these files in their native log format, CSV is typically processed faster across different storage systems, which makes for a better comparison.
The data fields look like this:
datetime ISO 8601 / name of elb / client:port / backend:port / req_proc_time / back_proc_time / resp_proc_time / elb_code / back_code / rec_bytes / res_bytes / request / agent / cipher / protocol
And a sample row looks like this:
2015-05-13T23:39:43.945958Z,my-loadbalancer,192.168.131.39:2817,10.0.0.1:80,0.000086,0.001048,0.001337,200,200,0,57,"GET https://www.example.com:443/ HTTP/1.1","curl/7.38.0",DHE-RSA-AES128-SHA,TLSv1.2
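Our generator script isn’t shown here, but a toy version of the idea can be sketched in a few lines of shell. This is purely illustrative: it just repeats the sample row above, whereas the real generator randomizes timestamps, IPs, latencies, and status codes:

# Hypothetical sketch, not our actual generator: write 35 files of 1 million identical rows
row='2015-05-13T23:39:43.945958Z,my-loadbalancer,192.168.131.39:2817,10.0.0.1:80,0.000086,0.001048,0.001337,200,200,0,57,"GET https://www.example.com:443/ HTTP/1.1","curl/7.38.0",DHE-RSA-AES128-SHA,TLSv1.2'
for f in $(seq -w 0 34); do
  yes "$row" | head -n 1000000 > "elb_$f.log"
done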
Amazon Redshift Test
We provisioned a single node DC2.Large cluster for testing. The DC stands for “Dense Compute”, which are high performance clusters running on SSDs. The alternative DS clusters (“Dense Storage”) are geared toward making working with very large data sets more cost effective by using HDDs. Our DC2 has 2 virtual CPUs, 15GB of memory, and 160GB of storage.
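For anyone scripting the setup, a roughly equivalent AWS CLI call is sketched below (the cluster identifier, user name, and password are placeholders):

aws redshift create-cluster \
  --cluster-identifier elb-benchmark \
  --node-type dc2.large \
  --cluster-type single-node \
  --master-username admin \
  --master-user-password 'ChangeMe123' \
  --db-name dev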
We then create a schema for our elb_logs table that looks like this:
CREATE TABLE elb_logs (
    RequestTime DateTime encode lzo,
    ELBName varchar(100) encode lzo,
    RequestIP_Port varchar(22) encode lzo,
    BackendIP_Port varchar(22) encode lzo,
    RequestProcessingTime FLOAT encode bytedict,
    BackendProcessingTime FLOAT encode bytedict,
    ClientResponseTime FLOAT encode bytedict,
    ELBResponseCode varchar(3) encode lzo,
    BackendResponseCode varchar(3) encode lzo,
    ReceivedBytes BIGINT encode lzo,
    SentBytes BIGINT encode lzo,
    HttpRequest varchar(5083) encode lzo,
    UserAgent varchar(500) encode lzo,
    SSL_Cipher varchar(40) encode lzo,
    SSL_Protocol varchar(40) encode lzo
);
Now for the ETL. Redshift has a great COPY command that allows you to copy certain types of data directly from S3. This transfer is free assuming your cluster and S3 bucket are in the same region. You still need to define schemas ahead of time, but the load can be done like this:
copy elb_logs from 's3://example-bucket' CREDENTIALS 'aws_access_key_id=key_id;aws_secret_access_key=secret_key' CSV TIMEFORMAT 'auto';
The command tells Redshift to copy all the data from example-bucket into the table called elb_logs. You can specify credentials that are allowed access to the bucket inline as I’ve written here, or you can create an IAM role and assign it to the cluster. The general best practice is to create the role and avoid inline credentials wherever possible. We also tell Redshift the data format (CSV) and ask it to automatically infer the datetime format (TIMEFORMAT 'auto').
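With a role attached to the cluster, the same load would look something like this (the role ARN is a placeholder):

copy elb_logs from 's3://example-bucket'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
CSV TIMEFORMAT 'auto';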
*In our Redshift setup, this load took 5 minutes and 13 seconds.*
In order for Redshift to run queries as efficiently as it can, you need to run both the VACUUM and ANALYZE commands after a data load is complete. This lets the system reclaim unused space, re-sort the data optimally, and update the statistics used by the query planner:
VACUUM elb_logs; ANALYZE elb_logs;
*This took about 35 seconds. Combined with our load, total ingestion was about 5 minutes and 48 seconds.*
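As a sanity check, the load time can also be read back from the cluster itself. One way is to look at the stl_query system table (an illustrative query, assuming the COPY above is the most recent statement matching this text):

select datediff(second, starttime, endtime) as load_seconds, querytxt
from stl_query
where querytxt ilike 'copy elb_logs%'
order by starttime desc
limit 1;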
We are able to get the final data size with this command:
select sum(capacity)/1024 as capacity_gbytes,
       sum(used)/1024 as used_gbytes,
       (sum(capacity) - sum(used))/1024 as free_gbytes
from stv_partitions
where part_begin = 0;
*Which gives us a little more than 3GB.*
Elasticsearch Test
For testing Elasticsearch, we pulled the latest open source package (version 6.2 at the time of this writing) down onto a T2.xlarge EC2 instance. This instance is a general purpose machine on multitenant hardware with 16GB of memory and 4 virtual CPUs. It also has 100GB of EBS provisioned IOPS (400) attached storage. The default Elasticsearch configuration was used.
In order to ETL the data into ES, which natively supports JSON documents, we used the latest versions of logstash (6.2.4) and FileBeat (6.2.4), both installed and running on the same machine.
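The tarballs can be pulled straight from Elastic’s artifact server. The URLs below follow the usual 6.2.4 naming and are worth double-checking against the official download pages before use:

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.4.tar.gz
wget https://artifacts.elastic.co/downloads/logstash/logstash-6.2.4.tar.gz
wget https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-6.2.4-linux-x86_64.tar.gz
tar xzf elasticsearch-6.2.4.tar.gz; tar xzf logstash-6.2.4.tar.gz; tar xzf filebeat-6.2.4-linux-x86_64.tar.gz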
First we pull the data down from s3 into a testing directory with the following command:
aws s3 sync s3://example-bucket testing/
Next we start Elasticsearch with the default configuration, and logstash with the following pipeline.conf file:
input {
  beats {
    port => "5044"
  }
}
filter {
  csv {
    separator => ","
    columns => ["timestamp","elb_name","elb_client_ip_port","elb_backend_ip_port",
                "request_processing_time","backend_processing_time","response_processing_time",
                "elb_status_code","backend_status_code","elb_received_bytes","elb_sent_bytes",
                "elb_request","userAgent","elb_sslcipher","elb_sslprotocol"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
Both are started by running the following command:
./elasticsearch-6.2.4/bin/elasticsearch &
./logstash-6.2.4/bin/logstash -f ./logstash-6.2.4/pipeline.conf --config.reload.automatic &
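With both processes running, a quick check confirms that Elasticsearch is up and listening on its default HTTP port:

curl 'localhost:9200/_cluster/health?pretty'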
Now all we need to do is start FileBeat, which will send the raw logs to logstash. First we change the filebeat.yml file to look like this:
logging.level: warning
filebeat.prospectors:
  - type: log
    paths:
      - testing/*.log
output.logstash:
  hosts: ["localhost:5044"]
Note that as far as FileBeat is concerned, a CSV file is just another log file. We just need to tell it which files to look for and how to talk to logstash. Then we can start it up:
./filebeat -e -c filebeat.yml
FileBeat will now send the data in the specified files to logstash, which will parse the data into JSON and bulk load it into Elasticsearch.
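Once the load completes, the document count and on-disk footprint can be read back through the cat indices API (by default logstash writes to time-based logstash-* indices):

curl 'localhost:9200/_cat/indices/logstash-*?v&h=index,docs.count,store.size'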
*With this setup, ingestion took 1 hour and 15 minutes and produced a data footprint of 22.1GB.*
CHAOSSEARCH / Data Edge Test
Our last test, with Data Edge, will use the same T2.xlarge as the Elasticsearch test, with the same set of files in the testing directory. We will bypass the web UI and run this test directly with our APIs locally. With Data Edge, there is no need to predefine a schema, parse, or ETL the data. We just need to point our tool at the set of files to index and it will figure out the details automatically:
./chaossumo model --pattern testing/*.log -o elb-group -m outdata/
This tells the tool to model everything in the testing directory with the file extension .log (our 35 ELB CSV files), name the resulting object group ‘elb-group’, and write the output index files to the outdata directory. This is very similar to how our web application works, except it is using local block storage rather than S3 storage.
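Because the index files are written to local block storage in this test, the output size can be checked directly:

du -sh outdata/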
*Here Data Edge had an ingestion time of 13 minutes and 30 seconds and a total data output size of 1.8GB.*
Summary:
- Amazon Redshift: total ingestion of about 5 minutes and 48 seconds (including VACUUM/ANALYZE), roughly 3GB on disk
- Elasticsearch: ingestion of 1 hour and 15 minutes, 22.1GB on disk
- CHAOSSEARCH / Data Edge: ingestion of 13 minutes and 30 seconds, 1.8GB on disk
Other Considerations:
Keep in mind that each of these systems was created for a different purpose. For example, although the Elasticsearch ingestion takes much longer than Redshift’s, it is fully indexing the text fields for full text search capabilities, which Redshift does not have (and was not designed for). In future articles we will explore different relational and search functionality on each of these systems with the same setup.
Note that the Redshift load is slightly different in that it fetches the data remotely (although within the same region, and probably the same availability zone). We’ve tested a remote S3 get of these files from an EC2 machine in the same region, and it is significantly faster than the Redshift load, meaning this probably isn’t a bottleneck in the test, but it is something to be aware of.