Organizations are leveraging log analytics in the cloud for a variety of use cases, including application performance monitoring, troubleshooting cloud services, user behavior analysis, security operations and threat hunting, forensic network investigation, and supporting regulatory compliance initiatives.
But with enterprise data growing at astronomical rates, organizations are finding it increasingly costly, complex, and time-consuming to capture, securely store, and efficiently analyze their log data.
In a recent survey of 200 IT professionals, data engineers, and data scientists at enterprises with 1000+ employees, more than 20% of respondents said their organizations were pulling data from 1,000+ different sources to power their BI and data analytics initiatives (Matillion).
That’s a huge amount of log data that needs to be collected, stored, and indexed before it can be analyzed.
As organizations extend their presence in the cloud and generate an increasing amount of data (e.g., event and security logs) each day, data lakes are once again being considered by CIOs as an attractive option for data management, storage, and analytics at scale. Data lake solutions can be architected to support cost-effective data storage at scale and expanded data access within organizations (also known as data democratization), leading to increased data utilization, insight generation, and value creation.
In this blog post, we’re taking a closer look at three types of data lake architecture powering the future of enterprise log analytics: the template-based data lake, the data lakehouse, and the cloud data platform.
Let’s dive in!
It was around 2010 when the term Data Lake was first coined by James Dixon, then-CTO of Pentaho Corporation.
At that time, organizations involved with big data analytics were using data warehouses for large-scale storage of processed data.
Data marts were also deployed, enabling individual business units to access warehoused data pertaining to their department.
From Dixon’s perspective, data marts were preventing organizations from reaching their full potential for big data utilization.
Data and information were siloed because each department could only access data in its own data mart, while other areas of the data warehouse remained opaque and inaccessible.
Data marts were also accused of stifling innovation because they presented users with structured data derived from raw data - but not the raw data itself.
A lack of access to raw data, Dixon believed, prevented users from transforming the data in alternative ways to extract new insights or develop new use cases.
Dixon’s concept of a data lake is based on the idea of storing data in its raw form and broadening data access to break down silos and accelerate innovation. Here are Dixon’s own words describing how this concept would differ from a data mart:
“If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” (James Dixon)
Today, a data lake is defined as a data storage repository that centralizes, organizes, and protects large amounts of structured, semi-structured, and unstructured data from multiple sources. Unlike data warehouses that follow a schema-on-write approach (data is structured as it enters the warehouse), data lakes follow a schema-on-read approach where data can be structured at query-time based on user needs.
As a result, organizations using a data lake have more flexibility to analyze their data in new ways, develop meaningful insights, and uncover valuable new use cases for enterprise data.
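To make the schema-on-read idea concrete, here's a minimal Python sketch. The log format and field names are hypothetical, but the pattern is the point: the raw records stay untouched in storage, and each consumer applies only the structure their query needs at read time.

```python
import json

# Raw, unstructured log lines as they might land in a data lake.
# The field names here are hypothetical examples.
raw_log_lines = [
    '{"ts": "2023-04-01T12:00:00Z", "level": "ERROR", "service": "checkout", "msg": "payment timeout"}',
    '{"ts": "2023-04-01T12:00:05Z", "level": "INFO", "service": "search", "msg": "query ok", "latency_ms": 42}',
]

def read_with_schema(lines, fields):
    """Schema-on-read: project only the fields a given query needs,
    at query time, without rewriting the stored raw data."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# One consumer reads the same raw data as an error feed...
errors = [r for r in read_with_schema(raw_log_lines, ["ts", "level", "msg"])
          if r["level"] == "ERROR"]

# ...while another reads it as a latency dataset.
latencies = [r for r in read_with_schema(raw_log_lines, ["service", "latency_ms"])
             if r["latency_ms"] is not None]

print(errors)
print(latencies)
```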
Data lakes can be designed and architected in different ways. By integrating with existing enterprise software tools, they deliver four core functions that help companies manage log files at scale: ingesting log data from many sources, storing it cost-effectively, indexing it for querying, and feeding it into analytics and visualization tools.
Based on these four functions, we can identify the key components of data lake architecture.
In addition to these essential components, a data lake solution should also provide mechanisms for implementing data governance policies, such as metadata management, data lineage tracking, and role-based access controls.
How do different data lake solutions incorporate key architectural components to deliver on these critical functionalities?
Most vendors have adopted one of three main approaches to data lake architecture.
In 2019, AWS released a new solution known as “Data Lake on AWS”.
This solution uses a template-based approach that automatically configures existing AWS services to support data lake functionality, such as tagging, sharing, transforming, accessing, and governing data in a centralized repository.
The template approach championed by AWS cuts down on manual configuration, allowing users to set up their data lake in as little as 30 minutes.
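Under the hood, solutions like this are typically deployed by launching a CloudFormation stack from the published template. The sketch below shows what that launch might look like with boto3; the template URL and parameter names are placeholders rather than the actual values documented by AWS.

```python
import boto3

cloudformation = boto3.client("cloudformation")

# Placeholder template URL and parameters -- the real Data Lake on AWS
# template and its parameter names are documented by AWS.
stack = cloudformation.create_stack(
    StackName="my-data-lake",
    TemplateURL="https://example.s3.amazonaws.com/data-lake-template.yaml",
    Parameters=[
        {"ParameterKey": "AdministratorEmail", "ParameterValue": "admin@example.com"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # the stack creates IAM roles for the data lake services
)
print("Stack creation started:", stack["StackId"])
```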
Image Source: Amazon Web Services
In this architecture, users access the data lake through a console secured by Amazon Cognito (a user authentication service). Data is ingested via services like Amazon CloudWatch, which capture log and event data from across the cloud environment. The raw data is stored in Amazon S3, while metadata is managed in DynamoDB. Data is cataloged with AWS Glue and can be searched with Amazon OpenSearch Service or analyzed with Amazon Athena.
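To give a rough feel for how a couple of these pieces fit together, here is a minimal boto3 sketch that lands a raw log object in S3 and kicks off an ad hoc Athena query against a cataloged table. The bucket, database, and table names are placeholders; in the actual solution, ingestion, cataloging, and access control are wired up by the template rather than hand-written code.

```python
import boto3

# Placeholder names -- in a real deployment these would be created by the
# Data Lake on AWS template and cataloged by AWS Glue.
BUCKET = "my-data-lake-raw"
DATABASE = "logs_db"
OUTPUT = "s3://my-data-lake-query-results/"

s3 = boto3.client("s3")
athena = boto3.client("athena")

# 1. Land a raw log record in the S3 storage layer.
s3.put_object(
    Bucket=BUCKET,
    Key="raw/app/2023/04/01/events.json",
    Body=b'{"ts": "2023-04-01T12:00:00Z", "level": "ERROR", "msg": "payment timeout"}\n',
)

# 2. Run an ad hoc, schema-on-read query with Athena against the cataloged table.
query = athena.start_query_execution(
    QueryString="SELECT level, count(*) AS events FROM app_logs GROUP BY level",
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT},
)
print("Athena query started:", query["QueryExecutionId"])
```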
A template-based approach can make data lakes easier to configure, but complexity and IT management overhead are significant issues with data lake architectures that have so many moving parts. Enterprise data engineers may choose to start with a template-based approach, then later improve or optimize the AWS data lake by introducing components and services from 3rd-party vendors.
Read: 10 AWS Data Lake Best Practices
A second approach to enterprise data architecture involves combining the features of a data warehouse and a data lake into a hybrid architecture that’s been termed a “Data LakeHouse.”
Platforms like Databricks and Snowflake use this type of architecture, as do some data warehousing services like Google BigQuery and AWS Redshift Spectrum.
Image Source: Medium
In the LakeHouse architecture shown above, you’ll notice many of the same components we’ve already mentioned, from raw data storage and metadata cataloging to engines for transformation, querying, and analytics.
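As a rough sketch of the pattern, the PySpark snippet below reads raw JSON logs from object storage, curates them into a columnar table, and queries that table with SQL, which is the typical flow a lakehouse formalizes. Paths and table names are hypothetical, the cluster is assumed to already have S3 access configured, and a production lakehouse would normally add a transactional table format such as Delta Lake or Iceberg on top.

```python
from pyspark.sql import SparkSession

# Hypothetical paths and table names for illustration only.
spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# 1. Read raw JSON event logs straight from cloud object storage.
raw_logs = spark.read.json("s3a://my-data-lake-raw/app-logs/")

# 2. Curate them into a columnar table that BI and SQL users can query.
raw_logs.write.mode("overwrite").parquet("s3a://my-data-lake-curated/app_logs/")

# 3. Query the curated table with plain SQL, warehouse-style.
spark.read.parquet("s3a://my-data-lake-curated/app_logs/").createOrReplaceTempView("app_logs")
spark.sql("SELECT level, count(*) AS events FROM app_logs GROUP BY level").show()
```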
The lakehouse approach has its benefits, but it also introduces a high level of complexity that can result in poor data quality and performance degradation.
High complexity also makes it challenging for non-IT users to utilize data, ultimately preventing organizations from reaching the promised land of data democratization.
Watch: The Rise of Data Mesh and Data Fabric Architectures
The third approach - and also our favorite - is what we’d call a cloud data platform architecture.
In this set-up, a self-service data lake engine sits on top of a cloud-based data repository, delivering capabilities like data indexing, transformation, analytics, and visualization that help organizations efficiently manage and analyze their data at scale.
Image Source: ChaosSearch
In the data lake architecture reimagined here, data is produced by applications (either on-prem or in the cloud) and streamed into cloud object storage (e.g. Amazon S3 or Google Cloud Storage) using cloud services like Amazon CloudWatch or an open source log aggregation tool like Logstash.
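As a simplified stand-in for that ingestion step, here's a short Python sketch that batches application log records, compresses them, and writes them to S3 as time-keyed objects, the same basic pattern an agent like Logstash or a CloudWatch export follows. The bucket name and key prefix are hypothetical.

```python
import gzip
import json
import time
import boto3

# Hypothetical bucket and key prefix for raw log objects.
BUCKET = "my-log-lake-raw"
PREFIX = "app-logs"

s3 = boto3.client("s3")

def ship_batch(log_records):
    """Compress a batch of JSON log records and write them to object storage,
    keyed by date and timestamp so downstream indexing can find new objects."""
    body = gzip.compress("\n".join(json.dumps(r) for r in log_records).encode("utf-8"))
    key = f"{PREFIX}/{time.strftime('%Y/%m/%d')}/{int(time.time())}.json.gz"
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    return key

# Example: a small batch of application events.
batch = [
    {"ts": time.time(), "level": "INFO", "service": "api", "msg": "request handled"},
    {"ts": time.time(), "level": "ERROR", "service": "api", "msg": "upstream timeout"},
]
print("wrote", ship_batch(batch))
```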
ChaosSearch runs as a managed service in the cloud, allowing organizations to index, transform, query, and visualize the log data already stored in their own cloud object storage.
ChaosSearch delivers a simplified approach to data lake architecture that unlocks the full potential of Amazon S3 as a large-scale storage repository for enterprise log data.
Read: The New World of Data Lakes, Data Warehouses and Cloud Data Platforms
As enterprise log data continues to grow, organizations will need to start future-proofing their log analytics initiatives with data storage solutions that enable log data indexing and analysis at scale.
Data lakes are a natural fit here – they can ingest large volumes of event logs, ramp up log data storage with limitless capacity, index log data for querying, and feed log data into visualization tools to drive insights.
The example architectures we featured here all deliver on these core capabilities – but only cloud data platforms offer a fully optimized architecture that reduces management complexity and minimizes technical overhead.
If you’re on your way to producing more event logs than you can analyze, or if you’re already there, it’s time to think about an enterprise data solution that delivers hassle-free and cost-effective performance at scale.
For more insights on data lake architecture and the future of log analytics, view our free recorded webinar Advanced Analytics - Data Architecture Best Practices for Advanced Analytics.