ChaosSearch Blog - Tips for Wrestling Your Data Chaos

Log and Event Analytics on Databricks - Everything You Need to Know

Written by David Bunting | May 30, 2024

 

Built on the foundation of Apache Spark, Databricks is a unified, open data lakehouse platform that empowers customers to efficiently and cost-effectively process, store, manage, and analyze large volumes of enterprise data.

The Databricks platform provides a unified interface and tools that primarily support data engineering, data science, and BI applications, but Databricks can also be used to support log and event analytics use cases like security operations, cloud observability, and user behavior analysis.

In this blog, we’re taking a closer look at Databricks platform features and capabilities, how log and event analytics on Databricks typically works today, and the challenges enterprises face when implementing log and event analytics on Databricks.

 

 

What is the Databricks Platform?

To understand how the Databricks platform can support log and event analytics use cases, we need to understand the features and capabilities of the Databricks platform. Let’s start with a short overview of Databricks architecture, concepts, and components.

 

Databricks Combines Data Lake and Data Warehouse Capabilities

As a data lakehouse solution, the Databricks platform combines data lake storage with data warehouse analytics. Databricks customers get the storage capabilities of a data lake (e.g. easy data ingestion, scalable storage, decoupled storage and compute, and support for unstructured and semi-structured data) and the analytical capabilities of a data warehouse (e.g. schema-on-write, ACID transactions, and fast SQL/relational querying) in a single unified platform.

Databricks integrates directly with cloud object storage (e.g. Amazon S3, Google Cloud Storage, or Azure Blob Storage), allowing customers to establish an open data lake for storing structured, unstructured, or semi-structured enterprise data.

Data that enters the customer’s cloud object storage is converted to Delta Tables and stored in Delta Lake. From there, Databricks customers can manage and catalog the data, configure ETL pipelines to transform and process it at scale, build data warehouses and data marts inside Delta Lake, and execute SQL/relational queries on their data.
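To make that flow concrete, here is a minimal PySpark sketch written as it might appear in a Databricks notebook (where the `spark` session is predefined); the bucket path and table names are hypothetical examples, not prescribed by Databricks.

```python
# Hypothetical example: land raw JSON logs from cloud object storage as a Delta table.
raw_df = spark.read.json("s3://example-landing-zone/app-logs/2024/05/")

# Write the data out in Delta format as a managed table in the lakehouse.
(raw_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("main.raw.app_logs"))

# The same table can then be queried with standard SQL.
spark.sql("SELECT count(*) AS row_count FROM main.raw.app_logs").show()
```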

 


Databricks data lakehouse platform combines data lake storage (easy data ingestion, scalable storage, decoupled storage and compute, and support for unstructured and semi-structured data) with data warehouse analytics (built-in scalable data processing, ACID transactions, and fast SQL/relational querying).

 

Databricks Lakehouse Platform Overview

Now let’s take a closer look at the Databricks Lakehouse Platform and the software technologies that power it.

Unlike other cloud data platforms such as Snowflake, Databricks does not require its customers to migrate their data into a proprietary storage system. Instead, Databricks integrates directly with the customer’s cloud account, deploys compute clusters powered by Apache Spark using the customer’s public cloud resources, and stores data in the customer’s cloud object storage.

 


Databricks lakehouse platform architecture and component technologies.

 

1. Cloud Data Lake

Databricks customers begin by creating an open data lake, ingesting structured, unstructured, and semi-structured data from a variety of sources into their cloud object storage. Customers can use data ingestion tools like StreamSets, Fivetran, Informatica, or Qlik to stream or batch-ingest data from databases, applications, and/or file storage systems into cloud object storage.

 

2. Delta Lake

Delta Lake is an optimized storage layer that unifies all data types for transactional, analytical, and AI use cases. Raw data that enters the data lake is automatically converted to Delta Tables and stored in the Bronze layer of Delta Lake. This storage format uses Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.

Databricks customers can use Delta Live Tables to build and automate data pipelines that clean, transform, and process raw data from various sources to prepare it for downstream analytics applications.

Delta Lake follows a 3-tier medallion architecture, where newly ingested data is filtered, cleaned, augmented, and transformed by data pipelines to generate clean, validated data for downstream users and applications. Data in the Bronze layer is raw (newly ingested), while data in the Silver layer has been cleaned and transformed, and the Gold layer consists of curated business-level tables.
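As an illustration of the medallion pattern, the following sketch promotes a hypothetical Bronze table of raw web logs to Silver and then rolls it up into a Gold table; all table and column names are invented for the example.

```python
from pyspark.sql import functions as F

# Hypothetical Bronze table of raw, newly ingested web logs.
bronze = spark.read.table("main.bronze.web_logs")

# Silver: drop malformed rows, normalize timestamps, keep only the fields analysts need.
silver = (bronze
    .filter(F.col("status_code").isNotNull())
    .withColumn("event_time", F.to_timestamp("timestamp"))
    .select("event_time", "user_id", "path", "status_code", "response_ms"))
silver.write.format("delta").mode("overwrite").saveAsTable("main.silver.web_logs")

# Gold: a curated, business-level aggregate for reporting.
gold = (silver
    .groupBy(F.window("event_time", "1 hour"), F.col("path"))
    .agg(F.count("*").alias("requests"), F.avg("response_ms").alias("avg_latency_ms")))
gold.write.format("delta").mode("overwrite").saveAsTable("main.gold.hourly_traffic")
```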

 


Inside Delta Lake, raw data is cleaned, validated, and transformed to produce curated business-level tables that are ready for downstream applications and users.

 

3. Unity Catalog

Unity Catalog is Databricks’ unified governance layer, providing security, compliance, and governance features across all data and AI workloads. With Unity Catalog, Databricks customers benefit from centralized access controls, the ability to track data lineage for auditing purposes, data discovery capabilities with Catalog Explorer, and seamless open data sharing via Delta Sharing.
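A hypothetical example of centralized access control is sketched below using Unity Catalog’s SQL GRANT statements; the catalog, schema, table, and group names are placeholders.

```python
# Hypothetical Unity Catalog grants; catalog, schema, table, and group names are placeholders.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data_analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.silver TO `data_analysts`")
spark.sql("GRANT SELECT ON TABLE main.silver.web_logs TO `data_analysts`")

# Review the privileges that have been granted on the table.
spark.sql("SHOW GRANTS ON TABLE main.silver.web_logs").show()
```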

 

4. Data Warehousing

Databricks SQL is a collection of services that provide data warehousing capabilities on top of the data lakehouse. Databricks provides an interactive query interface that supports ANSI SQL syntax, allowing customers to execute complex queries on data stored in Delta Lake. Databricks customers may build a data warehouse from the cleaned, validated data in the Silver layer of Delta Lake, while the Gold layer may contain one or more data marts with highly curated and refined data for specific applications and users.
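For instance, a warehouse-style analytical query against a curated Gold table might look like the sketch below; the table and column names are the hypothetical ones used earlier, and `display()` is the Databricks notebook helper for rendering results.

```python
# Hypothetical warehouse-style query against a curated Gold table.
result = spark.sql("""
    SELECT path,
           date_trunc('DAY', window.start) AS day,
           sum(requests)       AS total_requests,
           avg(avg_latency_ms) AS avg_latency_ms
    FROM main.gold.hourly_traffic
    GROUP BY path, date_trunc('DAY', window.start)
    ORDER BY total_requests DESC
    LIMIT 20
""")

display(result)  # display() renders a results table in Databricks notebooks
```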

 

5. Data Engineering

Databricks delivers a collection of data engineering capabilities that help customers design, build, and manage data processing infrastructure. The most prominent of these is Delta Live Tables, a framework for building data processing pipelines. With Delta Live Tables, customers define streaming tables and materialized views that should be kept up to date. Delta Live Tables can update these assets by automatically managing the flow of data, simplifying ETL workflows for data engineers.
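The snippet below is a rough sketch of what a small Delta Live Tables pipeline could look like in Python; the `dlt` module is only available inside a DLT pipeline, and the storage path, table names, and data quality expectation are illustrative assumptions rather than a prescribed implementation.

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical Delta Live Tables pipeline: a streaming Bronze table fed by Auto Loader,
# plus a Silver table that DLT keeps up to date automatically.
@dlt.table(comment="Raw application logs ingested from cloud object storage")
def bronze_app_logs():
    return (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://example-landing-zone/app-logs/"))

@dlt.table(comment="Cleaned application logs for downstream analytics")
@dlt.expect_or_drop("has_log_level", "level IS NOT NULL")
def silver_app_logs():
    return (dlt.read_stream("bronze_app_logs")
        .withColumn("event_time", F.to_timestamp("timestamp"))
        .select("event_time", "level", "service", "message"))
```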

 

6. Data Streaming

Databricks offers capabilities for near real-time data ingestion, processing, ML, and AI for streaming data. Auto Loader can automate the ingestion of streaming data from various sources, while Structured Streaming allows for near real-time data processing and enables SQL-like queries on live data streams. Delta Live Tables can also simplify the creation and management of ETL pipelines that incrementally ingest raw data to the Bronze layer of Delta Lake.
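As a rough illustration, the following snippet combines Auto Loader with Structured Streaming to incrementally append newly arrived JSON log files to a hypothetical Bronze Delta table; the S3 paths, checkpoint location, and table name are placeholders.

```python
# Hypothetical Auto Loader + Structured Streaming job that incrementally appends
# newly arrived JSON log files to a Bronze Delta table. Paths and names are placeholders.
stream = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://example-checkpoints/app-logs/schema/")
    .load("s3://example-landing-zone/app-logs/"))

(stream.writeStream
    .option("checkpointLocation", "s3://example-checkpoints/app-logs/bronze/")
    .trigger(availableNow=True)   # process everything currently available, then stop
    .toTable("main.bronze.app_logs"))
```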

 

7. Data Science and ML

Databricks Mosaic AI provides tooling that enables Databricks customers to build, deploy, and monitor Gen AI, large language models (LLMs), and ML solutions. Inside Databricks, customers can prepare data for model training, train and register ML models, host open-source LLMs, monitor deployed AI models in the production environment, and more. With Databricks ML capabilities, customers can train deep learning models or build recommendation or prediction engines using their data.
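A minimal sketch of that workflow, assuming a hypothetical Gold feature table and a simple scikit-learn model tracked with MLflow (which ships with Databricks), might look like this.

```python
import mlflow
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical example: train a simple churn model on a Gold feature table
# and track it with MLflow.
features = spark.read.table("main.gold.user_features").toPandas()
X_train, X_test, y_train, y_test = train_test_split(
    features.drop(columns=["churned"]), features["churned"], test_size=0.2)

with mlflow.start_run(run_name="churn_baseline"):
    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")
```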

 

Implementing Log and Event Analytics in Databricks

With Databricks, customers get a unified platform that enables data lake storage, security and governance, data engineering, and data warehousing, with built-in support for SQL querying as well as AI and ML workloads.

With this understanding, we can start to look at how these features and capabilities could be utilized along with other tools to enable three log and event analytics use cases: security operations and threat hunting, cloud observability, and user behavior insights.

 

Security Operations and Threat Hunting

Security operations and threat hunting are critical aspects of enterprise cybersecurity, underpinned by security log analysis capabilities. Security monitoring involves continuously analyzing security logs, including system activity and user/network access logs, to detect and alert on anomalous behavior. For effective threat hunting, security analysts need the ability to proactively search through log data for Indicators of Compromise (IoCs). These applications require capabilities that Databricks does not provide on its own, so customers will need additional software tools to support their cybersecurity needs.

 


Databricks Cybersecurity Solution

 

With the Databricks Cybersecurity Solution, customers can ingest security log data into an open data lake backed by cloud object storage. Next, customers can harness Databricks data engineering capabilities to pipeline the data into Delta Lake and apply a schema. Once the data has been prepared, it can be passed to a 3rd-party tool like Hunters, a SOC platform designed to ingest, index, investigate, and correlate security log data at scale.
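To give a flavor of what threat hunting against lakehouse data could look like before (or alongside) handing it off to a SOC platform, here is a hedged PySpark sketch that joins hypothetical network connection logs against a curated table of known-bad IP addresses; all table and column names are invented for the example.

```python
# Hypothetical threat-hunting query: flag connections to known-bad IP addresses (IoCs).
iocs = spark.read.table("main.security.ioc_ip_addresses")        # curated IoC feed
conn_logs = spark.read.table("main.silver.network_connections")  # ingested network logs

suspicious = (conn_logs
    .join(iocs, conn_logs["dest_ip"] == iocs["ip_address"], "inner")
    .select("event_time", "src_ip", "dest_ip", "bytes_sent"))

suspicious.write.format("delta").mode("overwrite").saveAsTable("main.security.ioc_matches")
```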

 

Cloud Observability

Cloud observability is the ability to gain insight into the status, health, behavior, and performance of cloud-based applications and services through monitoring, logging, and analysis. Databricks natively supports popular logging formats like JSON and plain-text logs.

Databricks customers can use data ingestion tools to collect log data from cloud applications and services and centralize it in Delta Lakes. Logs that have been ingested may be processed through Delta Lake’s medallion architecture and analyzed with Databricks SQL, but cloud engineers may also want to export logs to a 3rd-party observability platform (e.g. Datadog or Splunk) with additional features like real-time infrastructure/application monitoring, log management, metrics, alerting, and/or anomaly detection.
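For example, once application logs have landed in the Silver layer, an engineer might run an observability-style query like the following sketch (hypothetical table and column names) to spot services with elevated error counts.

```python
# Hypothetical observability query: error counts per service over the last 24 hours,
# run against application logs that have landed in the Silver layer.
errors = spark.sql("""
    SELECT service,
           count(*) AS error_count
    FROM main.silver.app_logs
    WHERE level = 'ERROR'
      AND event_time >= current_timestamp() - INTERVAL 24 HOURS
    GROUP BY service
    ORDER BY error_count DESC
""")

display(errors)
```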

 

User Behavior Insights

Enterprise DevOps teams monitor user behavior inside web and mobile applications to better understand their users and enable data-driven decision making with respect to new feature implementation, bug fixes, and remediating customer pain points or bottlenecks.

 


Databricks Composable Customer Data Platform (CDP).

 

In the CDP architecture shown above, customers use Snowplow to define custom events and gather log and event data from multiple sources (mobile and web apps, customer-facing cloud services, etc.), Databricks to integrate and correlate those user behavior logs into curated tables, and Hightouch to activate the data for marketing, sales, and customer success teams.

This architecture is designed to produce a “360” view of customer data by integrating and correlating data from CRM and advertising with user behavior data from the cloud. A solution focused on user behavior analytics might replace Hightouch with a 3rd-party tool like Splunk UBA, LogRhythm, or Exabeam.
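As a rough sketch of how Databricks might produce that curated view, the snippet below correlates hypothetical behavioral event and CRM tables into a Gold-level customer table; the names are illustrative and not part of the referenced architecture.

```python
from pyspark.sql import functions as F

# Hypothetical sketch of a "360" customer view: correlate behavioral events with CRM
# records into a curated Gold table that activation or UBA tools can consume.
events = spark.read.table("main.silver.behavioral_events")  # e.g. Snowplow events
crm = spark.read.table("main.silver.crm_contacts")

customer_360 = (events
    .groupBy("user_id")
    .agg(F.count("*").alias("event_count"),
         F.max("event_time").alias("last_seen"))
    .join(crm, "user_id", "left"))

customer_360.write.format("delta").mode("overwrite").saveAsTable("main.gold.customer_360")
```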

 

 

The Key Challenge of Log and Event Analytics on Databricks

Databricks is best known for its data engineering, processing, and data science capabilities that make it easy for enterprise data teams to ingest, process, and store large amounts of data before operationalizing it in AI/ML applications or building data warehouses to enable SQL querying.

All of these capabilities can be leveraged inside the customer’s cloud object storage without moving the data into a proprietary storage platform, which would result in high data egress fees, high storage costs, and undesirable data retention trade-offs at scale.

But when it comes to using Databricks for log and event analytics, customers who want to leverage the deeper capabilities of specialized third-party tools will often need to ship data outside of their cloud object storage and into those platforms. This can result in high data egress and storage costs, especially for organizations that generate large volumes of log data.

Enterprises may attempt to reduce short-term data storage costs by restricting the data retention window in third-party tools, but this also reduces the viability of long-term log analytics use cases like application performance trend analysis or security incident root cause investigation.

 

Power Log Analytics Use Cases with ChaosSearch on Databricks

ChaosSearch can now be deployed in Databricks to deliver our simple, powerful, and cost-effective log analytics capabilities on the Databricks platform.