Any data-driven organization will tell you that the holy grail is faster time to insights. But the unfortunate truth is that business users often have to wait days — even weeks or months — to analyze the data they need. Behind the scenes, data engineering teams put a lot of work into joining disparate datasets, creating pipelines, and delivering a final data product back to their stakeholders for analysis. Even then, the data may not quite deliver the answers they need, starting the cycle all over again.
What if there were a way to speed up this process and make it easier to leverage telemetry data for continuous improvement?
The practice of DataOps may hold the secret to accelerating data management and analysis, with the goal of improving products and services across the organization. In a product-led growth (PLG) organization, it’s critical to understand telemetry across your applications and infrastructure. While this telemetry is constantly changing, it can ultimately be used to improve the customer experience and drive business growth.
Let's dig in to answer the question "What is DataOps?" and explore how to apply it to make faster, data-driven decisions.
According to CIO Magazine, DataOps is a collaborative data management practice that applies an agile methodology to developing and delivering analytics. Typically, DataOps brings together DevOps teams with data engineers and data scientists to provide the tools, processes, and organizational structure needed to become a truly data-driven organization. When done well, DataOps can make it much easier to develop and maintain applications and drive improvements for customers.
DataOps and DevOps were born out of similar core principles, but they're not exactly the same. Just as DevOps was created to accelerate and improve software development and delivery pipelines, DataOps aims to speed up analytical pipelines and improve data quality. Some DevOps team members may find themselves part of a DataOps team, especially if you're leveraging application telemetry data, like log data, for continuous improvement.
According to McKinsey, like DevOps, DataOps requires a mindset shift that incorporates people, processes, and technology.
DataOps should enable teams to generate more value from their data, by automating some of the manual processes associated with data ingestion, processing, modeling, and delivering insights to the end user. The ultimate goal of DataOps is to accelerate time to value from data, empowering teams to access and react to data faster.
READ: Unlocking Data Literacy Part 1: How to Set Up a Data Analytics Practice That Works for Your People
Successful DevOps teams use continuous integration/continuous delivery (CI/CD) pipelines to communicate efficiently and ship products quickly. Logs and event data are generated at every stage of the application development lifecycle, and this log data can be transformed and analyzed in many ways, as the sketch below illustrates.
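For instance, here is a minimal Python sketch of that idea: raw pipeline log lines are parsed into structured records and then summarized. The field names (stage, level, duration_ms) are hypothetical and stand in for whatever your pipeline actually emits.

```python
import json
from collections import Counter

# Illustrative raw log lines, as a CI/CD pipeline might emit them.
# All field names here (stage, level, duration_ms) are hypothetical.
raw_lines = [
    '{"stage": "build", "level": "INFO", "duration_ms": 412}',
    '{"stage": "test", "level": "ERROR", "duration_ms": 98}',
    '{"stage": "deploy", "level": "INFO", "duration_ms": 1310}',
]

# Transform: parse each semi-structured line into a structured record.
records = [json.loads(line) for line in raw_lines]

# Analyze: count events per pipeline stage and pull out the errors.
events_per_stage = Counter(record["stage"] for record in records)
errors = [record for record in records if record["level"] == "ERROR"]

print(events_per_stage)
print(f"{len(errors)} error event(s) found")
```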
Ultimately, telemetry data can deliver deeper web and mobile application insights for product-led companies, including the product usage and customer experience metrics that drive business growth.
All of this sounds simple enough, so why would an organization need DataOps to begin with? In practice, ingesting telemetry data into traditional analytical databases is a challenge in and of itself. It's time-consuming and tough to manage, not to mention costly to retain log data for longer than 30 days. As a result, many teams trade off data retention against cost, and that tradeoff is the first challenge: without access to longer-term log datasets, teams may miss lingering application problems or even advanced persistent security threats.
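To make that retention tradeoff concrete, here is a hedged sketch, using boto3 against a hypothetical bucket and prefix, of the kind of lifecycle rule teams commonly apply to keep log storage costs in check:

```python
import boto3

# Hypothetical bucket and prefix; expire raw log objects after 30 days
# to keep storage costs predictable.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-logs-after-30-days",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
# The cost problem is solved, but any investigation that needs data older
# than 30 days (lingering bugs, slow-moving security threats) is out of luck.
```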
As a result, log data often remains locked within IT teams and data warehouses. Complex data engineering work must be done to access and analyze this data, including Extract, Transform, and Load (ETL) pipelines that make the source data fit a predefined structure or schema. Data engineers spend a lot of time defining these schemas so the data is consumable for business users via their analytical tools of choice. However, not all data fits these predefined use cases, and it can be tough to know in advance exactly what patterns users will want to discover in their data.
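As a simplified illustration (table, column, and field names are all hypothetical), an ETL step like the one below forces each raw log event into a fixed schema; anything that doesn't map to a column is dropped, which is exactly why unanticipated questions are hard to answer later:

```python
import json
import sqlite3

# A predefined warehouse schema (table and column names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts TEXT, service TEXT, status_code INTEGER, latency_ms REAL)"
)

def etl(raw_line: str) -> tuple:
    """Extract a raw JSON log line and transform it to fit the fixed schema.
    Fields with no matching column are silently dropped."""
    event = json.loads(raw_line)
    return (
        event.get("timestamp"),
        event.get("service"),
        int(event.get("status", 0)),
        float(event.get("latency_ms", 0.0)),
    )

raw = ('{"timestamp": "2023-05-01T12:00:00Z", "service": "checkout", '
       '"status": 500, "latency_ms": 231.4, "trace_id": "abc123"}')

# Load: the record now fits the schema, but trace_id is gone, so any future
# question that depends on it would require re-engineering the pipeline.
conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)", etl(raw))
print(conn.execute("SELECT service, status_code FROM events").fetchall())
```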
Organizations that use data lakes, on the other hand, run the risk of accumulating a “data swamp,” since you can store all kinds of data formats (structured and unstructured) without undergoing an ETL process upon ingestion. Even so, data lakes require data engineers or data scientists to sift through all of the multi-structured datasets, and they require integration with other systems or analytics APIs to support BI.
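By way of contrast, here is a minimal, hypothetical sketch of schema-on-read ingestion into a data lake: raw events land in object storage untouched, and all of the interpretation work is deferred to whoever reads them later. The bucket name and key are made up for illustration.

```python
import json
import boto3

# Hypothetical bucket and key; in a data lake, raw events land as-is with
# no ETL or schema enforcement at write time (schema-on-read).
s3 = boto3.client("s3")
event = {"source": "mobile-app", "type": "page_view", "payload": {"screen": "home"}}

s3.put_object(
    Bucket="example-data-lake",
    Key="raw/events/2023/05/01/event-001.json",
    Body=json.dumps(event).encode("utf-8"),
)
# Ingestion is trivially easy, which is exactly how a "data swamp" forms:
# every downstream reader must rediscover and interpret this structure
# before the data can support BI.
```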
READ: 5 Best Practices for Simplifying Data Management
DataOps is all about removing data silos and bottlenecks. So, if your business users are constantly stuck waiting on data engineering or are frustrated by their inability to interrogate data in new ways, DataOps may be for you.
Focusing on the technology aspect, many DataOps teams choose to implement a best-of-breed observability approach composed of open source tools and open APIs. The tools in this observability stack should incorporate automation to be scalable, prioritize security and governance, and leverage open standards for interoperability and future-proofing. Openness is important since new data architectures are always emerging, and components are continually changing.
When it comes to log data specifically, it's usually stored as a semi-structured text file, like CSV or, increasingly, JSON. Because these text files are flexible in terms of schema, and the exact questions users will ask of them aren't known in advance, a document store and text-based search are a perfect fit for log observability and investigation. Some traditional tooling, such as Apache Lucene and OpenSearch, can be expensive to scale. However, using open APIs, you can integrate tools you already use, like OpenSearch Dashboards (Kibana), with flexible and cost-effective analytics solutions like ChaosSearch.
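To illustrate why text-based search fits this use case so well, here is a sketch of a full-text query against a hypothetical OpenSearch index of JSON log documents (the endpoint, index, and field names are assumptions); no upfront schema work is needed to ask a brand-new question of the data:

```python
import requests

# Hypothetical local OpenSearch endpoint and index name.
OPENSEARCH_URL = "http://localhost:9200"
INDEX = "app-logs"

# A full-text query over semi-structured JSON log documents.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"message": "timeout"}},
                {"match": {"level": "ERROR"}},
            ]
        }
    },
    "size": 10,
}

response = requests.post(f"{OPENSEARCH_URL}/{INDEX}/_search", json=query, timeout=10)
for hit in response.json()["hits"]["hits"]:
    print(hit["_source"])
```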
With ChaosSearch, there is no data movement, transformation, or schema management required, reducing the management effort required for DataOps. The solution cleans, prepares, and virtually transforms data directly within low-cost cloud object storage, like Amazon S3. Users can search telemetry data directly within S3 via tools like Trino or Superset without any data movement or complex data pipelines, reducing time to insights. Overall, having the right technologies in place can reduce the effort and time spent on manipulating data, allowing your users to get the answers they need faster.
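The exact integration details depend on your environment, but as a hedged sketch, an analyst's query over S3-resident log data through a standard Trino client might look something like the following. The connection details, catalog, schema, and table names are invented for illustration and are not a specific product's API.

```python
import trino  # the standard Trino Python client

# Connection details, catalog, schema, and table are all hypothetical and
# depend on how the S3-resident log data has been exposed in your environment.
conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="logs",
    schema="telemetry",
)
cur = conn.cursor()
cur.execute(
    """
    SELECT service, count(*) AS error_count
    FROM requests
    WHERE status_code >= 500
    GROUP BY service
    ORDER BY error_count DESC
    LIMIT 10
    """
)
for service, error_count in cur.fetchall():
    print(service, error_count)
```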
Read the Brief: Scalable Log Analytics for CloudOps and DevOps teams
Listen to the Podcast: Trends and Emerging Technologies in Data Analytics
Check out the eBook: Beyond Observability: The Hidden Value of Log Analytics