Guest post by Kevin O’Rourke, Principal AWS Solutions Architect and Practice Leader, CTO and Co-Founder at JetSweep
A Traditional Approach to Information Architecture is No Longer Viable
Enterprise data warehousing has fundamentally shaped how we think about information management and analytics. For more than 30 years, organizations have invested in data integration and analytics platforms to achieve competitive advantage through quantifiable and strategic benefits. All analytics platforms must be performance-driven.
A traditional approach to information management has become problematic in terms of overall cost and, more importantly, because of its inherent inability to adapt to the proliferation of data, new and emerging technologies, and cloud-based integrated service platforms.
From a cost perspective, traditional data warehouses rely on relational databases (RDBMS) as their primary storage option. The initial capital investment for hardware and for software licensing of the databases, data integration, and analytics platforms is significant on its own. Over 70% of development costs stem from ETL, including the effort to consolidate, prepare, standardize, and transform data for downstream analytics.
Traditional data storage and analytic tools can no longer provide the agility and flexibility required to deliver relevant business insights and competitive advantage.
A study published by IDC Digital Universe in April 2016 shows the challenge of data coming from everywhere: data of every variety, volume, and velocity. Such highly diversified scenarios require a variety of integrated tools and services for storage, processing, and compute, tailored to a variety of data consumer use cases.
Further, and not unexpectedly, trends show a widening gap between the enterprise data available and the data managed within a data warehouse. Traditional architectures with an RDBMS-only mindset cannot accommodate more advanced forms of data processing. Many organizations are shifting to a data lake architecture.
So Why Do We Need Data Lakes?
A data lake is an architectural approach specifically designed to handle data of every variety, ingestion velocity, and storage volume. Conceptually, a data lake allows massive amounts of data to be stored in a central location so that it is readily available to be categorized, processed, analyzed, and consumed by diverse groups within an organization. Since data can be stored as-is, there is no need to convert it to a predefined schema, as is typically required in traditional RDBMS-driven architectures.
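To make the "stored as-is" idea concrete, here is a minimal sketch of applying a schema at read time rather than at load time. The record contents, field names, and the `read_with_schema` helper are hypothetical and only illustrate the concept; they do not represent any particular AWS service or API.

```python
import json

# Hypothetical raw records from different sources, kept exactly as received.
# No common schema is imposed when they are stored.
raw_lines = [
    '{"source": "web", "user": "u1", "page": "/home", "ts": "2016-04-01T10:00:00Z"}',
    '{"source": "sensor", "device_id": 17, "temp_c": 21.5}',
    '{"source": "web", "user": "u2", "page": "/pricing", "ts": "2016-04-01T10:05:00Z"}',
]


def read_with_schema(lines, source, fields):
    """Apply a schema at read time.

    Keeps only records from `source` and projects the requested fields.
    Fields missing from a record come back as None.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        if record.get("source") == source:
            rows.append({field: record.get(field) for field in fields})
    return rows


# Two consumers read the same raw data with different schemas.
page_views = read_with_schema(raw_lines, "web", ["user", "page"])
sensor_readings = read_with_schema(raw_lines, "sensor", ["device_id", "temp_c"])
```

Each consumer decides which fields matter at query time, so a new data source can be added without first redesigning a shared table structure.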
The metaphor of a data lake as a natural body of water, fed by many streams from anywhere, is very effective. As shown in Figure 1, data will typically need to be prepared for the widest variety of consumer usage patterns. Consumer usage patterns and the sourcing of the data itself directly influence how data is collected, stored, processed, moved, transformed, automated, and visualized. Data is the ultimate asset with boundless usage patterns, now being generated and consumed by humans, machines, devices, sensors, and applications.
Building a Data Lake on Amazon Web Services (AWS)
As described above, a data lake is designed to handle data of every variety, ingestion velocity, and volume. Data preparedness directly influences the choice of storage, data movement, infrastructure, and analytics services. Preparing data for its intended use requires a toolbox mentality in terms of integrated services, tools, and platforms. Figure 2 represents some of the service areas in play when designing solutions for data lake use cases.
The data lake is very much a disconnected storage concept, holding data at various levels of preparedness. Data is stored in a raw state initially, and some use cases will use raw data as is. More often, solutions require varying degrees of data preparedness, based on a collection of query usage profiles that correlate to our use cases. Depending on the solution, data may be refined and staged with the intent to promote modularity and reuse. The aim is not to over-process a data set, because it is intended for multiple purposes downstream, such as Amazon Redshift for relational analytics, Amazon Elasticsearch Service for text search, or an optimized distributed file system for low-cost active archive storage that can be queried with an MPP SQL engine.
In Part 2 of this series, we will talk about why Amazon S3 plays such an important role as a data lake for AWS customers, and the steps for better managing your data throughout its lifecycle.
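One way to picture the levels of data preparedness is as zones in a shared object key layout. The sketch below is only an illustration under stated assumptions: the zone names (`raw`, `refined`, `staged`), the date-partitioned key format, and the `object_key` and `promote` helpers are all hypothetical and are not prescribed by the article or by AWS.

```python
from datetime import date

# Hypothetical zones, ordered from least to most prepared.
ZONES = ("raw", "refined", "staged")


def object_key(zone, dataset, day, filename):
    """Build a date-partitioned storage key for an object in a given zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return (
        f"{zone}/{dataset}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


def promote(key, target_zone):
    """Return the key the same object would have in a more prepared zone."""
    source_zone, rest = key.split("/", 1)
    if ZONES.index(target_zone) <= ZONES.index(source_zone):
        raise ValueError("can only promote to a more prepared zone")
    return f"{target_zone}/{rest}"


# Raw data lands first; a refined copy keeps the same dataset and partition path.
raw_key = object_key("raw", "clickstream", date(2016, 4, 1), "events.json")
refined_key = promote(raw_key, "refined")
```

Keeping the dataset and partition path identical across zones means the raw copy stays available for use cases that want it as is, while refined and staged copies can feed several downstream consumers without being tailored to just one.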