The volume of data produced globally keeps growing, so it should come as no surprise that in the exceptional year of 2020 it reached a staggering 59 zettabytes, according to IDC estimates. To put that in familiar terms, if every human on earth continuously streamed their life in full HD throughout the year, we'd still fall about 4 zettabytes short of that value. Moreover, given the volume and scale of data generation and the observed trends, IDC predicts that by 2025 most of that data will come from the "unstructured data" segment: think tweets, posts, and audio/visual content rather than conventional spreadsheets and neatly organized database tables. Being able to store such data and derive timely insights from it is a differentiating advantage for enterprises competing in the age of Industry 4.0.
What is structured, semi-structured and unstructured data?
Since data comes in varying levels of structure, having a comprehensive and adaptive approach to account for such complexity is no doubt a cornerstone for a successful enterprise data management strategy. Let's first examine the different considerations in data structures.
If you think of data structures as a continuum, then at one extreme we have neatly structured data: data that has already been transformed and organized in a tabular format, and is often stored in a relational database. This makes it easy to query, extract insights from, and report on with minimal processing. A defined schema is the foundation of structured data; it makes the structure robust, but also rigid and hard to scale. The same structure is built into the way the data is processed and transformed prior to storage, which entails more upfront cleaning and transformation work, in line with the classical Extract, Transform, Load (ETL) philosophy.
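As a minimal sketch of what a schema enforced at write time looks like, the snippet below uses Python's built-in sqlite3 module. The orders table, its columns, and the load helper are hypothetical and chosen only for illustration; the point is that a row which violates the declared structure is rejected before it is ever stored.

```python
import sqlite3

# In-memory relational database; the "orders" table is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)


def load(row):
    """Try to insert one row; return True only if it satisfies the schema."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = load((1, "Acme", 120.50))  # conforms to the schema
rejected = load((2, None, 75.00))     # missing customer violates NOT NULL

# Once stored, the data is immediately queryable with plain SQL.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

The rigidity cuts both ways: querying is trivial, but any record that does not fit the predefined columns has to be fixed or dropped before loading.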
At the opposite end, unstructured data does not conform to a predefined model or structure. This data type encompasses a wide range of examples including logs, social media feeds, raw files, natural-language text, etc. The lack of pre-defined structures makes unstructured data flexible and easier to scale, but at the same time, harder to process and query to extract valuable and timely insights.
Semi-structured data sits in the middle of the continuum. This data type is often based on formats such as Extensible Markup Language (XML), JavaScript Object Notation (JSON), or Delimiter-Separated Values (DSV). Semi-structured data contains elements of both structured and unstructured data. It does not conform to a rigid predefined structure; instead, its structural elements are self-contained in the form of semantic tags and metadata, which makes it easier to process and organize than unstructured data.
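To illustrate how self-describing records can be organized without a schema defined in advance, here is a small sketch using Python's standard json module. The event records and the flatten helper are made up for the example; the column list is derived from the keys the records carry themselves.

```python
import json

# Hypothetical JSON events; each record carries its own field names (tags).
raw_events = [
    '{"user": "ana", "action": "login", "device": {"os": "iOS"}}',
    '{"user": "ben", "action": "purchase", "amount": 19.99}',
]


def flatten(record, prefix=""):
    """Turn nested keys into dotted column names, e.g. device.os."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


rows = [flatten(json.loads(line)) for line in raw_events]

# The columns are discovered from the data rather than declared up front.
columns = sorted({key for row in rows for key in row})

# Fields missing from a record become None in the tabular view.
table = [[row.get(col) for col in columns] for row in rows]
```

The two records do not share the same fields, yet both can still be placed in one table because the structure travels with the data.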
Figure 1 - Examples of structured, semi-structured and unstructured data
What advantages do data lakes offer compared to data warehouses?
Leveraging the power of big data, regardless of its structure, begins with solving the problem of efficient storage and retrieval. Traditionally, data warehouses have been the medium of choice, as they provide fully governed storage for structured tabular data. The concept of the data warehouse was first developed by Bill Inmon in the 1970s, but the architecture was developed and commercialized later, in the 1980s.
Data warehouses are essentially central repositories of information that are optimized for ease of querying and extracting insights for reporting. This makes warehouses a great storage solution for enterprises seeking to operationalize structured data in a timely fashion. However, data warehouses tend to be more costly, and they require data to undergo stringent transformation and structuring to comply with a predefined structure before it is stored. This leads to data wastage at two points: at the selection stage, where only salvageable and transformable data is kept at the expense of non-traditional data structures, and during the ETL process, where data is funneled down so that only the transformed results are stored.
Data lakes, on the other hand, emerged as a concept in 2010 to address the need for a more flexible storage medium that can host data across the various structural states. James Dixon, the founder of Pentaho, coined the term and defined it in contrast to the more structured data mart (often considered a subset of a data warehouse):
"If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
James Dixon, founder and former CTO of Pentaho
One thing to note here is that both the data warehouse and the data lake are analogies based on physical objects, and while they hold true in most cases, they tend to break down at a certain point. Just as you would use a physical warehouse to store finished goods rather than raw material, a data warehouse is meant to store report-ready, robustly structured data. In the real world, however, best practices in supply chain management put more emphasis on optimized, Just-In-Time driven production than on piling up finished goods that take up valuable and costly warehouse space.
The data lake analogy, for its part, is meant to reflect a native, structure-agnostic philosophy that has no stringent built-in requirements for processing or quality upon storage. This is where the analogy tends to break down: in reality, an efficient data lake is built with a purposeful semantic and governance structure that transcends the oversimplified just-in-case storage mindset, which would otherwise encourage data hoarding and result in data swamps rather than lakes.
The flexibility and versatility offered by a well-governed data lake as a unified repository for data storage manifest themselves in the following aspects:
- Flexible analytics: data is stored as-is, without any transformation or alteration, allowing for maximum flexibility in extracting insights. This shifts the common ETL paradigm to ELT (Extract, Load, Transform), where data is stored in its valuable raw state and conformity rules and structures are applied on demand, according to the needs of a business case.
Figure 2 - From ETL to ELT
- Mixed-structure data: able to store structured, semi-structured and unstructured data.
- Scalability and elasticity of resource allocation: As compared to conventional data systems, data lakes predominantly benefit from a decoupled storage and compute model which allows each to be independently scaled and optimized for costs - be it performance or capital.
- Different access levels: data analysts, data scientists, DBAs and operational users can all tap into the data stream and lake for analytics and insights.
- Security and governance: at storage, data can be secured through encryption, with regulated and authenticated access.
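The ELT idea from the first point above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the in-memory raw_zone list stands in for lake storage, and the payloads and the transform_sales view are hypothetical. The sketch only shows the ordering of the steps, with data loaded untouched and structure applied later for one specific business question.

```python
import json

# Hypothetical raw landing zone: payloads are kept exactly as they arrived.
raw_zone = []


def extract_and_load(payload):
    """E + L: store the payload untouched; no cleaning or schema is applied."""
    raw_zone.append(payload)


def transform_sales(raw):
    """T, on demand: shape only the sale events needed for one report."""
    rows = []
    for payload in raw:
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # not JSON (e.g. a free-text log line); skipped by this view
        if isinstance(event, dict) and event.get("type") == "sale":
            rows.append({"sku": event["sku"], "amount": float(event["amount"])})
    return rows


for payload in [
    '{"type": "sale", "sku": "A1", "amount": "19.90"}',
    "2020-11-02 10:15:01 WARN checkout latency high",
    '{"type": "click", "page": "/home"}',
    '{"type": "sale", "sku": "B7", "amount": 5}',
]:
    extract_and_load(payload)

# The raw data stays intact; a different business case could apply
# a different transformation to the same raw_zone later.
sales = transform_sales(raw_zone)
revenue = sum(row["amount"] for row in sales)
```

Under ETL, the log line and the click event would typically have been discarded before storage; here they remain available for future use cases.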
Consequently, whether you are deliberately storing all your enterprise data for immediate reporting and data science applications or are simply capturing all of it for future value extraction, data lakes should be a central consideration in your data management strategy. A well-managed data lake lets you consolidate otherwise disparate and ever-growing sources of data, with various levels of complexity, in a single flexible, cheap, secure, scalable, and accessible storage space.
And while a data lake is fundamentally different from a traditional enterprise data warehouse, it is not necessarily meant to replace it altogether. In fact, most organizational use cases would require having both in place. In that context, the use of enterprise data warehouses is often limited to Business Intelligence or high-intensity extraction of relational data. Various architectures have emerged in recent years to account for this duality, from simply having an ETL outlet to a data warehouse to a more governed approach that equips data lakes with data warehousing capabilities, known as a data lakehouse. Such architectures are essentially hybrid structures built on a data lake foundation.
Khalid Marbou | Sr. Digital Strategist - Infor OS Data Fabric