The Confusing World of Data Storage: A Primer

Stephen DeAngelis

January 20, 2023

We live in the Digital Age and business consultants are encouraging organizations to transform into digital enterprises. The foundation of this transformation is data, which has become an essential business asset. According to Yossi Sheffi, Director of the MIT Center for Transportation & Logistics, data is an organization’s most valuable asset. He explains, “The well-worn adage that a company’s most valuable asset is its people needs an update. Today, it’s not people but data that tops the asset value list for companies.”[1] Data is being generated at an unprecedented rate; however, to be useful it must be stored and processed. Data expert Yash Mehta notes, “Every data-first company strives to or is already in the process of adopting a self-service business intelligence model. A lot of these companies are still not in a position to make their data fully accessible across their platform and scale it across to all their users across different verticals.”[2]

To be accessible, data must be stored somewhere in some form. Most of us understand what a database or a dataset is, but many people are less familiar with other forms of data storage and data management, such as a data lake, a data warehouse, a data fabric, and a data mesh. Below I will try to provide a quick primer on some of these concepts.

Types of Data Storage and Data Management

Data Lake. Heine Krog Iversen, Founder and CEO of TimeXtender, notes, “While data lakes and data warehouses are both important Data Management tools, they serve very different purposes.”[3] He continues, “Data warehouses are typically built from data lakes. Data lakes are data repositories that store data in its raw form. Data lakes emphasize data storage rather than Data Management, by allowing data to be stored in whatever format is most convenient at the time of storage. This allows for easier discovery and analysis of data due to fewer restrictions on how data needs to be formatted or structured before being loaded into the data lake.”

Data Warehouse. Iversen writes, “Data warehouses are similar to data lakes in that they support storing data from multiple sources. In fact, data warehouses often combine data from multiple databases and data lakes. However, data warehouses are designed specifically for data analysis purposes, so data needs to be cleansed, formatted, and prepared before being loaded into the data warehouse where it can be queried or analyzed. … You can think of a data warehouse as a ‘clean’ data store where data is carefully separated, cleansed, and structured, allowing you to quickly extract actionable insights. Data warehouses typically also provide Data Governance and Data Management capabilities, along with better security options.”
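
To make the lake/warehouse distinction concrete, here is a minimal Python sketch. It is only an illustration: the record fields, the cleanse function, and the in-memory lists standing in for storage are all assumptions of mine, not anything taken from Iversen's article or from a real product. The lake keeps every record exactly as it arrived, while the warehouse accepts only rows that have been cleansed and given consistent types.

```python
from datetime import date

# Hypothetical raw records from two sources; formats are deliberately inconsistent.
raw_records = [
    {"source": "web", "customer": " Alice ", "amount": "19.99", "day": "2023-01-05"},
    {"source": "store", "customer": "BOB", "amount": "5", "day": "2023-01-06"},
    {"source": "web", "customer": "", "amount": "n/a", "day": "2023-01-07"},
]

# Data lake (sketch): keep everything as it arrives, with no schema enforced.
data_lake = list(raw_records)


def cleanse(record):
    """Return a typed, normalized row, or None if the record is unusable."""
    name = record["customer"].strip().title()
    try:
        amount = float(record["amount"])
    except ValueError:
        return None  # amount is not a number, so the row is rejected
    if not name:
        return None  # a row without a customer name is rejected
    return {
        "customer": name,
        "amount": amount,
        "day": date.fromisoformat(record["day"]),
    }


# Data warehouse (sketch): only cleansed, structured rows are loaded.
data_warehouse = [row for row in map(cleanse, data_lake) if row is not None]

# Because the warehouse rows are typed, analysis is a one-liner.
total_sales = sum(row["amount"] for row in data_warehouse)
```

In this sketch the lake still holds all three records, including the unusable one, which is why raw data stays available for later uses that the cleansing rules did not anticipate.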

Data Lakehouse. Communications specialist Stan Gibson explains why a data lakehouse is needed. He writes, “Traditionally, organizations have maintained two systems as part of their data strategies: a system of record on which to run their business and a system of insight such as a data warehouse from which to gather business intelligence (BI). With the advent of big data, a second system of insight, the data lake, appeared to serve up artificial intelligence and machine learning (AI/ML) insights. Many organizations, however, are finding this paradigm of relying on two separate systems of insight untenable.”[4] Pablo Junco Boquer, a Senior Director at Microsoft, notes, “The concept of data lakehouse [was] introduced by Databricks in 2020. Databricks is proposing the idea of a single data lakehouse capable of supporting ML and analytics in the same place and avoiding silos.”[5] Futurist and business consultant Bernard Marr adds, “A data lakehouse is built to house both structured and unstructured data. This means that businesses that can benefit from working with unstructured data (which is pretty much any business) only need one data repository rather than requiring both warehouse and lake infrastructure.”[6]

Data Fabric. Tech journalist Alex Woodie reports, “Forrester analyst Noel Yuhanna was among the first individuals to define the data fabric back in the mid-2000s. Conceptually, a big data fabric is essentially a metadata-driven way of connecting a disparate collection of data tools that address key pain points in big data projects in a cohesive and self-service manner. Specifically, data fabric solutions deliver capabilities in the areas of data access, discovery, transformation, integration, security, governance, lineage, and orchestration.”[7] Mike Ashwell, Vice President and General Manager of Data Management at Mastech InfoTrellis, explains the concept this way: “Data fabric is a design concept that integrates data sources into a layer, with connected processes. It drives the continuous analysis of discoverable metadata assets across a hybrid or multi-cloud environment, to produce business-critical intelligence.”[8]
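
The "metadata-driven" idea can be sketched in a few lines of Python. This is a toy model under my own assumptions: the DatasetEntry class, the tags, and the location strings are all hypothetical, and real data fabric products do far more (security, lineage, orchestration). The point is only that users discover data through a shared metadata layer rather than by knowing where each dataset physically lives.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one dataset; the data itself stays where it is."""
    name: str
    location: str  # hypothetical location string, not a real URI scheme
    owner: str
    tags: set = field(default_factory=set)


# A single catalog spanning otherwise disconnected storage systems.
catalog = [
    DatasetEntry("orders", "warehouse://sales/orders", "sales", {"finance", "customer"}),
    DatasetEntry("clickstream", "lake://raw/clicks", "web", {"customer"}),
    DatasetEntry("sensor_logs", "lake://raw/sensors", "ops", {"iot"}),
]


def discover(tag):
    """Return the names of all datasets carrying the given tag, sorted by name."""
    return sorted(entry.name for entry in catalog if tag in entry.tags)


def locate(name):
    """Look up where a dataset lives using only the metadata layer."""
    for entry in catalog:
        if entry.name == name:
            return entry.location
    raise KeyError(name)
```

For example, discover("customer") finds both the warehouse table and the lake dataset, even though they sit in different systems.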

Data Mesh. Woodie writes, “While a data mesh aims to solve many of the same problems as a data fabric — namely, the difficulty of managing data in a heterogenous data environment — it tackles the problem in a fundamentally different manner. In short, while the data fabric seeks to build a single, virtual management layer atop distributed data, the data mesh encourages distributed groups of teams to manage data as they see fit, albeit with some common governance provisions. The data mesh concept was first written down by Zhamak Dehghani, who [was, at the time,] the director of next tech incubation at Thoughtworks North America.” He adds, “The key insight that Dehghani brought to bear on the problem is that data transformation cannot be hardwired into the data by engineers, but instead should be a sort of filter that is applied on a common set of data that’s available to all users. So instead of building a complex set of ETL pipelines to move and transform data to specialized repositories where the various communities can analyze it, the data is retained in roughly its original form, and a series of domain-specific teams take ownership of that data as they shape the data into a product.” According to Mehta, “The main advantage that the data mesh offers is that the self-service Infrastructure-as-a-Platform provides the teams that requisition data along with monitoring, logging, alerting, and standardization — all with a standard process that is the same across the board and which is also domain agnostic.”
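
The "filter applied on a common set of data" idea can be illustrated with a short Python sketch. Everything here is hypothetical: the event fields, the two domain teams, and the product names are assumptions chosen for illustration, not part of Dehghani's or Woodie's description. The shared data is kept in roughly its original form, and each domain team owns a function that shapes it into its own data product.

```python
# Shared events retained in roughly their original form (hypothetical fields).
shared_events = [
    {"order_id": 1, "sku": "A1", "qty": 2, "price": 10.0, "region": "EU"},
    {"order_id": 2, "sku": "B7", "qty": 1, "price": 40.0, "region": "US"},
    {"order_id": 3, "sku": "A1", "qty": 5, "price": 10.0, "region": "US"},
]


def sales_revenue_by_region(events):
    """Data product owned by a hypothetical sales domain team."""
    revenue = {}
    for event in events:
        region = event["region"]
        revenue[region] = revenue.get(region, 0.0) + event["qty"] * event["price"]
    return revenue


def inventory_units_by_sku(events):
    """Data product owned by a hypothetical inventory domain team."""
    units = {}
    for event in events:
        units[event["sku"]] = units.get(event["sku"], 0) + event["qty"]
    return units


# Common, domain-agnostic access path: every product is read the same way.
data_products = {
    "sales.revenue_by_region": sales_revenue_by_region,
    "inventory.units_by_sku": inventory_units_by_sku,
}


def read_product(name):
    """Apply the owning team's transformation to the shared data on request."""
    return data_products[name](shared_events)
```

Note that neither team copies the events into a specialized repository; each transformation is applied to the same shared data when the product is read.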

Concluding Thoughts

To remain competitive, enterprises must gather, store, and analyze data in the manner that best allows them to leverage it. Mehta notes, “The architecture comes when the needs are properly defined, the data understood, and the processes in the organization accounted for.” Sheffi concludes, “Companies that are able to create granular, accurate demand forecasts that can be modified on the fly in response to unexpected demand shifts, and closely tie these to manufacturing and delivery operations, can avoid missteps and stay ahead of the competition. … Big data and advanced analytics can take these capabilities to unimaginable levels. These technologies will yield insights into customer buying behaviors that enable companies to track market fluctuations more precisely, and to anticipate shifts in buying preferences with remarkable prescience.” All these benefits begin with gathering the right data, then storing and managing it in the most advantageous way.

Footnotes
[1] Yossi Sheffi, “What is a Company’s Most Valuable Asset? Not People,” LinkedIn, 19 December 2018.
[2] Yash Mehta, “Data Fabric vs. Data Mesh: Key Differences and Similarities,” RT Insights, 15 May 2022.
[3] Heine Krog Iversen, “Evaluating Data Lakes vs. Data Warehouses,” Dataversity, 21 March 2022.
[4] Stan Gibson, “The rise of the data lakehouse: A new era of data value,” CIO, 18 August 2022.
[5] Pablo Junco Boquer, “Evolving Big Data Strategies With Data Lakehouses And Data Mesh,” Forbes, 24 August 2022.
[6] Bernard Marr, “What Is A Data Lakehouse? A Super-Simple Explanation For Anyone,” Forbes, 18 January 2022.
[7] Alex Woodie, “Data Mesh Vs. Data Fabric: Understanding the Differences,” Datanami, 25 October 2021.
[8] Mike Ashwell, “The Ideal Data Fabric Architecture for Business Transformation,” Dataversity, 9 May 2022.
