The Grouparoo Blog
When it comes to storing large volumes of data, a simple database will be impractical due to the processing and throughput inefficiencies that emerge when managing and accessing big data. This article looks at the options available for storing and processing big data, which is too large for conventional databases to handle.
There are two main options available, a data lake and a data warehouse. We'll look at the differences and compare and contrast the factors that affect choosing data lakes vs data warehouses.
A data warehouse is a unified repository where data from diverse sources undergo aggregation and integration into a usable source of information. To achieve this, a data warehouse will require processes to gather and integrate data, manage data quality, create metadata, and support any regulatory compliance and governance procedures.
The essential traits of a data warehouse are:
- Data organization uses subject-orientated processes in a defined data model to support access.
- The integration of data from separate sources becomes a self-consistent data set with the removal of duplications and flagging of inconsistencies or, if possible, their resolution.
- Data storage uses a non-volatile environment with strict management controls on the modification and deletion of data.
- Data storage over significant periods enables the mining of historical information for trend and forecast purposes.
The storage of data in the warehouse can employ on-time processes for real-time updates or offline processes that update data periodically following a defined time cycle.
There are two main techniques for storing information in a data warehouse: a dimensional approach or a normalized approach.
- In a dimensional approach, data partitioning techniques separately store facts and dimensions. The facts are valuable information, and the dimensions provide context to these facts. Typically, organizational business processes and systems define the facts, while dimensions provide the metrics for the facts. For example, the facts may detail information related to sales volumes for products will the dimensions give the date, destination, and customer details for the sales.
- In a normalized approach, data structuring extends the database approach with tables grouped by subject with complex connections providing relational context to the table entries. This approach offers greater flexibility for integrating new data sources into the data warehouse at the expense of clarity when viewing the stored data.
The fundamental purpose of a data warehouse is the aggregation of information from diverse sources to inform data-driven decision-making processes.
A data lake is a central repository where unprocessed data collected from different sources moves into storage in its original format where processes can access the data. There is no processing to integrate and manage data, including quality checks or detect inconsistencies, duplications, or discrepancies. The data retains its structure, irrespective of the data structure already present in the data lake.
Data lakes are unconstrained by the nature of the data they collect and store, offering total flexibility for storing big data. This provides significant benefits to storing data generated by sensors where the relationships and dependencies between data are unknown, and the raw information's integrity is critical. Typical applications are in scientific experimentation and observation processes where data consumers will not fully understand the nature of the data until after the completion of data processing and analysis. A data lake offers the ideal solution for storing such data of unknown relationships.
When looking at the difference between data lake and data warehouse, the following key properties distinguish data lakes vs data warehouses.
Like databases, data warehouses can only store structured data, while data lakes can store both structured and unstructured data.
The collection and aggregation of data into a data warehouse requires data transformation processing to ensure the data complies with the data model used by the data warehouse before its storage. The term for this approach is a schema-on-write process and is the same method that databases employ. Data lakes accept and store raw data in any format. Any formatting or structuring performance only occurs once consumer processes retrieve the data by accessing the data lake. The term for this approach is a schema-on-read process.
The highly structured nature of data warehouses limits their flexibility to change; implementing any change to the data model will require the transformational processing of the entire data warehouse content. This resource-heavy process will restrict access to the data during the transformation process. The benefit to consumers of the data is that process that consumes the data will have a clear definition of the structure of the entire data set.
Data lakes impose no structural constraints on the data they hold and accept data in any new format without impacting access to the data. However, adapting processing to changes in structure and models will affect those applications that consume the data. This places the burden on data consumers to understand and transform the data within the data lake to extract information from diverse structures.
Data warehouse solutions are available that offer robust security controls to manage the confidentiality, integrity, and availability of the stored data as necessary for the sensitivity of the data. The application of security controls can be at the top level across all stored data and specific records or data sets within the stored data where pragmatically required. In addition, data transformational processing can include established techniques such as encryption and hashing to manage security on acquired data before storage. Data lakes, by contrast, are immature solutions with limited security controls available to protect raw data held in aggregated form.
The structured nature of data held in data warehouses enhances the useability of the data across a broad user base that lends itself to general business processes. Conversely, the unstructured nature of data held in data lakes limits useability to data processing specialists, researchers, and scientific communities.
While data warehouses typically require less storage space than the equivalent data lake due to data aggregation and rationalization processes, the structured storage requirements of a data warehouse are more costly when compared with the corresponding unstructured data lake resources. In addition, data warehouses require more complex processing and management applications that also increase costs. Data lakes, by contrast, have minimal processing and management requirements, with the complexity and cost of processing shifted over to the applications that consume the data held in a data lake.
Data lakes are becoming more popular due to financial benefits. Still, for the business user, this decision can have wide-ranging implications on business processes and practices, creating additional costs that outweigh the initial savings.
Data warehousing processes can ensure that only data above a quality threshold that the business requires will move into storage and be available for use. Data lakes, by contrast, will store all the data presented for storage, irrespective of its quality and usefulness.
Data warehousing processes can ensure data acquisition is only from trusted and verified sources, disregarding any data that fails to meet defined acceptability policies. Data lakes, by contrast, will store all the data presented for storage, irrespective of its source.
Data warehousing processes can access specific information quicker than data lake processes due to the structured nature of the data, reducing latency for real-time information.
Data warehousing processes can prioritize data based on the value to the business of the information it holds to maximize productivity and ensure decision-making processes use the best available data. Data lakes will simply present all the data on request, irrespective of its quality and usefulness.
In this article, we have looked at what is a data lake vs data warehouse to understand which option is most suitable for significant volume data storage. Although both options have their advantages and disadvantages, ultimately, the choice will depend on the nature of the data, its intended purpose, and the nature of the data consumers.
The choice of data warehouse vs data lake should rest on how you intend to use the data and leverage its inherent information rather than simple financial costs for the storage solution. Cost-benefit analysis needs to consider the overall end-to-end business process and not focus on segments in isolation. Correctly choosing whether you use a data lake vs warehouse will have significant business impacts, so consider all these factors.
featured image via unsplash
Tagged in Data
See all of Stephen Mash's posts.
Stephen is a UK-based freelance technology writer with a background in system development and assurance, primarily focused on high-integrity applications.
Learn more about Stephen @ https://www.linkedin.com/in/steve-mash-exosure