The concept of data lakes is not new, but it is becoming increasingly important, not least because of full-service cloud data solutions, which reduce the biggest risk associated with data lakes: that the lake becomes a swamp. In this short article, we explain what data lakes are, how they can be used alongside, and not necessarily instead of, a data warehouse, and why they are so important.
A data lake is a repository that can store all types of data in their original form: structured, semi-structured and unstructured data. Structured data is strictly defined and standardized, stored in traditional tables and queried using SQL. Semi-structured data has some defined format but does not follow a rigid schema, e.g. CSV files or image file metadata. Unstructured data does not follow a predefined format, e.g. images, videos, audio or free text, and is therefore more difficult to process automatically.
The need to create a data lake is justified for several reasons:
The Diversity of Data
In today’s digital world, not all the data an organization needs is stored in structured form in an SQL database. Streaming data, social media posts, IoT data, and audio or video recordings contain valuable business information that also needs to be analyzed.
Audiences and Their Needs
Different stakeholders want to process different data and have very different needs. BI experts and business analysts want processed, structured data so that they can gain insights with analytical tools. Data scientists are more interested in the raw form and use languages such as Python or R to create models. To avoid creating a separate silo for each audience, the data is consolidated so that everyone gets the same data from the same repository, possibly spanning multiple domains.
Evaluation Time
As the data is stored in its original format when it is added to the data lake, it does not need to be transformed first, so it becomes available for use quickly. This does not mean that the transformation cannot be done at a later stage, especially if a data warehouse is fed from the data lake.
However, not all users need transformed data. Machine learning and streaming data processing are often based on raw data, and a resource-intensive transformation for a data warehouse would slow down real-time processing.
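The idea of storing data untouched and applying structure only when it is read can be illustrated with a minimal sketch. The event fields, the `raw_store` list and both function names are hypothetical and chosen only for illustration; a real data lake would use object storage rather than an in-memory list.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (no upfront schema).
raw_store = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2024-01-01T00:00:00"}',
    '{"device": "sensor-2", "temp_c": "19.0"}',
]


def ingest(store, payload):
    """Append the payload unchanged; nothing is transformed at write time."""
    store.append(payload)


def read_temperatures(store):
    """Apply structure at read time: parse, select and cast the needed fields."""
    rows = []
    for line in store:
        record = json.loads(line)
        rows.append({"device": record["device"], "temp_c": float(record["temp_c"])})
    return rows


# A record with an extra, previously unseen field is accepted as-is.
ingest(raw_store, '{"device": "sensor-3", "temp_c": "23.1", "note": "new field"}')
temperatures = read_temperatures(raw_store)
```

Because the raw payloads are never modified, a different consumer could later read the same store with a different interpretation, for example one that also uses the `note` field.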
Can You Replace a Data Warehouse?
Does the data lake replace the data warehouse? Not necessarily, but it can. A data warehouse is used specifically for data analysis and BI. Raw data from various sources is processed (cleaned, transformed, formatted, merged, etc.) and stored according to a predefined structure. This allows BI teams to work with clean data. When a data warehouse is linked to a data lake, the raw data is first loaded into the data lake, and the data warehouse then becomes one of many consumers of the lake's data. This configuration combines both approaches.
Depending on the complexity of the data warehouse and the analytical functions based on it, the data lake can replace the traditional data warehouse. In a data lake architecture, several zones are usually created.
Depending on the concept or the cloud solution, the zones are named differently: landing, raw data, storage, transformation, transport, production, etc. However, the principle is always the same: data moves from zone to zone until it reaches the gold level, i.e. higher quality and value. The data warehouse can then build on this high-quality data in the classical way, or parts of the data warehouse preparation may become obsolete. If the raw data is still needed in its original form, it remains available in an earlier zone.
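The zone principle can be sketched in a few lines. The zone names `raw`, `cleaned` and `gold`, the `Lake` class and the two transformation functions are assumptions made for this illustration and do not correspond to any particular cloud product.

```python
from dataclasses import dataclass, field

ZONES = ["raw", "cleaned", "gold"]  # assumed zone names, ordered by quality


@dataclass
class Lake:
    zones: dict = field(default_factory=lambda: {zone: [] for zone in ZONES})

    def land(self, record):
        """New data always enters through the raw zone, unchanged."""
        self.zones["raw"].append(record)

    def promote(self, source, target, transform):
        """Copy records into the target zone; the source zone is left intact."""
        for record in self.zones[source]:
            result = transform(record)
            if result is not None:  # None means the record is rejected
                self.zones[target].append(result)


def clean(record):
    """Drop records without an amount, normalize names and cast the amount."""
    if record.get("amount") is None:
        return None
    return {"customer": record["customer"].strip().lower(),
            "amount": float(record["amount"])}


def enrich(record):
    """Add a derived business attribute for the gold zone."""
    return {**record, "is_large_order": record["amount"] >= 100.0}


lake = Lake()
lake.land({"customer": " Alice ", "amount": "120.50"})
lake.land({"customer": "Bob", "amount": None})
lake.land({"customer": "carol", "amount": "15"})
lake.promote("raw", "cleaned", clean)
lake.promote("cleaned", "gold", enrich)
```

Since `promote` copies rather than moves, the three original records stay in the raw zone, so consumers who need the unprocessed form can still read it there.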
Data Governance Is Key
Despite all the benefits, it is important that the data lake does not become a data swamp, that is, a place where data is loaded and then used haphazardly, without further checking or curation. It is therefore important to address data governance issues, i.e. who is responsible for which data, who has access to and can use the data, what internal procedures and policies are in place, etc.
Here are some examples:
Data Catalog
A data catalog is used to document and evaluate all the data available in the organization. It is a central component of a data lake, because the volume and variety of the data make a clear overview essential. Once you know what data exists and who is responsible for it, you can determine how to access and use it.
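As a rough illustration, such an inventory can be modeled as a small registry that records what each dataset is, where it lives and who is responsible for it. The `CatalogEntry` fields, the `DataCatalog` class and the sample datasets are hypothetical; dedicated catalog tools track far more metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    owner: str          # person or team responsible for the dataset
    location: str       # where the data lives in the lake
    data_format: str
    description: str = ""


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Add a dataset; duplicate names are rejected to keep the catalog unambiguous."""
        if entry.name in self._entries:
            raise ValueError(f"dataset already registered: {entry.name}")
        self._entries[entry.name] = entry

    def lookup(self, name):
        """Return the entry for a dataset name (raises KeyError if unknown)."""
        return self._entries[name]

    def owned_by(self, owner):
        """List the names of all datasets a given owner is responsible for."""
        return sorted(e.name for e in self._entries.values() if e.owner == owner)


catalog = DataCatalog()
catalog.register(CatalogEntry("web_clicks", "marketing", "raw/web/clicks/", "json"))
catalog.register(CatalogEntry("orders", "sales", "gold/orders/", "parquet"))
catalog.register(CatalogEntry("campaigns", "marketing", "cleaned/campaigns/", "csv"))
```

With this in place, `catalog.owned_by("marketing")` answers the question of who is responsible for what, and `catalog.lookup("orders").location` tells a consumer where to find the data.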
Ensuring Access Security
As with the architecture, the access concept must be clearly defined from the outset: at what level access is granted (attribute, file, repository, etc.), how it is granted, which user groups and roles are used, etc. Because so much data is gathered in one place (one lake), a clear access concept is required; otherwise, unauthorized access to data and misuse can occur.
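A role-based access check at the level of path prefixes might look like the following sketch. The roles, users, prefixes and the `can_read` function are all invented for this example; managed platforms provide their own, much richer permission models.

```python
# Hypothetical mapping of roles to the lake path prefixes they may read.
ROLE_GRANTS = {
    "data_scientist": {"raw/", "cleaned/"},
    "bi_analyst": {"gold/"},
}

# Hypothetical assignment of users to roles.
USER_ROLES = {
    "dana": {"data_scientist"},
    "ben": {"bi_analyst"},
}


def can_read(user, path):
    """Allow access only if one of the user's roles covers the path (default deny)."""
    for role in USER_ROLES.get(user, set()):
        for prefix in ROLE_GRANTS.get(role, set()):
            if path.startswith(prefix):
                return True
    return False


dana_reads_raw = can_read("dana", "raw/events.json")
ben_reads_raw = can_read("ben", "raw/events.json")
ben_reads_gold = can_read("ben", "gold/sales.parquet")
stranger_reads_gold = can_read("eve", "gold/sales.parquet")
```

The important design choice is the default: a user or path that is not explicitly covered by a grant is denied, so new data added to the lake is not exposed by accident.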
Quality
The fact that a data lake allows the use of unstructured data does not mean that data quality can be neglected. Even within a data lake, transformation and cleaning must be performed at some point. As described above, the data lake can be divided into different zones, and the quality can be improved by transforming the data as it passes through these zones.
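A very simple quality gate between two zones could check for required fields and report what was rejected, as in the sketch below. The function name, the sample records and the choice of "required fields" as the only rule are assumptions for illustration.

```python
def check_quality(records, required_fields):
    """Split records into valid ones and a list of (index, missing_fields) issues."""
    valid, issues = [], []
    for index, record in enumerate(records):
        # A field counts as missing if it is absent, None or an empty string.
        missing = [f for f in required_fields if record.get(f) in (None, "")]
        if missing:
            issues.append((index, missing))
        else:
            valid.append(record)
    return valid, issues


records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"email": "c@example.com"},
]
valid, issues = check_quality(records, ["id", "email"])
```

Returning the issues instead of silently dropping records makes it possible to report quality problems back to the responsible data owner.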
Conclusion
When dealing with large amounts of data in different formats, it is worth considering the concept of a data lake. Cloud solution providers often address data management challenges with data lake solutions that provide tools and methods for creating and managing a data lake in a secure and organized manner.