A data lake is a reservoir that can store vast amounts of raw data in its native format. This data can be:
Structured data from relational databases (rows and columns),
Structured data from NoSQL databases (like MongoDB, Cassandra, etc.),
Semi-structured data (CSV, logs, XML, JSON),
Unstructured data (emails, documents, PDFs) and
Binary data (images, audio, video).
The purpose of a data lake, a capacious and agile platform, is to hold all of an enterprise's data in one central place. This makes it possible to do comprehensive reporting, visualization, and analytics, and eventually to glean deep business insights.
How does a data lake work differently from a data warehouse?
Broadly, a data warehouse is a fully schematized data storage and processing platform, whereas a data lake, as the name suggests, is more fluid in how it works. Given below are a few things that are done differently in a data warehouse versus a data lake:
Decoupling of metadata and data
In a data warehouse, you first define the metadata (the schema) and then add data that conforms to it. In a data lake, you first ingest the data and then define the metadata around it. Because the metadata is defined after the data is stored, you can assign multiple metadata tags to the same data set.
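To make the "ingest first, define metadata later" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the sample records, the two schemas (SALES_SCHEMA, GEO_SCHEMA), and the helper read_with_schema are all hypothetical names invented for this example, not part of any particular data lake product.

```python
import json

# Raw events stored exactly as they arrived; no schema was declared up front.
# (Hypothetical sample data for illustration only.)
RAW_EVENTS = [
    '{"user": "alice", "amount": "19.99", "country": "IN"}',
    '{"user": "bob", "amount": "5.00", "country": "US"}',
]

# Two hypothetical metadata definitions applied to the SAME raw data at read time.
SALES_SCHEMA = {"user": str, "amount": float}
GEO_SCHEMA = {"user": str, "country": str}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and keep only the schema's fields, cast to its types."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        rows.append({field: cast(record[field]) for field, cast in schema.items()})
    return rows


# The same stored data yields two different views, depending on the metadata used.
sales_view = read_with_schema(RAW_EVENTS, SALES_SCHEMA)
geo_view = read_with_schema(RAW_EVENTS, GEO_SCHEMA)
```

A warehouse-style flow would instead reject or transform a record at load time if it did not match the single schema defined in advance. Here the raw lines stay untouched, and each consumer applies its own definition when reading.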
A data warehouse typically scales up to a few terabytes, whereas a data lake can store up to a few petabytes of data.
Decoupling of storage and processing
In a data lake, we can store data and process it separately. To learn more about how this is made possible, read about the various technology stacks used in a data lake. Some use cases require more storage, whereas others need more processing power, so we can scale either of the two independently. This can save the company a lot of money.
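The cost argument can be sketched with a little arithmetic. The Python snippet below is a toy model, not a pricing guide: every capacity and price in it is a made-up number chosen for illustration, and the functions coupled_cost and decoupled_cost are hypothetical helpers. It compares buying bundled nodes (storage and compute together) with paying for each resource separately.

```python
import math

# All capacities and prices below are made-up, illustrative numbers.
NODE_STORAGE_TB = 10         # storage bundled into one coupled node
NODE_COMPUTE_UNITS = 4       # compute bundled into one coupled node
NODE_COST = 1000             # cost of one coupled node
STORAGE_COST_PER_TB = 25     # cost of decoupled storage, per TB
COMPUTE_COST_PER_UNIT = 150  # cost of decoupled compute, per unit


def coupled_cost(storage_tb, compute_units):
    """Coupled scaling: buy enough whole nodes to cover the larger of the two needs."""
    nodes_for_storage = math.ceil(storage_tb / NODE_STORAGE_TB)
    nodes_for_compute = math.ceil(compute_units / NODE_COMPUTE_UNITS)
    return max(nodes_for_storage, nodes_for_compute) * NODE_COST


def decoupled_cost(storage_tb, compute_units):
    """Decoupled scaling: pay for storage and compute independently."""
    return storage_tb * STORAGE_COST_PER_TB + compute_units * COMPUTE_COST_PER_UNIT


# A storage-heavy use case: lots of data, little processing.
storage_heavy = (coupled_cost(500, 8), decoupled_cost(500, 8))

# A compute-heavy use case: little data, lots of processing.
compute_heavy = (coupled_cost(20, 80), decoupled_cost(20, 80))
```

With these assumed numbers, the coupled setup pays for idle compute in the storage-heavy case and for idle storage in the compute-heavy case. Real savings depend entirely on actual prices and workloads.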
A data warehouse contains relatively small datasets, so its data processing speed is good. A data lake holds large datasets, which takes a toll on its processing speed.