What is a Data Lake?

A data lake is a reservoir that can store vast amounts of raw data in its native format. This data can be:
  • Structured data from relational databases (rows and columns),
  • Structured data from NoSQL databases (like MongoDB, Cassandra, etc.),
  • Semi-structured data (CSV, logs, XML, JSON),
  • Unstructured data (emails, documents, PDFs) and
  • Binary data (images, audio, video).
The purpose of a data lake, a capacious and agile platform, is to hold all of an enterprise's data in one central place. With everything in one place, we can do comprehensive reporting, visualization, and analytics, and eventually glean deep business insights.

How does a data lake work differently from a data warehouse?

Broadly speaking, a data warehouse is a fully schematized data storage and processing platform, whereas a data lake, as the name suggests, is more fluid in how it works. Here are a few things that a data warehouse and a data lake do differently:

Decoupling of metadata and data

In a data warehouse, you first define the metadata (the schema) and then load data that conforms to it. In a data lake, you first ingest the data and then define the metadata around it. Because the metadata is defined after ingestion, you can assign multiple metadata tags to the same data set.
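
The idea of defining metadata after ingestion can be illustrated with a minimal Python sketch. It is not taken from any particular data lake product: the sample records, the two schema dictionaries, and the read_with_schema helper are all hypothetical names made up for illustration. The same raw data is read twice, each time through a different metadata definition.

```python
import json

# Raw records ingested as-is (newline-delimited JSON).
# No schema was declared before loading. Sample data is hypothetical.
RAW_EVENTS = """\
{"user": "u1", "amount": "19.99", "ts": "2024-01-05", "country": "DE"}
{"user": "u2", "amount": "5.00", "ts": "2024-01-06", "country": "US"}
{"user": "u1", "amount": "7.50", "ts": "2024-01-07"}
"""

# Two independent metadata definitions over the same raw data.
# Each maps an output column to (source field, converter).
SALES_SCHEMA = {"customer": ("user", str), "revenue": ("amount", float)}
GEO_SCHEMA = {"customer": ("user", str), "country": ("country", str)}


def read_with_schema(raw_text, schema):
    """Apply a schema at read time; fields missing from a record become None."""
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        row = {}
        for column, (field, convert) in schema.items():
            value = record.get(field)
            row[column] = convert(value) if value is not None else None
        rows.append(row)
    return rows


sales_rows = read_with_schema(RAW_EVENTS, SALES_SCHEMA)
geo_rows = read_with_schema(RAW_EVENTS, GEO_SCHEMA)

print(sales_rows)
print(geo_rows)
```

Neither schema had to exist when the records were stored, and adding a third view later would not require reloading or rewriting the raw data.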

Scalability

A data warehouse typically scales up to a few terabytes, whereas a data lake can store up to a few petabytes of data.

Decoupling of storage and processing

In a data lake, we can store data and process it separately. To learn more about how this is made possible, read about the various technology stacks used in a data lake. Some use cases require more storage, whereas others need more processing power, so we can scale either one independently of the other. This can save the company a lot of money.
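
To see why scaling the two independently can save money, here is a rough back-of-the-envelope sketch in Python. All prices and the per-node disk size are hypothetical numbers chosen only for illustration, not real vendor pricing, and the coupled model (every node bundles a fixed amount of disk with its compute) is a simplifying assumption.

```python
import math

# Hypothetical unit prices, for illustration only (not real vendor pricing).
STORAGE_PRICE_PER_TB = 20.0     # cost per TB per month
COMPUTE_PRICE_PER_NODE = 300.0  # cost per compute node per month
COUPLED_NODE_STORAGE_TB = 10    # disk bundled with each node in the coupled model


def decoupled_cost(storage_tb, compute_nodes):
    """Storage and compute are billed and scaled independently."""
    return storage_tb * STORAGE_PRICE_PER_TB + compute_nodes * COMPUTE_PRICE_PER_NODE


def coupled_cost(storage_tb, compute_nodes):
    """Each node bundles compute with a fixed amount of disk, so the node
    count is driven by whichever of the two needs is larger."""
    nodes_for_storage = math.ceil(storage_tb / COUPLED_NODE_STORAGE_TB)
    nodes = max(nodes_for_storage, compute_nodes)
    node_price = COMPUTE_PRICE_PER_NODE + COUPLED_NODE_STORAGE_TB * STORAGE_PRICE_PER_TB
    return nodes * node_price


# A storage-heavy use case: a lot of data, little processing.
print(decoupled_cost(storage_tb=500, compute_nodes=4))
print(coupled_cost(storage_tb=500, compute_nodes=4))

# A compute-heavy use case: little data, a lot of processing.
print(decoupled_cost(storage_tb=20, compute_nodes=40))
print(coupled_cost(storage_tb=20, compute_nodes=40))
```

Under these assumptions the coupled setup pays for idle compute in the storage-heavy case and for idle disk in the compute-heavy case, while the decoupled setup pays only for what each use case needs.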

Performance

A data warehouse contains comparatively small datasets, so it processes data quickly. A data lake holds much larger datasets, which takes a toll on its processing speed.
