The story of the data lake

How the seeds of the data lake were sown

At the start of the new millennium, the business world was witnessing a change. Unstructured data was being generated at unprecedented volume and velocity, and it demanded a revolution in data storage and processing. This data seemed to hold tremendous value, but traditional relational databases and business intelligence software were unable to handle it. So what could?

In 2003 and 2004, Google published a pair of research papers about its search engine technology. The papers described the Google File System (GFS), a means of storing data across distributed machines, and Google MapReduce, a distributed number-crunching platform that runs atop GFS. They paved the way for some techies to invent a massive data storage and processing system.

Eventually, when the idea of a data lake took shape, the cry of joy that erupted was YAHOO! That’s right! It was Yahoo that bootstrapped one of the most influential software technologies of the last five years: Hadoop. Hadoop could make the idea of a data lake materialize: a data repository that could hold all kinds of data in its native format. Yahoo formed a team to work on Hadoop. But midway, one of the founding engineers broke off and started to build his own company, Cloudera, along with an engineer from Oracle.

The cause gathers mass

Rob Bearden, a serial entrepreneur from Atlanta, Georgia, was keeping a hawk’s eye on big data technology. He saw the rapid rise of Cloudera and felt this technology could reshape the way big businesses operated. He wanted to start a new company, but with an experienced team to get a head start. So he sent an email to a Hadoop software lead at Yahoo known as Eric 14, because his fourteen-letter last name was a mouthful. Together, within six months, they convinced the Yahoo board to spin off Eric and about 24 other engineers. Dubbed Hortonworks, this new venture indeed had its hands on the right technology. Moreover, Rob fueled the enterprise at this point with $100 million of venture capital money. A data lake built on Hadoop, an open-source platform designed to crunch vast amounts of data using an army of dirt-cheap servers, was taking shape.

Success at Web outfits

Today, Hadoop underpins not only Yahoo but also Facebook, Twitter, eBay, and dozens of other high-profile web outfits. After its success on the web, data lake technology is primed for use in the corporate data world. In today’s internet-driven world, more and more data is hitting big businesses, and a data lake is a way of dealing with that data.

The roadblocks to a data lake

“Change is hardest at the beginning, messiest in the middle, and best at the end.”

Robin Sharma, Bestselling author & leadership speaker

What we are talking about here is a shift in enterprise data management. If bringing a change in technology is hard, changing people’s mindset to embrace that technology is grueling. That made the road to the data lake fraught with risk. The risks became evident when some data lakes failed, and companies became wary of this new technology.

Bearings of an elephant

Hadoop has a long history with elephants. It was named after the yellow stuffed elephant that once belonged to the son of one of its founders, Doug Cutting. The elephant has rubbed off not only on its name but also on the way it works. A storage system as massive as a data lake has every chance of becoming unwieldy. And it does.

All data in one place isn’t enough

If we want to extract value from all the data brought to one place, we have to take a few more steps. We have to organize and catalog the data so that we can find it when needed. It took some companies nearly a year to do that, which dampened the enthusiasm of businesses that were looking for quick returns.

How can we make this vital change in data technology less messy?

When we studied in detail why some data lakes failed, an exciting solution emerged: the virtual data lake. We can understand this concept better by comparing it to a retail scenario.

Virtual data lake approach: A virtual data lake is like Amazon

Amazon is a virtual platform that lets you buy things from various companies. It is a virtual marketplace. What sits in the central place is the information about the merchandise, not the merchandise itself. Moreover, that information is well organized and searchable, and it comes with recommendations. A virtual data lake works on a similar concept. You bring only the metadata to a central place, NOT the data. Whenever the data is needed, you access it from where it is natively stored.
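To make the marketplace analogy concrete, here is a minimal Python sketch of the idea, not a description of any particular product. It assumes a tiny in-memory catalog; the names `CatalogEntry`, `VirtualDataLake`, `fetch_orders`, and `fetch_clicks`, as well as the sample datasets, are hypothetical and exist only for illustration. The catalog holds searchable metadata, and rows are read from the source system only when someone asks for them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CatalogEntry:
    """Metadata about one dataset; the data itself stays at its source."""
    name: str
    source: str                      # hypothetical label for the owning system
    description: str
    fetch: Callable[[], List[dict]]  # reads rows from the source on demand


class VirtualDataLake:
    """Central, searchable metadata catalog (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Only metadata is stored centrally; no rows are copied here.
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        # Case-insensitive match on dataset name and description.
        kw = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if kw in e.name.lower() or kw in e.description.lower()
        )

    def read(self, name: str) -> List[dict]:
        # Data is pulled from its native location only when requested.
        return self._entries[name].fetch()


# Stand-ins for two native systems (in practice: a database, a file store, ...).
def fetch_orders() -> List[dict]:
    return [{"order_id": 1, "amount": 40.0}, {"order_id": 2, "amount": 15.5}]


def fetch_clicks() -> List[dict]:
    return [{"user": "a", "page": "/home"}]


lake = VirtualDataLake()
lake.register(CatalogEntry("orders", "sales_db",
                           "Customer orders and amounts", fetch_orders))
lake.register(CatalogEntry("clicks", "web_logs",
                           "Website click events per customer", fetch_clicks))

matches = lake.search("customer")  # finds datasets by metadata alone
rows = lake.read("orders")         # touches the source only at this point
```

In this sketch, `search` plays the role of browsing the marketplace listings, while `read` corresponds to the merchandise being shipped from the seller once you actually order it.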
