The Story of the Data Lake

How the seeds of the data lake were sown

At the start of the new millennium, the business world was witnessing a change. Unstructured data was being generated at unprecedented volume and velocity, and it demanded a revolution in data storage and processing. This data seemed to hold tremendous value, but traditional relational databases and business intelligence software were unable to handle it. So what could? As it happened, in 2003 and 2004 Google published a pair of research papers about its search engine technology. These papers described the Google File System (GFS), a means of storing data across distributed machines, and Google MapReduce, a distributed number-crunching platform that runs atop GFS. They paved the way for some techies to invent a massive data storage and processing system.

Eventually, when the idea of a data lake took shape, the cry of joy that erupted was YAHOO! That's right! It was Yahoo that bootstrapped Hadoop, one of the most influential software technologies of the last five years. Hadoop could make the idea of a data lake materialize: a data repository that could hold all kinds of data in its native format. Yahoo formed a team to work on Hadoop. But midway, one of the founding engineers broke off and started to build his own company, Cloudera, along with an engineer from Oracle.

The cause gathers mass

Rob Bearden, a serial entrepreneur from Atlanta, Georgia, was keeping a close eye on big data technology. He saw the rapid rise of Cloudera and felt this technology could reshape the way big businesses operated. He wanted to start a new company, but with an experienced team that would give it a head start. So he emailed a Hadoop software lead at Yahoo known as Eric 14, because his fourteen-letter last name was a mouthful. Together, within six months, they convinced the Yahoo board to spin off Eric and about 24 other engineers. Dubbed Hortonworks, this new venture had its hands on the right technology. Moreover, Rob fueled the enterprise with $100 million of venture capital money. A data lake built on Hadoop, an open-source platform designed to crunch vast amounts of data using an army of dirt-cheap servers, was taking shape.

Success at Web outfits

Today, Hadoop underpins not only Yahoo but also Facebook, Twitter, eBay, and dozens of other high-profile web outfits. Given its success on the web, data lake technology is primed for use in the corporate data world. In today's internet-driven world, more and more data is hitting big businesses, and a data lake is a way of dealing with that data.

The roadblocks to a data lake

“Change is hardest at the beginning, messiest in the middle, and best at the end.”

Robin Sharma, Bestselling author & leadership speaker

What we are talking about here is a shift in enterprise data management. If bringing about a change in technology is hard, changing people's mindset to embrace that technology is grueling. That made the road to the data lake fraught with risk. The risks became evident when some data lakes failed, and companies became wary of this new technology.

Bearings of an Elephant

Hadoop has a long history with elephants. It was named after the yellow stuffed elephant that once belonged to the son of one of its founders, Doug Cutting. The elephant has rubbed off not only on its name but also on the way it works. A storage system as massive as a data lake has every chance of becoming unwieldy. And it does.

All data in one place isn't enough

If we want to extract value from all the data brought to one place, we have to take a few more steps. We have to organize and catalog the data so that we can find it when needed. It took some companies nearly a year to do that. This dampened the enthusiasm of businesses that were looking for quick returns.

How can we make this vital change in data technology less messy?

When we studied in detail why some data lakes failed, an exciting solution emerged: the VIRTUAL DATA LAKE. We can understand this concept better by comparing it to a retail scenario.

Traditional Approach: ERP, CRM, Work Mgmt., HR, Salesforce ARE LIKE Mom and Pop stores

We can liken the various applications that generate data to mom-and-pop stores. You have to go to several such outlets to get all the things you need.

Data Lake Approach: A data lake IS LIKE Walmart

A data lake is a massive repository like Walmart, which physically brings in all kinds of products from various manufacturers. The challenge with this approach is organizing the data so that it can be easily searched and accessed.

Virtual Data Lake Approach: A virtual data lake IS LIKE Amazon

Amazon is a virtual marketplace that lets you buy things from various companies. What sits in the central place is the information about the merchandise, not the merchandise itself. Moreover, that information is well organized, searchable, and comes with recommendations. A virtual data lake works on a similar concept. You bring only the metadata to a central place and NOT the data. Whenever the data is needed, you access it from where it is natively stored.
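
To make the idea concrete, here is a minimal sketch in Python of how a central metadata catalog might sit in front of data that stays in its source systems. It is an illustration only, not a description of any particular product: the class names (CatalogEntry, VirtualDataLake), the dataset names, the tags, and the in-memory lists standing in for CRM and ERP systems are all assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CatalogEntry:
    """Metadata about one dataset; the data itself is NOT stored here."""
    name: str
    source: str                        # hypothetical source system label
    tags: List[str]
    fetch: Callable[[], List[dict]]    # reads from the source only on demand


@dataclass
class VirtualDataLake:
    """Central, searchable catalog of metadata (hypothetical sketch)."""
    entries: Dict[str, CatalogEntry] = field(default_factory=dict)

    def register(self, entry: CatalogEntry) -> None:
        # Only the metadata and a way to reach the data are centralized.
        self.entries[entry.name] = entry

    def search(self, tag: str) -> List[str]:
        # Search runs over the metadata, never over the data.
        return sorted(e.name for e in self.entries.values() if tag in e.tags)

    def read(self, name: str) -> List[dict]:
        # The data is pulled from where it natively lives, when needed.
        return self.entries[name].fetch()


# In-memory stand-ins for two source systems (assumed for illustration).
crm_rows = [{"customer": "Acme", "region": "EMEA"}]
erp_rows = [{"order_id": 1, "customer": "Acme", "amount": 250.0}]

lake = VirtualDataLake()
lake.register(CatalogEntry("customers", "CRM", ["customer", "sales"], lambda: crm_rows))
lake.register(CatalogEntry("orders", "ERP", ["sales", "finance"], lambda: erp_rows))

sales_datasets = lake.search("sales")   # found through metadata alone
orders = lake.read("orders")            # fetched from the ERP stand-in
```

In this sketch, registering a dataset copies nothing: the catalog keeps a name, a few tags, and a function that knows how to reach the source. A real system would replace the lambdas with connectors to the actual applications.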