Challenges of a Data Lake
Data lake technology emerged recently and rapidly, so it has not matured yet, and that creates challenges in its implementation. Further, creating a data lake can require millions in investment. Companies are therefore not exactly gung-ho about something that demands considerable investment and is still fraught with risk. Because the technology is still emerging, a data lake cannot promise tangible benefits quickly. We discuss some more challenges of a data lake below.
Identification of a use case
It is a catch-22 situation: until you create a data lake, you can't find a use case, and until you have a good use case, you cannot get money sanctioned for implementing a data lake.
Since a data lake requires a significant investment in people and technology, executives demand a sizable business outcome. So the advocates of a data lake mostly come up with use cases that need data from diverse data sources. If the analysts could have answered the question with data from a single source, they would already have answered it using the existing technology. Getting data from various sources brings multiple stakeholders into the picture, and this is when organizational culture and politics come into play. Let's take a real incident as an example.
We were implementing a solution for supply chain optimization (SCO) at a major company.
They already had a solution built on a Teradata warehouse stack that used data from three sources. The new proposal called for data from twelve sources. We suffered a considerable backlash from the existing SCO group because the talk of a new solution threatened them. The current solution was their bread and butter, so they had every reason to find problems with the new one. Moreover, they had the ways and means to bring it down: they owned all the supply chain optimization rules. They did what they could to make the new venture fail. So what could have been a way out of this logjam? The company should have involved the same group in the new venture and trained them for it. Then they would have given their best to make the data lake a success.
The term ‘data lake’ has been made famous by Hortonworks. Hence data lakes are mostly associated with Hadoop, the technology Hortonworks uses. One can also create a data lake without Hadoop, but here we will list only the Hadoop-based technological challenges.
Large Datasets versus Small Datasets
A Hadoop data lake is suitable for big datasets but not a smart choice for small ones. Hadoop is built for parallel processing; it stores everything in blocks, which are 128 megabytes by default in current versions and are often configured to 256 megabytes. If you have less data than that, Hadoop does not perform optimally, because there is little to parallelize and the fixed overhead of starting a job dominates. For example, if a table has up to a million records, Hadoop might take a minute to respond to a query while Oracle would take a second. But if you have a billion records, Hadoop will return the query in 2-5 minutes, whereas Oracle may not respond or may take hours.
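The block arithmetic behind this can be sketched in a few lines of Python. This is only a rough illustration: the 128 MB block size is an assumed cluster setting, the function name `block_count` is made up for this example, and the idea that each block can be processed by its own parallel task is a simplification.

```python
import math

BLOCK_SIZE_MB = 128  # assumed HDFS block size; the real value is a cluster setting


def block_count(dataset_mb: float, block_mb: int = BLOCK_SIZE_MB) -> int:
    """Return how many blocks a dataset of the given size occupies."""
    return max(1, math.ceil(dataset_mb / block_mb))


small_table = block_count(50)           # a 50 MB table
large_table = block_count(1024 * 1024)  # a 1 TB table

# A single block leaves nothing to parallelize; thousands of blocks do.
print(small_table, large_table)  # prints: 1 8192
```

With one block there is only one unit of work, so the cluster's parallelism is wasted and job start-up overhead dominates. With thousands of blocks the work can be spread across many nodes.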
Updates of data
Hadoop natively supports inserts but not edits and deletes. Its file system was designed for data that is written once and then read many times, not for changing records in place.
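One common way to live with insert-only storage is to append a new version of a record instead of changing the old one, and to reconcile the versions at read time. The toy class below sketches that pattern in plain Python. The class `AppendOnlyTable` and its methods are hypothetical names for this illustration and do not represent how any particular Hadoop component is implemented.

```python
from typing import Any, Dict, List, Optional, Tuple


class AppendOnlyTable:
    """Toy append-only store: records are only ever appended, never modified."""

    def __init__(self) -> None:
        self._log: List[Tuple[str, Optional[Dict[str, Any]]]] = []

    def insert(self, key: str, row: Dict[str, Any]) -> None:
        self._log.append((key, row))

    def update(self, key: str, row: Dict[str, Any]) -> None:
        # An "update" is just another append; the newest record wins on read.
        self._log.append((key, row))

    def delete(self, key: str) -> None:
        # A "delete" appends a tombstone (None) instead of removing anything.
        self._log.append((key, None))

    def snapshot(self) -> Dict[str, Dict[str, Any]]:
        """Replay the log in order and return the latest live row per key."""
        latest: Dict[str, Dict[str, Any]] = {}
        for key, row in self._log:
            if row is None:
                latest.pop(key, None)
            else:
                latest[key] = row
        return latest


table = AppendOnlyTable()
table.insert("A1", {"qty": 1})
table.update("A1", {"qty": 5})
table.insert("B2", {"qty": 7})
table.delete("B2")
print(table.snapshot())  # prints: {'A1': {'qty': 5}}
```

The cost of this approach is that every read has to reconcile the history, and the log keeps growing until someone compacts it. That is part of why updates feel awkward on a platform built around inserts.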
Decoupling of data and metadata
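People are used to a paradigm where data and metadata are tightly coupled, as in a relational database where a table's structure and its rows are managed together. In a data lake the two are kept apart, and people are having a tough time understanding this shift.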
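A small Python sketch can make the decoupling concrete. It assumes the decoupling means that raw data is stored as-is and a separately kept schema is applied only when the data is read. The sample data, the two schemas, and the helper `read_with_schema` are all invented for this illustration.

```python
import csv
import io
from typing import Any, Callable, Dict, List, Tuple

# The data: raw delimited text with no column names or types attached.
RAW_DATA = "1001,2019-03-01,49.90\n1002,2019-03-02,15.00\n"

# The metadata: kept separately, as (column name, type converter) pairs.
Schema = List[Tuple[str, Callable[[str], Any]]]
ORDERS_SCHEMA: Schema = [("order_id", int), ("order_date", str), ("amount", float)]
IDS_ONLY_SCHEMA: Schema = [("order_id", str)]


def read_with_schema(raw: str, schema: Schema) -> List[Dict[str, Any]]:
    """Apply a schema to raw text at read time; the stored data is unchanged."""
    rows = []
    for record in csv.reader(io.StringIO(raw)):
        # zip stops at the shorter input, so a narrower schema reads fewer columns.
        rows.append({name: cast(value) for (name, cast), value in zip(schema, record)})
    return rows


print(read_with_schema(RAW_DATA, ORDERS_SCHEMA)[0])
print(read_with_schema(RAW_DATA, IDS_ONLY_SCHEMA))
```

The same bytes yield different tables depending on which schema is applied. That is flexible, but it also means nothing stops two teams from reading the same file in incompatible ways.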
Being Open Source
Open source allows everyone to add their own functionality, which often only a few people understand, and this can introduce bugs that affect the entire system. For instance, Ambari has a feature that automatically restarts a service if it goes down. Why? The reason is that nobody was able to figure out why services were going down in the first place.
Too many moving parts
An Oracle developer is concerned with only one core product, Oracle, but a Hadoop developer has to worry about Hive, ZooKeeper, HBase, Flume, Sqoop, NiFi, Impala, Druid, etc.