Since a data lake requires a significant investment in people and technology, executives demand a sizable business outcome. Advocates of a data lake therefore mostly come up with use cases that need data from diverse sources. If the analysts could have answered the question with data from a single source, they would already have answered it using the existing technology. Pulling data from multiple sources brings multiple stakeholders into the picture, and this is where organizational culture and politics come into play. Let's take a real incident as an example.
The company already had a working solution on a Teradata warehouse stack that used data from three sources. The new scenario proposed using data from twelve sources. We suffered a considerable backlash from the existing supply chain optimization (SCO) group because talk of a new solution threatened them: the current solution was their bread and butter. So they had every incentive to find problems with the new solution, and they had the ways and means to bring it down, since they owned all the SCO business rules. They did what they could to make the new venture fail. What could have been a way out of this logjam? The company should have involved and trained that same group in the new venture; then they would have given their best to make the data lake a success.
The term ‘data lake’ was made famous by Hortonworks, so data lakes are mostly associated with Hadoop, the technology Hortonworks builds on. One can also create a data lake without Hadoop, but here we will only list Hadoop-based technological challenges.
A Hadoop data lake is suitable for big datasets but is not a smart choice for small ones. Hadoop follows a parallel programming paradigm and stores everything in blocks, typically 128 or 256 megabytes each. If you have less data than that, Hadoop does not perform optimally. For example, if a table has up to a million records, Hadoop may take a minute to respond to a query that Oracle answers in a second. But with a billion records, Hadoop will return the query in two to five minutes, whereas Oracle may take hours or not respond at all.
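To make the size argument concrete, here is a rough back-of-the-envelope sketch. The block size and average row size are illustrative assumptions, not figures from the book or from any particular cluster; the point is only that a small table occupies so few blocks that Hadoop has almost nothing to parallelize.

```python
import math

# Back-of-the-envelope sketch: how many HDFS blocks (and therefore how many
# parallel map tasks) a table occupies. Block size and row size are assumed
# figures for illustration only.
BLOCK_SIZE_MB = 128      # common HDFS default; many clusters raise it to 256 MB
AVG_ROW_BYTES = 200      # assumed average record size

def blocks_needed(row_count: int) -> int:
    """Number of HDFS blocks a table of row_count rows would span."""
    size_mb = row_count * AVG_ROW_BYTES / (1024 * 1024)
    return max(1, math.ceil(size_mb / BLOCK_SIZE_MB))

for rows in (1_000_000, 1_000_000_000):
    print(f"{rows:>13,} rows -> ~{blocks_needed(rows):,} block(s)")

# A million small rows fit in a block or two, so Hadoop can schedule only a
# couple of tasks and its job-startup overhead dominates the runtime.
# A billion rows span over a thousand blocks, which is where the parallel
# scan actually pays for itself.
```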
Hadoop only supports inserts, not edits or deletes, as it was not natively designed for those operations.
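In practice, teams work around this by rewriting data rather than editing it in place. The sketch below shows that pattern in PySpark; the paths, table layout, and column names are hypothetical, chosen only to illustrate the read-modify-overwrite approach.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical sketch: because HDFS files are append-only, an "update" is
# typically done by reading the affected data, applying the change, and
# writing the result back as a full overwrite rather than editing in place.
spark = SparkSession.builder.appName("rewrite-instead-of-update").getOrCreate()

orders = spark.read.parquet("/data/lake/orders")   # hypothetical path

# The "update": derive a corrected copy of the records...
corrected = orders.withColumn(
    "status",
    F.when(F.col("order_id") == 42, F.lit("CANCELLED")).otherwise(F.col("status")),
)

# ...then write the whole dataset (or partition) out again; nothing is
# modified in place on HDFS.
corrected.write.mode("overwrite").parquet("/data/lake/orders_v2")
```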
People are used to a paradigm where data and metadata are tightly coupled. Hadoop decouples the two: data is written first, and a schema is applied only when it is read. Many people have a tough time understanding this shift.
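The following PySpark snippet is a minimal sketch of that schema-on-read idea; the file path and field names are made-up examples. The raw files carry no schema of their own, and the structure is supplied only at the moment of reading.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Illustrative schema-on-read sketch: the schema lives in the reading code,
# not in the stored files. Path and column names are assumptions.
spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("channel", StringType()),
])

# The same raw CSV files could be read tomorrow with a different schema;
# data and metadata stay decoupled until read time.
sales = spark.read.csv("/data/lake/raw/sales/", schema=schema, header=True)
sales.show(5)
```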
The open ecosystem allows everyone to add their own functionality, which only a few people understand, and that introduces bugs across the entire system. For instance, Ambari has a feature that automatically restarts a service when it goes down. Why? Because nobody could figure out why the services were going down in the first place.
An Oracle developer is only concerned about one core product, Oracle, but a Hadoop developer has to worry about Hive, ZooKeeper, HBase, Flume, Sqoop, NiFi, Impala, Druid, and more.
What you should do now