Since a data lake requires a significant investment in people and technology, executives demand a sizable business outcome. Advocates of a data lake therefore mostly come up with use cases that need data from diverse sources. If the analysts could have answered the question with data from a single source, they would already have answered it using the existing technology. Pulling data from multiple sources brings multiple stakeholders into the picture, and this is where organizational culture and politics come into play. Let's take a real incident as an example.
The company already had a working solution on a Teradata warehouse stack that used data from three sources. The new scenario proposed using data from twelve sources. We suffered a considerable backlash from the existing supply chain optimization (SCO) group because talk of a new solution threatened them: the current solution was their bread and butter. So they had every incentive to find problems with the new solution, and they had the ways and means to bring it down, since they owned all the SCO business rules. They did what they could to make the new venture fail. What could have been a way out of this logjam? The company should have involved and trained that same group in the new venture; then they would have given their best to make the data lake a success.
The term ‘data lake’ was made famous by Hortonworks, so data lakes are mostly associated with Hadoop, the technology Hortonworks builds on. One can also create a data lake without Hadoop, but here we will only list Hadoop-based technological challenges.
A Hadoop data lake is suitable for big datasets but is not a smart choice for small ones. Hadoop follows a parallel programming paradigm and stores everything in blocks, typically 128 or 256 megabytes each. If you have less data than that, Hadoop does not perform optimally. For example, if a table has up to a million records, Hadoop may take a minute to respond to a query that Oracle answers in a second. But with a billion records, Hadoop will return the query in two to five minutes, whereas Oracle may take hours or not respond at all.
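To make the size argument concrete, here is a rough back-of-the-envelope sketch. The block size and average row size are illustrative assumptions, not figures from the book or from any particular cluster; the point is only that a small table occupies so few blocks that Hadoop has almost nothing to parallelize.

```python
import math

# Back-of-the-envelope sketch: how many HDFS blocks (and therefore how many
# parallel map tasks) a table occupies. Block size and row size are assumed
# figures for illustration only.
BLOCK_SIZE_MB = 128      # common HDFS default; many clusters raise it to 256 MB
AVG_ROW_BYTES = 200      # assumed average record size

def blocks_needed(row_count: int) -> int:
    """Number of HDFS blocks a table of row_count rows would span."""
    size_mb = row_count * AVG_ROW_BYTES / (1024 * 1024)
    return max(1, math.ceil(size_mb / BLOCK_SIZE_MB))

for rows in (1_000_000, 1_000_000_000):
    print(f"{rows:>13,} rows -> ~{blocks_needed(rows):,} block(s)")

# A million small rows fit in a block or two, so Hadoop can schedule only a
# couple of tasks and its job-startup overhead dominates the runtime.
# A billion rows span over a thousand blocks, which is where the parallel
# scan actually pays for itself.
```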
Hadoop only supports inserts, not edits or deletes, as it was not natively designed for those operations.
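In practice, teams work around this by rewriting data rather than editing it in place. The sketch below shows that pattern in PySpark; the paths, table layout, and column names are hypothetical, chosen only to illustrate the read-modify-overwrite approach.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical sketch: because HDFS files are append-only, an "update" is
# typically done by reading the affected data, applying the change, and
# writing the result back as a full overwrite rather than editing in place.
spark = SparkSession.builder.appName("rewrite-instead-of-update").getOrCreate()

orders = spark.read.parquet("/data/lake/orders")   # hypothetical path

# The "update": derive a corrected copy of the records...
corrected = orders.withColumn(
    "status",
    F.when(F.col("order_id") == 42, F.lit("CANCELLED")).otherwise(F.col("status")),
)

# ...then write the whole dataset (or partition) out again; nothing is
# modified in place on HDFS.
corrected.write.mode("overwrite").parquet("/data/lake/orders_v2")
```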
People are used to a paradigm where data and metadata are tightly coupled. Hadoop decouples the two: data is written first, and a schema is applied only when it is read. Many people have a tough time understanding this shift.
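The following PySpark snippet is a minimal sketch of that schema-on-read idea; the file path and field names are made-up examples. The raw files carry no schema of their own, and the structure is supplied only at the moment of reading.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Illustrative schema-on-read sketch: the schema lives in the reading code,
# not in the stored files. Path and column names are assumptions.
spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("channel", StringType()),
])

# The same raw CSV files could be read tomorrow with a different schema;
# data and metadata stay decoupled until read time.
sales = spark.read.csv("/data/lake/raw/sales/", schema=schema, header=True)
sales.show(5)
```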
The open ecosystem allows everyone to add their own functionality, which only a few people understand, and that introduces bugs across the entire system. For instance, Ambari has a feature that automatically restarts a service when it goes down. Why? Because nobody could figure out why the services were going down in the first place.
An Oracle developer is only concerned about one core product, Oracle, but a Hadoop developer has to worry about Hive, ZooKeeper, HBase, Flume, Sqoop, NiFi, Impala, Druid, and more.
What you should do now