Choosing the technology stack for a data lake

A data lake is a sophisticated technology stack that requires integrating numerous technologies for ingestion, processing, and exploration. There are no standard rules for security, governance, operations, and collaboration, which makes things more complicated. On top of that, you have hard SLAs for query processing time and for data ingestion ETL pipelines. Lastly, the solution needs to scale from one user to thousands of users and from one kilobyte of data to a few petabytes.

Because the big data industry is changing rapidly, you need to select technology that is here to stay and robust enough to meet your SLAs. At OvalEdge, our objective is to give customers and prospective customers as much detail as possible about each solution so that they can decide which one best fits their specific needs.

Factors to consider for Technology Stack

A business must weigh several factors before selecting its technology stack. The table below lists those factors and how they fare across three types of infrastructure: on-premise, cloud, and managed services.

 
| Factors        | On-Premise                   | Cloud              | Managed Services   |
|----------------|------------------------------|--------------------|--------------------|
| Maintenance    | Hard                         | Hard               | Easy               |
| Monthly cost   | Economic with large datasets | Predictable        | Predictable        |
| Vendor lock-in | Avoidable                    | Avoidable          | Not avoidable      |
| Suitability    | For large corporations       | For all businesses | Ideal for startups |
| Investment     | Substantial in the beginning | More as data grows | More as data grows |
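
The investment row can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes on-premise spend is one upfront purchase plus a flat monthly running cost, and that cloud spend is billed per terabyte stored each month while the data grows linearly. All function names, parameters, and figures are hypothetical and are not taken from any vendor's price list.

```python
from typing import Optional


def cumulative_on_premise(months: int, upfront: float, monthly_ops: float) -> float:
    """Total spend after `months`: one upfront purchase plus flat running costs."""
    return upfront + monthly_ops * months


def cumulative_cloud(months: int, start_tb: float, growth_tb_per_month: float,
                     price_per_tb_month: float) -> float:
    """Total spend after `months` when each month's bill follows the stored volume."""
    total = 0.0
    for m in range(months):
        stored_tb = start_tb + growth_tb_per_month * m  # volume held during month m
        total += stored_tb * price_per_tb_month
    return total


def break_even_month(upfront: float, monthly_ops: float, start_tb: float,
                     growth_tb_per_month: float, price_per_tb_month: float,
                     horizon_months: int = 120) -> Optional[int]:
    """First month in which cumulative cloud spend exceeds on-premise spend.

    Returns None if that never happens within `horizon_months`.
    """
    for months in range(1, horizon_months + 1):
        cloud = cumulative_cloud(months, start_tb, growth_tb_per_month, price_per_tb_month)
        on_prem = cumulative_on_premise(months, upfront, monthly_ops)
        if cloud > on_prem:
            return months
    return None
```

Plugging in your own estimates for upfront cost, running cost, and data growth gives a rough idea of when a growing dataset tips the balance toward on-premise infrastructure. A real comparison would also need to account for staffing, hardware refresh, and egress charges.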

Storage

Storage options fall into two broad categories: cloud and on-premise. In the cloud, several companies offer managed storage services, including Amazon, Microsoft, and Google. On-premise, the primary option is HDFS (Hadoop Distributed File System).

Amazon S3

S3 is the most widely used storage technology for data lakes in the cloud. By some accounts, one-fourth of the world's data is stored on S3, which speaks to its scalability. However, S3 has other pros and cons to weigh.

Pros
  • Highly scalable
  • Offers enterprise features such as security, backup, and high availability (S3 Standard is designed for 99.99% availability and 99.999999999% durability)
  • Competitive price
Cons
  • Security: Managing S3 security is intricate. Consider the recent Alteryx data exposure as an example. Even technologically advanced companies find it difficult to manage S3 security, and the risk is high.
  • Small files: S3 doesn't perform optimally when you have lots of small files and want to analyze them together.
  • API limitations: Pagination is hard to do through the S3 API, and the API changes so frequently that it is hard to keep up with the latest client release.
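
To illustrate the pagination point, here is a minimal sketch of the continuation-token loop that an S3 object listing requires. The response field names (`Contents`, `IsTruncated`, `NextContinuationToken`) follow the ListObjectsV2 API. The `list_page` callable and the in-memory `make_fake_lister` helper are hypothetical stand-ins introduced here so the loop can run without an AWS account.

```python
from typing import Callable, Iterator, List, Optional


def iter_all_keys(list_page: Callable[[Optional[str]], dict]) -> Iterator[str]:
    """Yield every object key, following continuation tokens until the listing ends.

    `list_page(token)` is a hypothetical wrapper around one ListObjectsV2 call.
    It should return a dict with "Contents", "IsTruncated" and, when the
    listing is truncated, "NextContinuationToken".
    """
    token = None
    while True:
        page = list_page(token)
        for obj in page.get("Contents", []):
            yield obj["Key"]
        if not page.get("IsTruncated"):
            break
        token = page["NextContinuationToken"]


def make_fake_lister(keys: List[str], page_size: int) -> Callable[[Optional[str]], dict]:
    """In-memory stand-in for the storage service, used only to exercise the loop."""
    def list_page(token: Optional[str]) -> dict:
        start = int(token) if token is not None else 0
        chunk = keys[start:start + page_size]
        end = start + len(chunk)
        page = {
            "Contents": [{"Key": k} for k in chunk],
            "IsTruncated": end < len(keys),
        }
        if page["IsTruncated"]:
            page["NextContinuationToken"] = str(end)
        return page
    return list_page
```

Calling `list(iter_all_keys(make_fake_lister(keys, 3)))` walks the fake listing three keys at a time. With a real client, `list_page` would wrap the SDK call instead, or you could rely on a paginator helper if your SDK version provides one.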

Azure Data Lake (ADL)

Microsoft recently launched ADL. We have done various POCs on ADL and found it easy to use. Configuration is especially straightforward when you pair it with HDInsight.

Pros
  • Scalability
  • Connectivity to HDInsight for processing. ADL is designed to work with both small and large files and works well with Hadoop.
  • Support: We found Microsoft support to be more responsive than that of AWS or Google.
Cons
  • Limited knowledge: because ADL is new, there is little community experience to draw on
  • Statistics refresh takes about 24 hours

Google Cloud Storage (GCS)

Use GCS if you are planning to use BigQuery.

Hadoop Distributed File System (HDFS)

HDFS is the primary on-premise option. It is highly reliable but comparatively tricky to manage. Tools such as Cloudera Manager and Hortonworks Ambari help maintain HDFS efficiently. Earlier, companies faced problems when they tried to upgrade HDFS or add a node. These issues have since been resolved, and HDFS has been stable from Hadoop 2.7.1 onwards.

Processing

Hadoop clusters

Hadoop has become synonymous with the data lake thanks to its vast presence and use cases across domains. It is a distributed processing framework for large datasets. We can deploy Hadoop on-premise or in the cloud. Hortonworks, Cloudera, and MapR provide distributions of the open-source Hadoop technology. AWS, Microsoft, and Google offer their own Hadoop services as EMR, HDInsight, and Dataproc, respectively. The cloud stacks are mostly elastic and built on each vendor's proprietary storage, while CDH and HDP are built on open-source HDFS.
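
One practical consequence of this split is that the storage behind a Hadoop job is usually visible in the path's URI scheme. The sketch below assumes the schemes used by common Hadoop file system connectors (`hdfs`, `s3a`, `adl`, `gs`). The mapping and the `storage_backend` helper are illustrative only, and the example paths are made up.

```python
from urllib.parse import urlparse

# Schemes used by common Hadoop file system connectors, mapped to the storage behind them.
STORAGE_BY_SCHEME = {
    "hdfs": "HDFS",
    "s3a": "Amazon S3",
    "adl": "Azure Data Lake",
    "gs": "Google Cloud Storage",
}


def storage_backend(path: str) -> str:
    """Return the storage system implied by a path's URI scheme, or "unknown"."""
    scheme = urlparse(path).scheme.lower()
    return STORAGE_BY_SCHEME.get(scheme, "unknown")


if __name__ == "__main__":
    # Hypothetical paths, shown only to demonstrate the lookup.
    for p in ("hdfs://namenode:8020/data/events", "s3a://example-bucket/events/"):
        print(p, "->", storage_backend(p))
```

A check like this can help when auditing job configurations to see how much of a workload depends on one vendor's storage.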

Spark clusters

Apache Spark provides a much faster engine for large-scale data processing by leveraging in-memory computing. It can run on Hadoop, on Mesos, in the cloud, or in a standalone environment to create a unified compute layer across the enterprise.

Tools and Languages

Hadoop and Spark are a new generation of processing layer, and they come with various tools and languages for processing data. Some of them are:

  • Hive
  • MapReduce
  • Oozie
  • Sqoop
  • NiFi
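
For readers new to MapReduce, the programming model can be sketched in a few lines of plain Python. This is not the Hadoop API. It is a single-process illustration of the map, shuffle, and reduce steps using the classic word-count example, and all function names are chosen here for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data across the network. Tools such as Hive generate this kind of job from a SQL-like query so that you rarely write it by hand.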
