Choosing the Technology Stack for a Data Lake
A data lake is a sophisticated technology stack that requires integrating numerous technologies for ingestion, processing, and exploration. There are no standard rules for security, governance, operations, and collaboration, which complicates matters further. On top of that, you have hard SLAs for query processing time and for data ingestion (ETL) pipelines. Lastly, the solution needs to scale from one user to thousands of users and from one kilobyte of data to a few petabytes.
Because the big data industry is changing rapidly, you need to select technology that is here to stay and robust enough to meet your SLAs. At OvalEdge, our objective is to give customers and prospective customers as much detail as possible about each solution so that they can decide which one best fits their specific needs.
Factors to consider for Technology Stack
A business must weigh several factors before selecting its technology stack. The table below lists those factors and how they fare across three types of infrastructure: On-Premise, Cloud, and Managed Services.
|Factor||On-Premise||Cloud||Managed Services|
|Monthly Cost||Economic with large datasets||Predictable||Predictable|
|Vendor Lock-in||Avoidable||Avoidable||Not Avoidable|
|Suitability||For large corporations||For all businesses||Ideal for startups|
|Investment||Substantial in the beginning||More as data grows||More as data grows|
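To make the Investment row concrete, here is a toy break-even sketch comparing a large upfront on-premise investment with cloud spending that grows with data volume. Every figure and function name in it is a hypothetical placeholder, not a quote from any vendor, and it ignores compute, staffing, and discounts.

```python
from typing import Optional


def on_premise_cost(months: int, upfront: float, monthly_ops: float) -> float:
    """Cumulative cost: a large initial investment plus flat running costs."""
    return upfront + monthly_ops * months


def cloud_cost(months: int, start_tb: float, growth_tb: float, price_per_tb: float) -> float:
    """Cumulative cost when each month is billed on the data held that month."""
    return sum((start_tb + growth_tb * m) * price_per_tb for m in range(months))


def break_even_month(
    upfront: float,
    monthly_ops: float,
    start_tb: float,
    growth_tb: float,
    price_per_tb: float,
    horizon: int = 120,
) -> Optional[int]:
    """First month in which cumulative cloud spend reaches on-premise spend."""
    for months in range(1, horizon + 1):
        cloud = cloud_cost(months, start_tb, growth_tb, price_per_tb)
        if cloud >= on_premise_cost(months, upfront, monthly_ops):
            return months
    return None  # no break-even within the horizon


if __name__ == "__main__":
    # All inputs below are made-up illustration values.
    print(break_even_month(upfront=1000, monthly_ops=10,
                           start_tb=10, growth_tb=10, price_per_tb=10))
```

Plugging in your own estimates shows how quickly "more as data grows" overtakes "substantial in the beginning" for a given growth rate.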
Storage
Storage options fall into two broad categories: cloud and on-premise. In the cloud, several companies offer managed services, including Amazon, Microsoft, and Google. On-premise, the primary option available is HDFS (Hadoop Distributed File System).
Amazon S3
Amazon S3 is the most used storage technology for data lakes in the cloud. That one-fourth of the world's data is reportedly stored on S3 speaks to its scalability. However, S3 has various other pros and cons.
Pros
- Vastly scalable
- Has enterprise features such as security, backup, and high availability (S3 Standard is designed for 99.99% availability and 99.999999999% durability)
Cons
- Security: S3 security is intricate to manage. Consider the recent example of the Alteryx vulnerability. Even technologically advanced companies find it difficult to manage S3 security, and the risk is high.
- Small files: when you have lots of small files and want to analyze them together, S3 doesn't perform optimally.
- API limitations: pagination is hard to do through the API, and the API changes so frequently that it is hard to keep up with the latest client release.
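On the pagination point, the usual pattern is to keep requesting pages until the service stops returning a continuation token. The sketch below shows that loop against the `list_objects_v2` call shape used by boto3 (`Bucket`, `Prefix`, `ContinuationToken` in; `Contents`, `IsTruncated`, `NextContinuationToken` out). `FakeS3Client` is a hypothetical in-memory stand-in, added only so the example runs without AWS credentials; `iter_keys` is likewise an illustrative name.

```python
from typing import Any, Dict, Iterator, List


def iter_keys(client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under `prefix`, following continuation tokens."""
    kwargs: Dict[str, Any] = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        # "Contents" can be missing when a page holds no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


class FakeS3Client:
    """Hypothetical stand-in that pages through an in-memory key list."""

    def __init__(self, keys: List[str], page_size: int = 2) -> None:
        self._keys = sorted(keys)
        self._page_size = page_size

    def list_objects_v2(
        self, Bucket: str, Prefix: str = "", ContinuationToken: str = "0"
    ) -> Dict[str, Any]:
        matching = [k for k in self._keys if k.startswith(Prefix)]
        start = int(ContinuationToken)
        end = start + self._page_size
        page: Dict[str, Any] = {
            "Contents": [{"Key": k} for k in matching[start:end]],
            "IsTruncated": end < len(matching),
        }
        if page["IsTruncated"]:
            page["NextContinuationToken"] = str(end)
        return page


if __name__ == "__main__":
    fake = FakeS3Client(["logs/a.json", "logs/b.json", "logs/c.json", "raw/x.csv"])
    print(list(iter_keys(fake, "example-bucket", prefix="logs/")))
```

With a real boto3 client the same function should work unchanged; boto3 also provides `client.get_paginator("list_objects_v2")`, which wraps this loop for you.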
Azure Data Lake (ADL)
Microsoft recently launched ADL. We have run several proofs of concept (POCs) on ADL and found it easy to use and configure, particularly in combination with HDInsight.
Pros
- Connectivity to HDInsight for processing. ADL is designed to work with both small and large files and works well with Hadoop.
- Support: we found Microsoft support to be more responsive than that of AWS or Google.
Cons
- Limited knowledge base, as the service is still new
- Statistics refresh takes about 24 hours
Google Cloud Storage (GCS)
Use GCS if you are planning to use BigQuery.
Hadoop Distributed File System (HDFS)
HDFS is the only on-premise option available. It is highly reliable but comparatively tricky to manage. Tools such as Cloudera Manager and Hortonworks Ambari help maintain HDFS efficiently. Earlier, companies ran into problems when they upgraded HDFS or added a node, but these issues have since been resolved, and HDFS has been stable from Hadoop 2.7.1 onwards.
Processing
Hadoop has become synonymous with the data lake through its widespread presence and use cases across domains. It is a framework for distributed processing of large datasets, and it can be deployed on-premise or in the cloud. Hortonworks, Cloudera, and MapR provide distributions of open-source Hadoop technology, while AWS, Microsoft, and Google offer their own Hadoop distributions as EMR, HDInsight, and Dataproc respectively. The cloud stacks are mostly elastic and built on each vendor's proprietary storage, whereas CDH (Cloudera) and HDP (Hortonworks) are built on open-source HDFS.
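As a rough illustration of the MapReduce programming model that underlies Hadoop's processing layer, the sketch below mimics the map, shuffle, and reduce phases in a single Python process. It is not Hadoop code: the function names are illustrative, and the "partitions" are plain lists standing in for file blocks that a real cluster would spread across nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(partitions: List[List[str]]) -> Dict[str, int]:
    """Run map over every partition, then shuffle and reduce the results."""
    mapped = [
        pair
        for partition in partitions
        for line in partition
        for pair in map_phase(line)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Two partitions standing in for blocks stored on different nodes.
    print(word_count([["data lake", "data"], ["lake storage"]]))
```

In a real cluster the map calls run in parallel next to the data, and the shuffle moves intermediate pairs over the network, which is where much of the processing time goes.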
Apache Spark provides a much faster engine than Hadoop MapReduce for large-scale data processing by leveraging in-memory computing. It can run on Hadoop, on Mesos, in the cloud, or in a standalone environment, creating a unified compute layer across the enterprise.
Tools and Languages
Because Hadoop and Spark form a new generation of processing layer, they offer a variety of tools and languages for processing data. Some of them are: