Hadoop Best Practices for Data Ingestion

Hadoop Best Practices for Data Ingestion

Hadoop Data ingestion is the beginning of your data pipeline in a data lake. It means taking data from various silo databases and files and putting it into Hadoop. Sounds arduous? For many companies, it does turn out to be an intricate task. That is why they take more than a year to ingest all their data into Hadoop data lake

Why does that happen? The reason is as Hadoop is an open source; there are a variety of ways you can ingest data into Hadoop. It gives every developer the choice of using her/his favorite tool or language to ingest data into Hadoop. Developers while choosing a tool/technology stress on performance, but this makes governance very complicated.

Start your Data Governance learning journey with comprehensive resources like lessons, best practices, and templates. Enter OvalEdge Academy

The Hadoop Distributed File System (HDFS)

Hadoop uses a distributed file system that is optimized for reading and writing of large files. When writing to HDFS, data are “sliced” and replicated across the servers in a Hadoop cluster.

The slicing process creates many small sub-units (blocks) of the larger file and  transparently writes them to the cluster nodes. The various slices can be processed in parallel (at the same time) enabling faster computation. The user does not see the file slices but interacts with the whole file. When transferring files out of HDFS, the slices are assembled and written as one file on the host file system.

Can Hadoop Data Ingestion be Made Simpler and Faster?

Definitely. For that, Hadoop architects need to start thinking about data ingestion from management’s point of view too. By adopting these best practices, you can import a variety of data within a week or two.

Moreover, the quicker we ingest data, the faster we can analyze it and glean insights. Please note here I am proposing only one methodology – which is robust, is widely available and performs optimally. The idea is to use these techniques so we can ingest all the data within few weeks, not months or years. Now, let’s have a look at how we import different objects:

File Ingestion

Ingestion of file is straightforward.  The optimal way is to import all the files into Hadoop or Data Lake, to load into Landing Server, and then use Hadoop CLI to ingest data. For loading files into landing server from a variety of sources, there is ample technology available.  Keep using what you are and just use Hadoop CLI to load the data into Hadoop, or Azure Data Lake, or S3 or GCS (Google Cloud Storage)

Database Ingestion

Now, this is a significant deal. I have seen companies using Sqoop (Variety of ways in Sqoop), Ni-Fi, and other tools to load the database into Hadoop. So here is my simple guide.

Database Dump

When data is smaller, like less than a million rows, you can afford to load this database dump on daily/hourly basis. Do not create change-data-capture for smaller tables. It would create more problems in Hadoop. The tables which have 100 million+ records, use multiple threads of Sqoop (-m) to load into Hadoop.

Change Data Capture

Do ‘Change Data Capture’ (CDC) only for the tables which are large ( at least 10M+). For CDC you can use either trigger on the source table ( I know DBAs don’t prefer that), or use some logging tool. These tools are proprietary for every database. Golden Gate for Oracle, SQL Server CDC, etc. Once you ingest CDC into Hadoop, you need to write Hive queries to merge these tables. You can also use OvalEdge time machine to process these transactions.

Streaming Ingestion

Data appearing on various IOT devices or log files can be ingested into Hadoop using open source Ni-Fi. I know there are multiple technologies (flume or streamsets etc.), but Ni-Fi is the best bet. After we know the technology, we also need to know that what we should do and what not.

Related: What is Unstructured Data and How to Process it on Hadoop?

The Dos and Don’ts of Hadoop Data Ingestion

  • Do not create CDC for smaller tables; this would create more problem at a later stage.

  • When you do a CDC, try to merge to main tables, not more than hourly. If you want to do every minute or so, you are doing something wrong. Keep is either on daily basis or hourly max.

  • Use Sqoop -m 1 option for smaller tables.

  • Use -queries option all the time, do not use -table option.

  • Directly load data into a managed table. Do not use external tables. Governing external tables is hard.

  • Do not import a BLOB or a CLOB (Character Large Object) field using Sqoop. If you need to do that, write some custom logic or use OvalEdge.

  • Import into hive table, where all the columns are String type. Now using additional transformation to convert this String to appropriate

  • Date/timestamp/double format. Or use OvalEdge to load with a single click.