Hadoop best practices for ETL

Hadoop, an open-source framework, has been around in the industry for quite some time. You will find a multitude of articles on how to use Hadoop for various data-transformation needs, features, and functionalities.

But do you know the Hadoop best practices for ETL? Remember that because Hadoop is an open-source project, it offers many features, but not all of them are the best option for new Hadoop ETL development. I have seen from experience that veteran database programmers carry Oracle/SQL Server techniques into Hive and ruin the performance of the ETL logic.

So why do these techniques exist if it is not in your best interest to use them? It comes down to Hadoop's open-source nature and competition in the industry. If a client is running a Teradata conversion project, they can save a significant amount simply by converting Teradata logic to Hive, and for the performance gain they don't mind paying for additional hardware. This is why the features of all the traditional databases exist in Hive.

The Dos and Don'ts

If you are writing new logic in Hadoop, use the methodology proposed below for Hadoop development.

Do not use Views

Views are great for transactional systems, where data changes frequently and a programmer can consolidate sophisticated logic into a view. When the source data is not changing, create a table instead of a view.
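Since the source data is static between loads, the logic you would normally hide behind a view can be materialized once per load cycle. A minimal HiveQL sketch (table and column names are hypothetical):

```sql
-- Instead of a view that re-runs the join on every query:
-- CREATE VIEW sales_enriched AS SELECT ... ;

-- Materialize the result once per load cycle:
CREATE TABLE sales_enriched AS
SELECT s.order_id, s.amount, c.region
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id;
```

Downstream queries then read precomputed data instead of re-executing the join every time they run.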

Do not use small partitions

In a transactional system, a simple way to reduce query time is to partition the data based on the query's WHERE clause. In Hadoop, however, scanning extra data in a mapper is far cheaper than starting and stopping a container, so use partitions only when each partition holds about a block of data or more (64/128 MB).
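In practice this means choosing a coarser partition key than you would in a traditional database. A hypothetical sketch:

```sql
-- Partition by month rather than by day if daily volume is well below
-- a block size (64/128 MB); fewer, larger partitions mean fewer
-- containers to start and stop.
CREATE TABLE web_logs (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
PARTITIONED BY (log_month STRING)
STORED AS ORC;
```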

Use ORC or Parquet File format

By changing the underlying file format to ORC or Parquet, we can get a significant performance gain.
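Columnar formats cut I/O and enable optimizations such as predicate pushdown. A sketch with a hypothetical table, using ORC:

```sql
-- Store the table in ORC with Snappy compression.
CREATE TABLE sales_orc (
  order_id BIGINT,
  amount   DECIMAL(10,2)
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Or convert an existing text-format table in one statement:
-- CREATE TABLE sales_orc STORED AS ORC AS SELECT * FROM sales_text;
```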

Avoid Sequential Programming (Phase Development)

We need to find ways to program in parallel and use Hadoop's processing power. The core idea is to break the logic into multiple phases and design them so that the independent steps can run in parallel.

Managed Table vs. External Table

Adopt a standard and stick to it, but I recommend using managed tables, as they are easier to govern. Use external tables when you are importing data from an external system: define a schema over the raw files, and from there the entire workflow can be built with Hive scripts on managed tables.
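The split might look like the following sketch, where the table names, columns, and paths are hypothetical:

```sql
-- External table over raw files landed by the ingest process;
-- dropping the table leaves the underlying files in place.
CREATE EXTERNAL TABLE raw_orders (
  order_id BIGINT,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/landing/orders';

-- Managed table for everything after ingestion; Hive owns its
-- data and lifecycle from this point on.
CREATE TABLE orders_clean STORED AS ORC AS
SELECT order_id, payload
FROM raw_orders
WHERE order_id IS NOT NULL;
```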

Pig vs. Hive

If you ask any Hadoop vendor (Cloudera, Hortonworks, etc.), you will not get a definitive answer, as they support both languages. Sometimes logic in Hive can be quite complicated compared to Pig, but I would still advise using Hive if possible, because we will need people in the future to maintain the code.

Very few people know Pig, and it has a steep learning curve. There is also far less investment happening in Pig compared to Hive: industries across the board use Hive, and companies like Hortonworks, AWS, and Microsoft are contributing to it.

Phase Development

So how do you get the right outcome without using views? By using phase development: persist processed data into staging tables at each phase, then keep refining it to obtain the final result.
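The pipeline above can be sketched as a chain of staging tables, each phase reading the previous one (all names here are hypothetical):

```sql
-- Phase 1: cleanse raw input into a staging table.
CREATE TABLE stg_orders AS
SELECT order_id, customer_id, amount
FROM raw_orders
WHERE amount > 0;

-- Phase 2: aggregate the staged data.
CREATE TABLE agg_orders AS
SELECT customer_id, SUM(amount) AS total_amount
FROM stg_orders
GROUP BY customer_id;

-- Phase 3: join with reference data to produce the final table.
CREATE TABLE final_report AS
SELECT a.customer_id, c.region, a.total_amount
FROM agg_orders a
JOIN customers c ON a.customer_id = c.customer_id;
```

Independent phases (for example, two cleansing steps over different sources) can run at the same time, which is where the parallelism comes from.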

Basic principles of Hadoop programming

1. Storage is cheap while processing is expensive.

2. You cannot complete a job in sub-second time; it will take far longer, usually a few seconds to minutes.

3. It’s not a transactional system; source data does not change until we import it again.