A step-by-step guide to build a data catalog

A step-by-step guide to build a data catalog

What is a data catalog?

A data catalog is a systematic list of all data assets in the form of tables, files, reports, etc. that exist in a company's various source systems. It works a lot like a fashion catalog. But instead of detailing swimsuits or shoes, it has information from a company’s ERP, HR, Finance, and E-commerce systems (as well as social media feeds). The catalog also shows where all the data entities are located.

A data catalog contains lots of critical information about each piece of data, such as the data’s profile (statistics or informative summaries about the data), lineage (how the data is generated), and what others say about it. The catalog is the go-to spot for data analysts, data stewards, data scientists,  and others to find and understand relevant datasets to build insights, discover trends, and identify new products for the company.

Why does a company need a data catalog?

According to Forrester Research, only 14% of business stakeholders make thorough use of customer insights. That’s because most companies don’t have access to their data.
Often, that happens due to the existence of many sources of data that cannot be easily mined. Just consider a typical data stack for a company:
A Typical Data Stack of a Company

A data catalog solves this problem, as well as others. It makes data easy to find. It gives a comprehensive view of each piece of data across the databases and identifies relationships. Plus, it forms a foundation for an organization's data governance program.

Read more on How to Build a Business Case for Data Governance

The step-by-step process of building a data catalog

Step1: Accessing metadata of all databases

The first step for building a data catalog is collecting the data’s metadata. Data catalogs use metadata to identify the data tables, files, and databases. The catalog crawls the company’s databases and brings the metadata (not the actual data) to the data catalog. Some data catalogs have restrictions about the types of databases they can crawl. OvalEdge doesn’t. It can crawl:

Data Management Platforms

  • Relational Databases - Oracle, SQL Server, MySQL, DB2, etc.
  • Data Warehouses - Teradata, Vertica, etc.
  • Object Storage
  • Cloud Platforms - Google Big Query, MS Azure Data Lake, AWS – Athena & Red Shift
  • Non-Relational / NoSQL Databases- Cassandra, MongoDB
  • Hadoop Distributions

Analytics and Business Intelligence Platforms

  • Modern Business Intelligence Platforms like Tableau, Qlik, Power BI, Sisense, and more
  • Analytic Applications

Custom Applications

Step2: Building a data dictionary

The second step is to build a data dictionary or upload an existing one into the data catalog. A data dictionary contains the description and Wiki of every table or file and all their metadata entities. Employees can collaborate to create a data dictionary through web-based software or use an excel spreadsheet.
An example of a data dictionary


Step 3: Profiling to see the data statistics

The next step is to profile the data to help data consumers view and understand the data quickly. These profiles are informative summaries that explain the data. For example, the profile of a database often includes the number of tables, files, row counts, etc. For a table, the profile may include column description, top values in a column, null count of a column, distinct count, maximum value, minimum value, and much more.
An example of a table profile in the data catalog, OvalEdge

Step 4: Marking relationship amongst data

Marking relationships is the next vital step. Through this step, data consumers can discover related data across multiple databases. For example, an analyst may need consolidated customer information. Through the data catalog, she finds that five files in five different systems have customer data. With the OvalEdge data catalog and the help of IT, one can have an experimental area where you can join all the data, clean it. Then use that consolidated customer data to achieve your business goals.
An example of relationships marked for the Table ‘Accounts’

The many ways relationships can be marked

Data relationships can be identified and marked amongst data in several ways:
  1. Through human knowledge
  2. An advanced algorithm can find out the connection. For example, in a table under the column name ‘vendor,’ you have an entity ‘Amazon.’ In another table, for column V1 you also have an entity ‘Amazon.’ The algorithm will figure out that there is a relationship between these tables.
  3. Gathering intelligence from queries – Some developers know the connection between different data sets from experience. Their query logs can be examined and then parsed to mark relationships.

Step 5: Building lineage

After marking relationships, a Data Catalog builds lineage. A visual representation of data lineage helps to track data from its origin to its destination. It explains the different processes involved in the data flow. Hence, it enables the analyst to trace errors back to the root cause in the analytics. Generally, ETL ( Extract, Transfer, Load) tools are used to extract data from source databases, transform and cleanse the data and load it into a target database. A data catalog parses these tools to create the lineage. Some of the ETL tools which can be parsed are –
  1. SQL Parsing
  2. Alteryx
  3. Informatica
  4. Talend

Step 6: Organizing data

In a table/file data is arranged in a technical format and not in a way to make the most sense to a business user. So we need human collaboration on data assets so that they can be discovered, accessed, and trusted by business users. Below are a few techniques by which we can arrange data for easy discovery –
  1. Tagging
  2. Organizing by the amount of usage
  3. Organizing by specific users usage
  4. Through automation – Sometimes when there is a large amount of data we can use advanced algorithms to organize data.

Making the catalog easily accessible

For high usage by data consumers, a data catalog should be easily accessible by the web app, mobile app, Android, and IOS apps. It should have chat features.

Ensure due security and governance

Since a data catalog houses all the data, it’s critical that there are guardrails protecting the most sensitive files. A data catalog should have the following features to ensure that only the right eyes see the data –
  1. Role-based security
  2. Information that who accessed what data at what time
  3. Auditing
  4. Encryption

Estimated time to build a data catalog

The time it takes to build a data catalog depends on the number of databases to be cataloged. Crawling and profiling can be done in a day or two. Uploading an existing data dictionary, building lineage,
and marking relationships for a database can be done in one week with the help of the database’s stakeholder. If an organization has to catalog ten databases, it may take four to five weeks. A big corporation can build its data catalog in about three months. A medium-sized company can build one in two to four weeks. Data catalogs are the new dynamic and agile tools needed by today’s data-driven organizations. They serve as a single source of reference for all your data needs. Static metadata repositories requiring expert IT users and loads of manual curation are a thing of the past.
TRY OVALEDGE To try OvalEdge right now, just click and register. 

Find your edge now. See how OvalEdge works.