Here is the step-by-step process of building a data catalog.
The first step for building a data catalog is collecting the data’s metadata. The catalog crawls the company’s databases and brings the metadata (not the actual data) to the data catalog. Data catalogs then use this metadata to identify the data tables, the columns of the tables, files, and databases.
The next step is to profile the data to help data consumers view and understand the data quickly. These profiles are informative summaries that explain the data. For example, the profile of a database often includes the number of tables, files, row counts, etc. For a table, the profile may include column description, top values in a column, null count of a column, distinct count, maximum value, minimum value, and much more.
The third step is to build a business glossary or upload an existing one into the data catalog. A business glossary is an enterprise-wide document created to improve business understanding of the data. It enables data stewards to build and manage a common business vocabulary. This vocabulary can be linked to the underlying technical metadata to associate business terms with objects.
A business glossary can have multiple data dictionaries attached to it. A data dictionary is more technical in nature and tends to be system-specific. It contains the description and Wiki of every table or file and all their metadata entities. Employees can collaborate to create a business glossary through web-based software or use an excel spreadsheet.
Marking relationships is the next vital step. Through this step, data consumers can discover related data across multiple databases. For example, an analyst may need consolidated customer information. Through the data catalog, she finds that five files in five different systems have customer data.
With a data catalog and the help of IT, one can have an experimental area where you can join all the data, clean it, and then use that consolidated customer data to achieve your business goals.
After marking relationships, a Data Catalog builds lineage. A visual representation of data lineage helps to track data from its origin to its destination. It explains the different processes involved in the data flow. Hence, it enables the analyst to trace errors back to the root cause in the analytics.
Generally, ETL (Extract, Transfer, Load) tools are used to extract data from source databases, transform and cleanse the data and load it into a target database. A data catalog parses these tools to create the lineage. Some of the ETL tools which can be parsed are:
In a table/file data is arranged in a technical format and not in a way to make the most sense to a business user. So we need human collaboration on data assets so that they can be discovered, accessed, and trusted by business users. Below are a few techniques by which we can arrange data for easy discovery: