A data catalog is a metadata repository that helps companies organize and find data stored across their many systems. It works like a library catalog, but instead of detailing books and journals, it describes tables, files, and databases. This information comes from a company’s ERP, HR, finance, and e-commerce systems, as well as social media feeds. The catalog also shows where each data entity is located. For every piece of data, it records critical information such as the data’s profile (statistics or informative summaries about the data), its lineage (how the data is generated), and what other users say about it. A catalog is the go-to spot for data scientists, business analysts, data engineers, and others who are trying to find data to build insights, discover trends, and identify new products for the company.

A data catalog works differently from a data lake. Both are central repositories, but a data lake requires you to move all the data into one storage technology. For example, if the data lake is in S3, you must move all the data to S3. This can become very expensive and is only practical for certain use cases. A data catalog, on the other hand, holds only the metadata and the data’s whereabouts, so users can go straight to wherever the data actually lives.
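To make the idea concrete, here is a minimal, hypothetical sketch of what a single catalog entry might hold. The CatalogEntry class and its field names are illustrative assumptions, not a description of any particular product; the point is that the entry carries metadata and a location, never the data itself.

# A hypothetical sketch of one catalog entry: metadata and location only.
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str                     # e.g. "customers"
    source_system: str            # e.g. "sales_db (PostgreSQL)"
    location: str                 # where the data actually lives
    profile: dict = field(default_factory=dict)    # row counts, null counts, ...
    lineage: list = field(default_factory=list)    # upstream datasets
    comments: list = field(default_factory=list)   # what other users say about it

entry = CatalogEntry(
    name="customers",
    source_system="sales_db (PostgreSQL)",
    location="postgresql://sales-host/sales_db/public/customers",
)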
A company collects and stores an abundance of data, yet much of it remains inaccessible and never gets analyzed. The plan to make data-driven business decisions is hindered at the very beginning. Here are some numbers about the current use of data: only 0.5% of all data is ever analyzed [1]; only 14% of business stakeholders make thorough use of customer insights [2]; and organizations that leverage customer behavioral insights outperform peers by 85% in sales growth and more than 25% in gross margin [3].
A data catalog is built on metadata, which it uses to identify tables, files, and databases. The catalog crawls the company’s databases and brings the metadata (not the actual data) into the data catalog.
By looking at a data profile, consumers can view and understand the data quickly. These profiles are informative summaries that explain the data. For example, the profile of a database often includes the number of tables and files, row counts, and so on. For a table, the profile may include column descriptions, the top values in a column, its null count, distinct count, maximum value, minimum value, and much more.
Data lineage is a visual representation of where the data comes from, where it moves, and what transformations it undergoes over time. It provides the ability to track, manage, and view data transformations along the path from source to destination. Hence, it enables an analyst to trace errors in analytics back to their root cause.
Relationships let data consumers discover related data across multiple databases. For example, an analyst may need consolidated customer information. Through the data catalog, she finds that five files in five different systems hold customer data. With the data catalog and the help of IT, she can set up an experimental area where all of that data is joined and cleaned, and then use the consolidated customer data to meet her business goals.
A data catalog is an apt platform to host a business glossary and make it available across an organization. A business glossary is a document that enables data stewards to build and manage a common business vocabulary. This vocabulary can be linked to the underlying technical metadata to provide a direct association between business terms and technical objects.
The primary benefit of a data catalog is that it acts as a single source of reference for all of an organization’s data needs. OvalEdge catalogs the data held in various databases, file systems, and visualization tools. It creates a knowledge base of that data through human curation combined with data, machine learning, and code-parsing algorithms.
Here is the step-by-step process of building a data catalog.
The first step in building a data catalog is collecting the data’s metadata. The catalog crawls the company’s databases and brings the metadata (not the actual data) into the data catalog. It then uses this metadata to identify the tables, the columns of those tables, the files, and the databases.
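As an illustration, a minimal metadata crawl might look like the sketch below. It uses SQLAlchemy’s inspector; the connection string and the shape of the collected records are assumptions made for this example, not a description of how any specific catalog stores them.

# A minimal sketch of metadata crawling with SQLAlchemy.
# The connection string and record layout are hypothetical.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@host/sales_db")  # hypothetical source
inspector = inspect(engine)

catalog_metadata = []
for schema in inspector.get_schema_names():
    for table in inspector.get_table_names(schema=schema):
        columns = inspector.get_columns(table, schema=schema)
        # Only metadata (names and types) is collected, never the rows themselves.
        catalog_metadata.append({
            "schema": schema,
            "table": table,
            "columns": [{"name": c["name"], "type": str(c["type"])} for c in columns],
        })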
The next step is to profile the data to help data consumers view and understand it quickly. These profiles are informative summaries that explain the data. For example, the profile of a database often includes the number of tables and files, row counts, and so on. For a table, the profile may include column descriptions, the top values in a column, its null count, distinct count, maximum value, minimum value, and much more.
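A rough sketch of such column-level profiling with pandas is shown below. The table name is hypothetical, and in practice a catalog would push these statistics down to the source database as SQL rather than pull rows out; loading the table here is only for brevity.

# A rough illustration of column profiling; table and schema names are made up.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host/sales_db")   # hypothetical source
df = pd.read_sql_table("customers", engine, schema="public")     # hypothetical table

profile = {}
for col in df.columns:
    series = df[col]
    profile[col] = {
        "null_count": int(series.isna().sum()),
        "distinct_count": int(series.nunique()),
        "top_values": series.value_counts().head(5).to_dict(),
        "min": series.min() if pd.api.types.is_numeric_dtype(series) else None,
        "max": series.max() if pd.api.types.is_numeric_dtype(series) else None,
    }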
The third step is to build a business glossary or upload an existing one into the data catalog. A business glossary is an enterprise-wide document created to improve business understanding of the data. It enables data stewards to build and manage a common business vocabulary, which can be linked to the underlying technical metadata to associate business terms with technical objects. A business glossary can have multiple data dictionaries attached to it. A data dictionary is more technical in nature and tends to be system specific: it contains the description and wiki of every table or file and all of their metadata entities. Employees can collaborate on the business glossary through web-based software or maintain it in an Excel spreadsheet.
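The sketch below shows one simple way a glossary term could be linked to technical metadata. The term, the object paths, and the helper function are all hypothetical; the point is only that a business term resolves to concrete tables and columns.

# A hypothetical link between a business term and technical objects.
glossary = {
    "Customer Lifetime Value": {
        "definition": "Projected revenue from a customer over the relationship.",
        "steward": "finance-team",
        "linked_objects": ["sales_db.public.customers.clv",
                           "warehouse.marts.customer_metrics.ltv"],
    },
}

def objects_for_term(term: str) -> list[str]:
    """Return the technical objects associated with a glossary term."""
    return glossary.get(term, {}).get("linked_objects", [])

print(objects_for_term("Customer Lifetime Value"))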
Marking relationships is the next vital step. Through it, data consumers can discover related data across multiple databases. For example, an analyst may need consolidated customer information. Through the data catalog, she finds that five files in five different systems hold customer data. With the data catalog and the help of IT, she can set up an experimental area where all of that data is joined and cleaned, and then use the consolidated customer data to meet her business goals.
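One naive way to surface candidate relationships is to flag columns in different tables that share a name and type, as in the sketch below. Real catalogs also compare value overlap and apply machine learning, so this is only an illustration; the sample metadata is made up and mirrors the crawling sketch above.

# A simplified illustration of relationship discovery.
from collections import defaultdict

# A tiny, made-up slice of crawled metadata (see the crawling sketch above).
catalog_metadata = [
    {"schema": "crm", "table": "contacts",
     "columns": [{"name": "customer_id", "type": "INTEGER"},
                 {"name": "email", "type": "VARCHAR"}]},
    {"schema": "billing", "table": "invoices",
     "columns": [{"name": "customer_id", "type": "INTEGER"},
                 {"name": "amount", "type": "NUMERIC"}]},
]

# Columns in different tables sharing a name and type become candidate join keys.
candidates = defaultdict(list)
for entry in catalog_metadata:
    for col in entry["columns"]:
        candidates[(col["name"].lower(), col["type"])].append(
            f'{entry["schema"]}.{entry["table"]}')

related = {key: tables for key, tables in candidates.items() if len(tables) > 1}
print(related)  # {('customer_id', 'INTEGER'): ['crm.contacts', 'billing.invoices']}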
After marking relationships, a data catalog builds lineage. A visual representation of data lineage helps track data from its origin to its destination and explains the different processes involved in the data flow. Hence, it enables an analyst to trace errors in analytics back to their root cause. Generally, ETL (Extract, Transform, Load) tools are used to extract data from source databases, transform and cleanse the data, and load it into a target database. A data catalog parses these ETL tools to create the lineage, and several widely used ETL tools can be parsed this way.
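Once ETL jobs have been parsed into edges, lineage can be represented as a simple graph that is walked backwards from any dataset to its root sources. The table names and edge structure below are made up purely for illustration.

# A toy lineage graph; in a real catalog these edges come from parsed ETL jobs.
lineage_edges = {
    "warehouse.daily_revenue": ["staging.orders", "staging.refunds"],
    "staging.orders": ["sales_db.orders"],
    "staging.refunds": ["sales_db.refunds"],
}

def upstream_sources(dataset: str) -> set[str]:
    """Walk the lineage graph back to the original source tables."""
    parents = lineage_edges.get(dataset, [])
    if not parents:
        return {dataset}              # a root source with no upstream dependencies
    sources = set()
    for parent in parents:
        sources |= upstream_sources(parent)
    return sources

print(upstream_sources("warehouse.daily_revenue"))
# {'sales_db.orders', 'sales_db.refunds'}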
In a table or file, data is arranged in a technical format rather than in a way that makes the most sense to a business user. Human collaboration on data assets is therefore needed so that they can be discovered, accessed, and trusted by business users. A few curation techniques can arrange data for easy discovery, as sketched below.
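As one example of such curation, the sketch below attaches business-friendly names, descriptions, tags, and an owner to technical objects so that a business user can search for them in plain language. Every name and field in it is hypothetical.

# A hypothetical curation layer over technical objects.
curation = {
    "sales_db.public.customers": {
        "display_name": "Customer Master",
        "description": "One row per customer, updated nightly from the CRM.",
        "tags": ["customer", "pii", "gold"],
        "owner": "data-stewards@example.com",
    },
}

def search(term: str) -> list[str]:
    """Find objects whose curated name, description, or tags match a term."""
    term = term.lower()
    return [
        obj for obj, meta in curation.items()
        if term in meta["display_name"].lower()
        or term in meta["description"].lower()
        or term in (t.lower() for t in meta["tags"])
    ]

print(search("customer"))  # ['sales_db.public.customers']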