Thursday, November 27, 2014

What is Data Profiling?

Data profiling is a critical input to any database initiative that incorporates source data from external systems, and to data quality assurance for existing systems. Whether it is a completely new database build, a one-off file import, an integration between systems, or simply an enhancement to an existing system, data profiling is a key analysis step in the overall design. Allocating sufficient time and resources to conduct a thorough data profiling assessment will help architects design a better solution and reduce project risk by quickly identifying and addressing potential data issues.

Data profiling is the statistical analysis and assessment of the data values within a data set for consistency, uniqueness, and logic. The insight gained by data profiling can be used to determine whether the data is fit for purpose: acceptable for a specific use, suitable for import into an existing application or database, or ready for export to another provider.

It can also be used to assess data quality via a series of metrics and to determine whether the actual values conform to, and support the intent of, the metadata in the source data. Data profiling cannot directly measure accuracy; however, by exploring profiling results -- for example, drilling down into the specific records that fit a profile -- an analyst can judge whether information is inaccurate. The profiling process itself can only surface business rule exceptions and anomalies in the data.

Profiling tools evaluate the actual content, structure and quality of the data by exploring relationships that exist between value collections both within and across data sets. For example, by examining the frequency distribution of different values for each column in a database table, an analyst can gain insight into the type and use of each column. Cross-column analysis can be used to expose embedded value dependencies and inter-table analysis allows the analyst to discover overlapping value sets that represent relationships between entities.
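As a minimal sketch of the frequency-distribution idea, the example below uses Python's built-in sqlite3 module with a hypothetical `donors` table and `state` column. A simple GROUP BY makes inconsistencies -- mixed-case values, NULLs -- stand out immediately.

```python
import sqlite3

# Hypothetical sample data; in practice this would be a loaded extract.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donors (id INTEGER, state TEXT)")
conn.executemany("INSERT INTO donors VALUES (?, ?)",
                 [(1, "NY"), (2, "NY"), (3, "PA"), (4, "ny"), (5, None)])

# GROUP BY exposes the frequency distribution of the column. Note that
# "ny" vs "NY" and the NULL row each show up as separate groups.
for value, count in conn.execute(
        "SELECT state, COUNT(*) FROM donors GROUP BY state ORDER BY 2 DESC"):
    print(value, count)
```

Scanning a distribution like this for each column is often the fastest way to learn how a column is actually used, regardless of what its name suggests.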
  
Data profiling is best scheduled prior to system design, typically occurring during the discovery or analysis phase. The first step -- and also a critical dependency -- is to clearly identify the appropriate person to provide the source data and also serve as the “go to” resource for follow-up questions. Once you receive source data extracts, you’re ready to prepare the data for profiling. Loading data extracts into a database will allow you to freely leverage SQL to query the data while also giving you the flexibility to leverage existing profiling tools if needed.
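Staging an extract can be as simple as the sketch below (Python with sqlite3; the file contents and column names are invented for illustration). Loading every column as TEXT is a deliberate choice here: profiling should see the raw values before any type coercion hides problems.

```python
import csv
import io
import sqlite3

# Hypothetical source extract -- in practice this would be an open file.
extract = io.StringIO("id,name,amount\n1,Alice,100\n2,Bob,\n3,,50\n")

conn = sqlite3.connect(":memory:")
# Stage everything as TEXT so blanks and malformed values survive intact.
conn.execute("CREATE TABLE staging (id TEXT, name TEXT, amount TEXT)")

reader = csv.reader(extract)
next(reader)  # skip the header row
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", reader)

# Once staged, the extract can be profiled freely with SQL.
print(conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0])
```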

When creating a data profile, it is best to start with basic column-level analysis such as:
  • Distinct record count and percent: Finding the number of distinct records, when compared to the total number of records, can help identify any potential unique keys within a dataset.
  • NULL count and percent: Counting null, blank, or zero values (for some numeric columns) can help identify records that fail to meet a minimum completeness standard or that violate business rules.
  • String analysis: Finding the minimum, maximum, and average string lengths can help identify any column-level constraint violations and help design databases to accommodate string lengths.
  • Numerical and date range analysis: Finding the minimum and maximum numerical and date values is helpful to identify the appropriate data types and identify any business rule violations.
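The basic column-level checks above can be combined into a single pass of SQL. The sketch below (sqlite3 again, with a hypothetical `contacts` table) returns the total count, distinct count, null-or-blank count, and min/max string lengths for one column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (email TEXT)")
conn.executemany("INSERT INTO contacts VALUES (?)",
                 [("a@x.com",), ("b@x.com",), ("a@x.com",), (None,), ("",)])

# One query covers distinct count, NULL/blank count, and string lengths.
# COUNT(DISTINCT ...) ignores NULLs, so blanks and NULLs are counted
# separately via the CASE expression.
row = conn.execute("""
    SELECT COUNT(*)              AS total,
           COUNT(DISTINCT email) AS distinct_vals,
           SUM(CASE WHEN email IS NULL OR email = '' THEN 1 ELSE 0 END)
                                 AS null_or_blank,
           MIN(LENGTH(email))    AS min_len,
           MAX(LENGTH(email))    AS max_len
    FROM contacts
""").fetchone()
print(row)  # (5, 3, 2, 0, 7)
```

Comparing the distinct count (3) against the total (5) immediately shows the column cannot be a unique key as-is, and the min length of 0 flags the blank value.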

 With the basic data profiling taken care of, more advanced analysis can be implemented, such as:
  • Key integrity: Checking to be sure that any unique keys don’t have zero, blank, or null values along with identifying any orphaned keys can help eliminate any key issues.
  • Cardinality: Identifying any one-to-one, one-to-many, and many-to-many relationships between the data sets is important for data modeling and BI tool set-up.
  • Pattern, frequency distributions, and domain analysis: Pattern identification can be useful to check if data fields are formatted correctly.
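As an example of the key-integrity check, the sketch below (hypothetical `donors` and `gifts` tables in sqlite3) uses a LEFT JOIN back to the parent table: child rows with no match are orphans, and NULL foreign keys surface in the same pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE donors (donor_id INTEGER);
    CREATE TABLE gifts  (gift_id INTEGER, donor_id INTEGER);
    INSERT INTO donors VALUES (1), (2);
    INSERT INTO gifts  VALUES (10, 1), (11, 2), (12, 3), (13, NULL);
""")

# LEFT JOIN keeps every gift; where the matching donor is missing,
# d.donor_id comes back NULL -- those rows are the integrity problems.
orphans = conn.execute("""
    SELECT g.gift_id, g.donor_id
    FROM gifts g
    LEFT JOIN donors d ON g.donor_id = d.donor_id
    WHERE d.donor_id IS NULL
""").fetchall()
print(orphans)
```

Here gift 12 references a donor that does not exist, and gift 13 has no key at all; both would need to be resolved before the relationship could be trusted in a model.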

Selecting the right profiling technique will depend on a project’s objectives. If you’re building a new database from scratch, take the time to execute on each of the above bullet points. If you’re simply importing a new data set into an existing database, select the most applicable tasks that apply to your source data.

Nearly all of these tasks can be conducted by writing raw SQL, and the basic ones can usually be accomplished with a tool designed specifically for data profiling; many such tools have been introduced into the marketplace over the last several years.

If your organization wants to use data profiling to improve data quality in an existing database or help design a system integration then shoot us an email. We would love to hear from you!

UPDATE:  Our beta version of our new tool to profile data within The Raiser's Edge has been released and it can be found on the Tucamino Solutions Website.

