Data profiling is a critical early task in any database initiative that incorporates source data from external systems or implements data quality assurance for existing systems. Whether the project is a completely new database build, a one-off file import, an integration between systems, or simply an enhancement to an existing system, data profiling is a key analysis step in the overall design. Allocating sufficient time and resources to a thorough data profiling assessment helps architects design a better solution and reduces project risk by quickly identifying and addressing potential data issues.
Data profiling is the statistical analysis and assessment of the quality of data values within a data set for consistency, uniqueness, and logic. The insight gained can be used to determine whether the data is fit for purpose: acceptable for a specific use, suitable for import into an existing application or database, or ready for export to another provider. Profiling can also assess data quality via a series of metrics and determine whether the actual values conform to, and support the intent of, the metadata in the source data. Data profiling cannot, by itself, measure whether data is accurate; the process identifies business rule exceptions and anomalies. However, by exploring profiling results with features such as drilling down into the specific records that fit a profile, an analyst can judge that particular information is inaccurate.
Profiling tools evaluate the actual content, structure, and quality of the data by exploring relationships between value collections both within and across data sets. For example, by examining the frequency distribution of values in each column of a database table, an analyst can gain insight into the type and use of that column. Cross-column analysis can expose embedded value dependencies, and inter-table analysis allows the analyst to discover overlapping value sets that represent relationships between entities.
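As a minimal sketch of the frequency distribution analysis described above, assuming the extract has been loaded into SQLite (the `donors` table and its values here are hypothetical), a simple GROUP BY surfaces how often each value occurs in a column:

```python
import sqlite3

# Hypothetical in-memory table standing in for a loaded source extract.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE donors (id INTEGER, state TEXT)")
conn.executemany("INSERT INTO donors VALUES (?, ?)",
                 [(1, "NY"), (2, "NY"), (3, "CA"), (4, "NY"), (5, None)])

# Frequency distribution of a single column: how often each value occurs.
rows = conn.execute("""
    SELECT state, COUNT(*) AS freq
    FROM donors
    GROUP BY state
    ORDER BY freq DESC
""").fetchall()
for value, freq in rows:
    print(value, freq)   # "NY" appears first with a count of 3
```

A skewed distribution, an unexpected value, or a large NULL group in output like this is often the first hint of a data quality issue.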
Data profiling is best scheduled before system design, typically during the discovery or analysis phase. The first step -- and also a critical dependency -- is to clearly identify the appropriate person to provide the source data and serve as the “go to” resource for follow-up questions. Once you receive source data extracts, you’re ready to prepare the data for profiling. Loading the extracts into a database lets you freely query the data with SQL while retaining the flexibility to use existing profiling tools if needed.
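As an illustration of that preparation step, here is a minimal sketch that loads a CSV extract into an in-memory SQLite database (the `gifts` table and sample rows are hypothetical). Loading every column as TEXT keeps the raw values intact, which is what you want when the point is to profile the data as delivered:

```python
import csv
import io
import sqlite3

# Hypothetical CSV extract; in practice this would be a file from the source system.
extract = io.StringIO("id,amount,gift_date\n1,50.00,2021-03-01\n2,,2021-03-15\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gifts (id TEXT, amount TEXT, gift_date TEXT)")

reader = csv.reader(extract)
next(reader)  # skip the header row
conn.executemany("INSERT INTO gifts VALUES (?, ?, ?)", reader)

# The extract is now queryable with ordinary SQL.
count = conn.execute("SELECT COUNT(*) FROM gifts").fetchone()[0]
print(count)  # 2
```

For real extracts you would read from a file path instead of an in-memory string, but the pattern is the same: land the raw data first, then profile it with queries.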
When
creating a data profile, it is best to start with basic column-level analysis
such as:
- Distinct record count and percent: Finding the number of distinct records, when compared to the total number of records, can help identify any potential unique keys within a dataset.
- NULL count and percent: Finding the number of null, blank, or zero values (for some numeric columns) can help identify records that fail to meet a minimum completeness standard or violate business rules.
- String analysis: Finding the minimum, maximum, and average string lengths can help identify any column-level constraint violations and help design databases to accommodate string lengths.
- Numerical and date range analysis: Finding the minimum and maximum numerical and date values is helpful to identify the appropriate data types and identify any business rule violations.
- Key integrity: Checking to be sure that any unique keys don’t have zero, blank, or null values along with identifying any orphaned keys can help eliminate any key issues.
- Cardinality: Identifying any one-to-one, one-to-many, or many-to-many relationships between the data sets is important for data modeling and BI tool set-up.
- Pattern, frequency distributions, and domain analysis: Pattern identification can be useful to check whether data fields are formatted correctly.
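Several of the column-level checks above can be sketched with plain SQL. Assuming the extract has been loaded into SQLite (the `gifts` table and its sample rows here are hypothetical):

```python
import sqlite3

# Hypothetical sample table standing in for a loaded source extract.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gifts (id INTEGER, donor TEXT, amount REAL, gift_date TEXT)")
conn.executemany("INSERT INTO gifts VALUES (?, ?, ?, ?)", [
    (1, "Ada",   50.0, "2021-01-05"),
    (2, "Grace", 25.0, "2021-02-10"),
    (3, None,    25.0, "2021-03-15"),
])

# Distinct count and percent for a candidate key column.
total, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT id) FROM gifts").fetchone()
print(distinct, round(100.0 * distinct / total, 1))   # 3 100.0

# NULL count for a column.
nulls = conn.execute(
    "SELECT COUNT(*) FROM gifts WHERE donor IS NULL").fetchone()[0]
print(nulls)                                          # 1

# String length profile (NULLs are ignored by MIN/MAX).
lengths = conn.execute(
    "SELECT MIN(LENGTH(donor)), MAX(LENGTH(donor)) FROM gifts").fetchone()
print(lengths)                                        # (3, 5)

# Numeric and date range analysis.
ranges = conn.execute(
    "SELECT MIN(amount), MAX(amount), MIN(gift_date), MAX(gift_date) FROM gifts"
).fetchone()
print(ranges)                       # (25.0, 50.0, '2021-01-05', '2021-03-15')
```

A distinct percent of 100 flags `id` as a candidate unique key; the NULL count, length profile, and ranges feed directly into constraint and data type decisions.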
Selecting the right profiling technique depends on a project’s objectives. If you’re building a new database from scratch, take the time to perform each of the analyses above. If you’re simply importing a new data set into an existing database, select the tasks most applicable to your source data.
Nearly all of these tasks can be conducted by writing raw SQL, and the basic profiling tasks can usually also be accomplished with a tool designed specifically for data profiling; many such tools have entered the marketplace over the last several years.
If your organization wants to use data profiling to improve data quality in an existing database or help design a system integration then shoot us an email. We would love to hear from you!
UPDATE: Our beta version of our new tool to profile data within The Raiser's Edge has been released and it can be found on the Tucamino Solutions Website.