Thursday, November 27, 2014

What is Data Profiling?

Data profiling is a critical input to any database initiative that incorporates source data from external systems, as well as to data quality assurance efforts on existing systems. Whether it is a completely new database build, a one-off file import, an integration between systems, or simply an enhancement to an existing system, data profiling is a key analysis step in the overall design. Allocating sufficient time and resources to conduct a thorough data profiling assessment will help architects design a better solution and reduce project risk by quickly identifying and addressing potential data issues.

Data profiling is the statistical analysis and assessment of the quality of data values within a data set for consistency, uniqueness, and logic. The insight gained by data profiling can be used to determine whether the data is fit for purpose: whether it is acceptable for a specific use, suitable to be imported into an existing application or database, or ready to be exported for use by another provider.

It can also be used to assess data quality via a series of metrics and to determine whether the actual values conform to and support the intent of the metadata in the source data. Data profiling cannot, by itself, measure whether data is accurate; the process can only identify business rule exceptions and anomalies. However, by exploring profiling results, for example by drilling down into the specific records that fit a profile, an analyst can make the judgment that information is inaccurate.

Profiling tools evaluate the actual content, structure, and quality of the data by exploring relationships that exist between value collections both within and across data sets. For example, by examining the frequency distribution of the values in each column of a database table, an analyst can gain insight into the type and use of each column. Cross-column analysis can expose embedded value dependencies, while inter-table analysis allows the analyst to discover overlapping value sets that represent relationships between entities.
  
Data profiling is best scheduled prior to system design, typically occurring during the discovery or analysis phase. The first step, and also a critical dependency, is to clearly identify the appropriate person to provide the source data and serve as the “go to” resource for follow-up questions. Once you receive the source data extracts, you’re ready to prepare the data for profiling. Loading the extracts into a database (as in the staging sketch below) will let you freely use SQL to query the data while also giving you the flexibility to leverage existing profiling tools if needed.
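
As an illustration, here is a minimal sketch of staging a source extract for profiling, assuming a PostgreSQL database; the table name, columns, and file path are placeholders for whatever your extract actually contains, and a similar staging table would be created for each extract file.

    -- Load everything as text first; cast to proper types during profiling.
    CREATE TABLE staging_constituents (
        constituent_id   text,
        first_name       text,
        last_name        text,
        email            text,
        phone            text,
        constituent_code text,
        status           text
    );

    -- COPY is PostgreSQL's bulk-load command; other databases have their own
    -- equivalents (for example, BULK INSERT in SQL Server).
    COPY staging_constituents
    FROM '/data/extracts/constituents.csv'
    WITH (FORMAT csv, HEADER true);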

When creating a data profile, it is best to start with basic column-level analysis (a SQL sketch of these checks follows the list), such as:
  • Distinct record count and percent: Counting the distinct values in a column and comparing that count to the total number of records can help identify potential unique keys within a dataset.
  • NULL count and percent: Counting null, blank, or zero values (for some numeric columns) can help identify records that fall short of minimum completeness standards or that violate business rules.
  • String analysis: Finding the minimum, maximum, and average string lengths can help identify column-level constraint violations and helps size string columns appropriately in the target database.
  • Numerical and date range analysis: Finding the minimum and maximum numerical and date values helps determine the appropriate data types and identify business rule violations.
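
To make the list concrete, here is a sketch of these basic checks written as raw SQL against the hypothetical staging_constituents table described above, plus a second hypothetical extract, staging_gifts, loaded the same way; adjust the table and column names to match your own data.

    -- Distinct count and percent: a candidate unique key approaches 100%.
    SELECT COUNT(*)                                          AS total_records,
           COUNT(DISTINCT constituent_id)                    AS distinct_ids,
           100.0 * COUNT(DISTINCT constituent_id) / COUNT(*) AS distinct_pct
    FROM staging_constituents;

    -- NULL/blank count and percent for a single column.
    SELECT COUNT(*) AS total_records,
           SUM(CASE WHEN email IS NULL OR email = '' THEN 1 ELSE 0 END) AS missing_email,
           100.0 * SUM(CASE WHEN email IS NULL OR email = '' THEN 1 ELSE 0 END)
               / COUNT(*) AS missing_email_pct
    FROM staging_constituents;

    -- String length analysis for a text column.
    SELECT MIN(LENGTH(last_name)) AS min_len,
           MAX(LENGTH(last_name)) AS max_len,
           AVG(LENGTH(last_name)) AS avg_len
    FROM staging_constituents;

    -- Numerical and date range analysis (assumes the text values cast cleanly).
    SELECT MIN(CAST(gift_amount AS numeric)) AS min_amount,
           MAX(CAST(gift_amount AS numeric)) AS max_amount,
           MIN(CAST(gift_date   AS date))    AS earliest_gift,
           MAX(CAST(gift_date   AS date))    AS latest_gift
    FROM staging_gifts;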

With the basic data profiling taken care of, more advanced analysis can be implemented (again, a SQL sketch follows the list), such as:
  • Key integrity: Checking that unique keys contain no zero, blank, or null values, and identifying any orphaned keys, can help eliminate key issues.
  • Cardinality: Identifying the one-to-one, one-to-many, and many-to-many relationships between data sets is important for data modeling and BI tool set-up.
  • Pattern, frequency distributions, and domain analysis: Pattern identification is useful for checking that data fields are formatted correctly, while frequency distributions and domain analysis show whether values fall within the expected set or range.
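
Here is a similar sketch of the advanced checks, again using the hypothetical staging_constituents and staging_gifts tables; the phone-masking example relies on PostgreSQL's regexp_replace, so other databases would need their own equivalent.

    -- Key integrity: gifts whose constituent_id has no matching constituent (orphaned keys).
    SELECT g.gift_id, g.constituent_id
    FROM staging_gifts g
    LEFT JOIN staging_constituents c ON c.constituent_id = g.constituent_id
    WHERE c.constituent_id IS NULL;

    -- Cardinality: distribution of gifts per constituent (one-to-one vs. one-to-many).
    SELECT gifts_per_constituent, COUNT(*) AS constituents
    FROM (
        SELECT constituent_id, COUNT(*) AS gifts_per_constituent
        FROM staging_gifts
        GROUP BY constituent_id
    ) t
    GROUP BY gifts_per_constituent
    ORDER BY gifts_per_constituent;

    -- Pattern and frequency analysis: the most common phone formats, with digits
    -- masked as '9' and letters as 'A'.
    SELECT regexp_replace(regexp_replace(phone, '[0-9]', '9', 'g'),
                          '[A-Za-z]', 'A', 'g') AS phone_pattern,
           COUNT(*) AS occurrences
    FROM staging_constituents
    GROUP BY 1
    ORDER BY occurrences DESC;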

Selecting the right profiling technique will depend on a project’s objectives. If you’re building a new database from scratch, take the time to execute on each of the above bullet points. If you’re simply importing a new data set into an existing database, select the tasks that apply to your source data.

Nearly all of these tasks can be conducted by writing raw SQL, and the basic ones can usually be accomplished with a tool designed specifically for data profiling; many such tools have been introduced into the marketplace over the last several years.

If your organization wants to use data profiling to improve data quality in an existing database or help design a system integration, then shoot us an email. We would love to hear from you!

UPDATE: The beta version of our new tool to profile data within The Raiser's Edge has been released, and it can be found on the Tucamino Solutions website.


Saturday, November 22, 2014

Data Quality - What is it and how to improve it?

Data Quality can mean many different things to many different people in different organizations and industries.  But at its core, data quality is a perception or an assessment of data's fitness to serve its purpose in a given context. 

When we think of "good" data quality, we can think of the following aspects (a short SQL sketch measuring two of them follows the list):
1. Accuracy – Is the information in the database accurate? For example, is the address on file actually the person’s current address?
2. Completeness – Is the information in the database complete? For example, do we have the first name, last name, phone number, email, and address of a person?
3. Relevancy – Is the information relevant to the business? For example, is it relevant to capture employee information in a fundraising database (it depends, right?).
4. Consistency – Is the information consistent within the database and across data sources? For example, some records have a constituent code and some do not; some applications capture first name and last name as separate fields while others capture them as one combined field.
5. Reliability – Is the information reliable, i.e., can I trust it? This is really an evaluation of the first four elements, but it is the most important aspect.
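
As a small illustration, two of these aspects, completeness and consistency, can be reduced to measurable numbers with queries like the following; the constituents table, its columns, and the rules themselves are assumptions meant only to show the idea.

    -- Completeness: percent of constituents with all core contact fields populated.
    SELECT 100.0 * SUM(CASE WHEN first_name IS NOT NULL
                             AND last_name  IS NOT NULL
                             AND email      IS NOT NULL
                             AND phone      IS NOT NULL
                            THEN 1 ELSE 0 END) / COUNT(*) AS completeness_pct
    FROM constituents;

    -- Consistency: percent of active constituents that carry a constituent code.
    SELECT 100.0 * SUM(CASE WHEN constituent_code IS NOT NULL THEN 1 ELSE 0 END)
               / COUNT(*) AS coded_pct
    FROM constituents
    WHERE status = 'Active';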

Within an organization, acceptable data quality is crucial to effective business processes and to the reliability of business analytics and intelligence reporting. If you think about it, the whole point of capturing data is to then do something with it, and usually that involves making what are hopefully smart business decisions.

Data quality is affected by the way data is entered, stored, and managed. Most of the time and effort that organizations spend on data quality is focused on the first component: data entry. What results is an overly controlled environment where only a few people are allowed to enter data and overly manual processes are implemented, because of a belief that this tight control will lead to “good” data quality. Any manual process where a human has to perform a task will produce errors (that is what it means to be human). As a result, the organization ends up with processes that lead to inconsistent, incomplete, and unreliable data, in addition to “hit-by-a-bus” risks.

If organizations focus more on the last element, the management of information and data, the first two components, data entry and storage, can naturally and continuously be improved, leading to acceptable data quality that improves decision making. This last component is oftentimes called Data Quality Assurance (DQA). DQA is the process of verifying the reliability and effectiveness of data.
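
In practice, a DQA check is often just an exception report that runs on a schedule. A minimal sketch, assuming a hypothetical constituents table and one sample rule, might look like the query below; real checks would be built from your organization’s own business rules.

    -- Flag active constituents that violate a sample business rule: missing email.
    SELECT constituent_id,
           first_name,
           last_name,
           'Active constituent missing email' AS exception_reason
    FROM constituents
    WHERE status = 'Active'
      AND (email IS NULL OR email = '');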

If organizations refocus from a data entry approach to a data management approach, they will end up with a more effective, efficient, and better informed organization. More effort is spent on automating tasks; improving operations, processes, and data quality; and training and education. Organizational awareness and intelligence are created, leading to a more informed decision-making engine.

In order to implement an improvement to data quality, several steps need to take place (which is no small task but usually a large endeavor):
1. Organizationally, a plan needs to be put in place that identifies what the organization is trying to accomplish operationally. This can then lead to the creation of the necessary data models to support those operations.
2. Applications and systems need to be updated to accommodate those data models. Focus is spent on determining the minimal set of data the organization needs to function in its mission at maximum capacity, and any unnecessary data is excluded.
3. Data Quality Assurance is then implemented. This includes implementing the tools necessary to measure the aspects of data quality.
4. All processes, especially those focused on information capture and data entry, need to be aligned with organizational goals and application data models. This usually includes the development of new software tools such as data entry forms, non-manual ETL/system integrations, and reporting and BI tools.
5. Training and education on applications and processes need to be developed and implemented.

In summary, good data quality is an important component of an effective organization.
Organizations need to take a step back and ask themselves whether they are focused on good data quality. Good data quality can’t be achieved overnight, but with the right focus and plan it can be improved over time.

If you want help improving your organization’s data quality, check us out at Tucamino Solutions or send us an email. We would love to hear about your organization and its challenges, and to help you improve your data quality.