The quality of data in a data warehouse is very important in any DW environment. It can even be highly critical in some scenario. Analysing large volume of source data for quality confirmation has always been a challenge for any data warehouse architect. Data quality and source data management is a key for the success of the data warehouse. It is considered as the reason for more than 75% of failed DW projects are related to source data quality issues or source data management issues. ISO 9001:2008 framework was used a lot for managing Source Data Quality.
There are different strategies to address source data quality issues which are mentioned below in order,
1. Revert back : Get back to the source data management team for dealing it there.
2. Fill the gap : Known issues with the data can be addressed while data transformation.
3. Dump the garbage : If there is no way to recover certain data to its minimum required quality, then avoid that from getting into warehouse after proper auditing and reconciliation.
For performing the above steps , it is very important to asses the data quality accurately. Belo mentioned is an explanation about the usage of Data profiling task in SSIS 2012.
First step is to provide a log file name using the normal file connection, we can also input this to a variable.
Then we should select a profile request from the available items,
|Column Length Distribution Profile||Individual Columns||Reports all the distinct lengths of string values in the selected column and the percentage of rows in the table that each length represents.|
|Column Null Ratio Profile||Individual Columns||Reports the percentage of null values in the selected column.|
|Column Pattern Profile||Individual Columns||Reports a set of regular expressions that cover the specified percentage of values in a string column.|
|Column Statistics Profile||Individual Columns||Reports statistics, such as minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for datetime columns.|
|Column Value Distribution Profile||Individual Columns||Reports all the distinct values in the selected column and the percentage of rows in the table that each value represents. Can also report values that represent more than a specified percentage of rows in the table.|
|Candidate Key Profile||Multiple columns and relationships||Reports whether a column or set of columns is a key, or an approximate key, for the selected table.|
|Functional Dependency Profile||Multiple columns and relationships||Reports the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column).|
|Value Inclusion Profile||Multiple columns and relationships||Computes the overlap in the values between two columns or sets of columns. This profile can determine whether a column or set of columns is appropriate to serve as a foreign key between the selected tables.|
The limitation I see here is , we can only connect using ADO.Net connection.
A quick profiler view button is there for a one time view.A Quick Profile profiles a table or view by using all the default profiles and default settings. Also a profile viewer application can be launched from the task properties screen for viewing and analysing the profile.