Data about ‘Quality of Data’ & Intro to Data profiling task in SSIS 2012


The quality of data in a data warehouse is very important in any DW environment. It can even be highly critical in some scenario. Analysing large volume of source data for quality confirmation has always been a challenge for any data warehouse architect. Data quality and source data management is a key for the success of the data warehouse. It is considered as the reason for more than 75% of failed DW projects are related to source data quality issues or source data management issues. ISO 9001:2008 framework was used a lot for managing Source Data Quality.

There are different strategies to address source data quality issues which are mentioned below in order,

1. Revert back : Get back to the source data management team for dealing it there.

2. Fill the gap : Known issues with the data can be addressed while data transformation.

3. Dump the garbage : If there is no way to recover certain data to its minimum required quality, then avoid that from getting into warehouse after proper auditing and reconciliation.

For performing the above steps , it is very important to asses the data quality accurately. Belo mentioned is an explanation about the usage of Data profiling task in SSIS 2012.

ScreenShot_2013-02-10_18-19-01    ScreenShot_2013-02-10_18-18-53

First step is to provide a log file name using the normal file connection, we can also input this to a variable.

ScreenShot_2013-02-10_18-18-09

Then we should select a profile request from the available items,

Select Item Type Description
Column Length Distribution Profile Individual Columns Reports all the distinct lengths of string values in the selected column and the percentage of rows in the table that each length represents.
Column Null Ratio Profile Individual Columns Reports the percentage of null values in the selected column.
Column Pattern Profile Individual Columns Reports a set of regular expressions that cover the specified percentage of values in a string column.
Column Statistics Profile Individual Columns Reports statistics, such as minimum, maximum, average, and standard deviation for numeric columns, and minimum and maximum for datetime columns.
Column Value Distribution Profile Individual Columns Reports all the distinct values in the selected column and the percentage of rows in the table that each value represents. Can also report values that represent more than a specified percentage of rows in the table.
Candidate Key Profile Multiple columns and relationships Reports whether a column or set of columns is a key, or an approximate key, for the selected table.
Functional Dependency Profile Multiple columns and relationships Reports the extent to which the values in one column (the dependent column) depend on the values in another column or set of columns (the determinant column).
Value Inclusion Profile Multiple columns and relationships Computes the overlap in the values between two columns or sets of columns. This profile can determine whether a column or set of columns is appropriate to serve as a foreign key between the selected tables.

ScreenShot_2013-02-10_18-31-35

The limitation I see here is , we can only connect using ADO.Net connection.

A quick profiler view button is there for a one time view.A Quick Profile profiles a table or view by using all the default profiles and default settings. Also a profile viewer application can be launched from the task properties screen for viewing  and analysing the profile.

Extract XML data from SQL table


Solution proposed by Susheel :

 

DECLARE @tmp TABLE

(

  txt xml

)

INSERT INTO @tmp (txt)

select AlertXML FROM [AP].[dbo].[AlertInfo]

SELECT txt.value(‘(//SPREADINFO/SPREAD_CODE)[1]’, ‘nvarchar(50)’) SPREAD_CODE,txt.value(‘(//SPREADINFO/SPREAD_DESC)[1]’, ‘nvarchar(50)’) SPREAD_DESC

FROM   @tmp

 

Select * from @tmp

 

This will give the result in a table format and then can passed to ssis pipeline.

 

Other Solution,

Put a foreach loop and navigate row by row. fill the xml data to an object type package variable, then a DFT –> xml data source can pick data from the variable and pass to the destination.

Parallel execution in SSIS


SSIS allows parallel execution in two different ways. 

2 proporties controll the parallel execution of SSIS tasks. 

MaxConcurrentExecutables, – This is a property of the package. It defines how many tasks (executables) can be executed simultaneously. It defaults to -1 which is translated to the number of processors plus 2. While doing this, you server hashyperthreading turned on, it is the logical processor rather than the physically present processor that is counted .