Hadoop, Vertica, SSIS – Sandbox – Quick step by step


We are going to set up a sandbox on a Windows laptop (8.x or 10) to play around with HP Vertica or to build a single-node proof of concept.

Step 1: Download and install VMware Player from vmware.com. This is a virtualization tool that is free for non-commercial use.

Step 2: Download the sandbox zip file. It is available free from the HP marketplace; you need to create a free account to get access (https://haven.hpwsportal.com/catalog.html#/Product/%7B%22productId%22%3A%221826%22%7D/Show)

Once you download the 3+ GB zip file, unzip it and place the .ova file on your local Windows machine.

Step 3: Start VMware Player, choose File -> Open, and select the .ova sandbox image to import. This may take some time and will ask for virtual machine parameters such as memory, hard disk, and processors; choose values based on the minimum needed and your resource availability. Once the import is done, you can start/play the new virtual machine. This is a CentOS image with Hadoop and Vertica preconfigured.

Step 4: Once the system has started, you will see a black landing screen with an IP address shown on it. Note that IP address; it will be referred to as YourIP for the rest of this document.

Step 5: Open a browser and navigate to http://YourIP. This will take you to a registration screen. Fill in the details and register yourself.

Step 6: Access Hue. Open a browser and navigate to http://YourIP:8000. The default user ID/password is hue/1111.
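If you want a quick way to confirm Hue is actually serving before logging in, a simple check from the Windows host is to request the page headers with curl (assuming curl is available, for example in recent Windows 10 builds or Git Bash):

$ curl -I http://YourIP:8000

A 200 response, or a redirect to the login page, means Hue is up and reachable from your laptop.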


Step 7: Enable Ambari by clicking the “Enable” button to the right of “Ambari” in the list. Ambari gives you insight into what is going on in the server. It is then available at:

http://YourIP:8080


Step 8: Open PuTTY, enter YourIP, port 22, connection type SSH. This connects you to the Linux machine. Log in as root with the password ‘hadoop’.
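If you prefer a command-line client over PuTTY (for example the OpenSSH client bundled with recent Windows 10 builds, or any Linux/Mac terminal), the equivalent connection is a single command; YourIP is the address noted in Step 4 and the password is the same ‘hadoop’:

# connect to the sandbox as root; you will be prompted for the password
$ ssh root@YourIP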

Step 9: A default database named “vive” has already been created in Vertica. Start the admin tools to bring the database up.

$ cd /opt/vertica/bin

$ su dbadmin

$ ./admintools

Use the arrow keys to choose “Start Database”. Select the listed database (vive) by pressing the spacebar, which prints an ‘X’ next to VIVE, then choose OK. When asked for the password, enter ‘vive’ and wait until you see “Database vive started successfully”. You can go back to the admintools menu and check the database status via “View Database Cluster State”.

Choose Exit to quit the admin tools.
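admintools can also be driven without the menus. Below is a minimal sketch, assuming the database name and password are both ‘vive’ as above; if the options differ on your image, admintools -a lists the available tools and their switches.

$ su dbadmin
# start the database non-interactively, then print the cluster state
$ /opt/vertica/bin/admintools -t start_db -d vive -p vive
$ /opt/vertica/bin/admintools -t view_cluster -d vive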

Step 10: You can start the demo by running the script available in the image.

$ cd /home/dbadmin/demo/vDemo

$ ./start_vDemo.sh

You can browse the demo database and query it using the web tool available at the URL below.

http://YourIP:8111/home/hortonworks/hortonworks.html
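You can also poke at the demo data from the SSH session using vsql, the command-line SQL client that ships with Vertica. The demo table names depend on what the script loaded, so this sketch simply lists whatever tables exist in the catalog (vsql will prompt for the database password if one is required):

$ su - dbadmin
$ /opt/vertica/bin/vsql -d vive -c "SELECT table_schema, table_name FROM v_catalog.tables LIMIT 10;"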

3

Step 11: Connecting to Vertica from the Windows host.

Download DBeaver, a database explorer tool that is free to download.

Open a new connection and enter YourIP as the host, port 5433, and database vive.

User: dbadmin

Password: vertica
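If DBeaver refuses to connect, it is worth confirming the same credentials work with vsql from the Step 8 SSH session; a quick sketch using the connection details above:

$ /opt/vertica/bin/vsql -h YourIP -p 5433 -d vive -U dbadmin -w vertica -c "SELECT version();"

If this succeeds but DBeaver does not, the problem is on the Windows side (driver, firewall, or a typo in the connection settings) rather than in Vertica.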


Step 12: Load data from SQL Server using SSIS.

Download the Vertica ODBC driver from my.vertica.com: register, download the 64-bit Windows ODBC driver, and install the exe.


Open an SSIS package, connect the source SQL Server database, and set up the transformation. You can create the destination table using DBeaver (or with vsql, as sketched below). Select the ODBC destination and create a new DSN, using YourIP as the server address. Map the columns and run the package.
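If you prefer to create the destination table from the Step 8 SSH session instead of DBeaver, a minimal sketch follows; the table name and columns here are purely illustrative, so match them to the source table you are actually loading:

$ su - dbadmin
# create an illustrative staging table in the vive database
$ /opt/vertica/bin/vsql -d vive -c "CREATE TABLE public.customer_stage (customer_id INT, customer_name VARCHAR(100), city VARCHAR(50));"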



Go to DBeaver and select the table to confirm the rows have landed. Good luck.
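The same check can be done from the command line, again assuming the illustrative table name from the sketch above:

$ /opt/vertica/bin/vsql -d vive -U dbadmin -c "SELECT COUNT(*) FROM public.customer_stage;"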

Master Data Management Using SQL Server Master Data Services®


Data is a key asset of any business, and maintaining it properly is a key step towards business growth. Data grows over time, with business growth, and through corporate actions such as mergers and acquisitions. Let’s take a close look at the types of data found in a typical business organization. They can be categorized as below:

  1. Unorganized Data: Emails, documentation, write-ups, etc., which are basically unstructured, as well as bits and pieces of key data in Excel sheets stored in different places.
  2. Lists: This is nothing but ‘master data’, e.g. customer lists, products, items, item groups, locations, vendors, etc.
  3. Transaction records: Records of business activities such as sales, purchases, account transactions, payments, etc.
  4. Metadata: Data about data, such as configurations, database information, etc.

Of these, No. 2 (master data) has the longest lifetime, the highest impact, and the highest level of reuse.

Common Master Data headaches in an organization

  • Data Duplication: The same item is listed multiple times with slightly different spellings or in different formats.
  • Data Integrity: Multiple lists of the same data with contradicting attribute values, e.g. the same customer with different addresses and no way to tell which one is the latest.
  • Data Inaccuracy: Data quality below the accepted standard.
  • Invalid Data: Missing ZIP codes, SSNs with too few digits, etc.
  • Privacy issues: Customer data can be accessed by different applications within the organization and by partner organizations, with no privacy measures in place to secure the data asset.
  • Data Inconsistency: Because of complex enterprise database designs, one attribute can live in multiple places, and updates do not sync everywhere promptly.
  • Lack of proper validation against business rules: Systems built over time keep their own copies of master data that are never validated against a centralized set of business rules.

Importance of MDM (Master Data Management)

MDM is a structured way of storing and maintaining master data in an organization. MDM offers the organization a single version of the truth. It brings proper discipline to data management and coordination between different systems, and it simplifies integrating systems in the event of a corporate merger or acquisition. Moreover, it simplifies life for application development and data warehousing teams by taking on the burden of cleaning, managing, and maintaining the master data. In that way it reduces overall effort and saves money.


MDM implementation phases

Implementing MDM is a demanding process, as it takes on the burden of other implementations and offers a clean enterprise master data backbone. It is not an isolated task: before building the system, one has to put an organizational process and policy in place to prepare the organization to strictly follow the discipline of proper master data usage going forward. This also helps developers and other technical and business users understand the new master data repository and the methods used to connect to it.

1. Understanding Data

All master data sources have to be identified. This can be done through a conventional system analysis process, and each source needs to be validated against a set of criteria that determine whether it can be considered master data. These criteria are based on factors such as data behavior, life cycle, cardinality, lifetime, complexity, value, volatility, and reuse. The CRUD (create, read, update and delete) nature of the data will influence the selection of data dimensions as well as the data modelling.

It is also important to identify the producers and consumers of each data set. The metadata has to be analyzed and documented.

2. Define Organizational Process

A data steward has to be appointed for each data area. This helps get answers during the implementation phase, and the steward remains the authority going forward to approve access, make decisions on system changes or additions, and answer questions.

A data governance policy is formulated, and a data governance committee is formed. This committee coordinates the individual data stewards, updates data governance policies from time to time, decides how long data is kept, and manages change authorization, auditing, and data access permission policies.

3. Build, test and maintain

Data attributes are decided based on the overall analysis, and the master data model is architected. Hierarchies and relationships within the master data are also identified. Appropriate technology is selected, infrastructure is built, and the system is implemented with initial and incremental data loads in place. A dedicated test procedure is also established to assure data quality. This is very important, as wrong data will affect multiple applications and rollback will be tricky.

The tool selected for implementing MDM should also support granular security for handling privacy regulations and data security, transactions, data versioning and auditing, business rules integration, etc.

 

MDM Approach

The most popular MDM architecture is a hybrid approach in which some data is maintained as a single copy across the organization with changes handled centrally, whereas certain data sets are maintained as multiple copies and synced through a central mechanism. The transactional hub architecture of Microsoft Master Data Services lets us choose between repository, registry, and hybrid options.

SQL Server 2012 Master Data Services

Microsoft acquired Stratature’s +EDM, a master data management tool, and initially integrated it into SQL Server 2008 R2. The product has been integrated with Microsoft’s Office business platforms and applications. It interoperates with a wide range of popular tools such as MS SQL Server, MS SharePoint, WCF, SSIS, BizTalk, ADO.NET, etc., which increases its strength and acceptance.

The SQL Server 2012 version improved its analytical capabilities, transactional capabilities, and integration with other tools and technologies.

SQL Server MDS Components

Configuration Manager: Tool for creating and configuring Master Data Services.

Master Data Manager: Web application to create models, business rules etc.

MDS Model deployment tool: This is used to create packages to deploy model objects and data.

MDS Web Service: Developers can use this to build custom solutions for MDS.

MDS Add-in for Excel: This gives the flexibility to work with the familiar Excel interface to manage data and to create new entities and attributes.

 

In summary, by choosing MDS and implementing it properly, an organization gets better data quality, which in turn provides ROI by reducing the time and effort spent on data quality fixes and workarounds. It also enables the business to correct the data itself, which reduces impact time and empowers the business to make key decisions.

Microsoft's in-memory avatar of BI – SSAS Tabular modelling


SQL Server 2012 comes with more features for flexibility, security, and scalability:

  • Tabular model or multidimensional model to choose from
  • Queries and logic using DAX or MDX
  • Storage options – cached and pass-through
  • Row- or cell-level security
  • VertiPaq for high performance and DirectQuery for real time – all with new compression algorithms

Let us take a deeper look (from an informational perspective only) at the tabular modelling of SSAS 2012 in the note below.

The tabular model is more familiar to architects and developers, which lets them build easily and reach a solution faster. Complex modelling concepts are not directly available here. Its primary use is to provide an analytics wrapper over a simple relational database.

To start with a tabular project

In BIDS, once we select SSAS there are two project types: an SSAS multidimensional project and a tabular project. A Model.bim file appears in Solution Explorer. Select the file and look at the model’s properties. The DirectQuery Mode property there decides whether the model is held in memory: when DirectQuery Mode is OFF, in-memory mode is ON.

The model’s Import from Data Source menu is used to import data tables. Select tables and views after supplying the required credentials. We don’t have to take all the columns from a table; instead we can choose columns by filtering the table data.

We need to mark the date table by selecting the Mark as Date Table option on the date table tab of the model designer to enable time intelligence functions.

DAX

The Data Analysis Expressions (DAX) language is a formula language that can be considered an extension of the formula language in Excel. DAX statements operate against an in-memory relational data store made up of the tables and relationships in the PowerPivot workbook. DAX is used to create custom measures and calculated columns.

  • DAX expressions cannot be used to create new rows, only to create new values in columns or measures based on existing data.
  • DAX is not a query language; it is an expression language that is embedded within the MDX statements that are passed to an in-process instance of Analysis Services.
  • Use of DAX expressions is supported only within PowerPivot for Excel. We cannot use measures created by a DAX expression in an instance of Analysis Services that supports traditional OLAP.

Hardware considerations

All in-memory database technologies require large amounts of RAM, so it comes as no surprise that maximizing RAM should be your top priority. Whether the plan is to run multiple small to medium size solutions or just one very large solution, begin the hardware search by looking at systems that offer the most RAM that can be afforded.

Because query performance is best when the tabular solution fits in memory, RAM must be sufficient to store the entire database. Depending on how you plan to manage and use the database, additional memory of up to two or three times the size of your database might be needed for processing, disaster recovery, and queries against very large datasets that require temporary tables to perform the calculation. For example, a 20 GB model might call for an extra 40–60 GB of headroom on top of the base copy.

Disk I/O is not a primary factor in sizing hardware for a tabular solution, as the model is optimized for in-memory storage and data access.


Ref: Microsoft white papers, Microsoft documentation and solutions