Hadoop, Vertica, SSIS – Sandbox – Quick step by step


We are going to set up a sandbox on a Windows laptop (8.x or 10) to play around with HP Vertica, or to build a single-node proof of concept.

Step 1: Download and install VMware Player from vmware.com. This is a virtualization tool that is free for non-commercial use.

Step 2: Download the sandbox zip file. This is available free from the HP marketplace. You need to create a free account in order to get access (https://haven.hpwsportal.com/catalog.html#/Product/%7B%22productId%22%3A%221826%22%7D/Show).

Once you download the 3+ GB zip file, unzip it and place the .ova file on your local Windows machine.

Step 3: Start VMware Player and choose File -> Open appliance, then select the .ova sandbox image to import. This may take some time and will ask for server parameters such as memory, hard disk, processor, etc. Consider the minimum needed and your resource availability. Once the import is done, you can start/play the new virtual machine. This is a CentOS image where Hadoop and Vertica are preconfigured.

Step 4: Once the system has started you will see a black landing screen with an IP address on it. Note that IP address; I will refer to it as YourIP for the rest of this document.

Step 5: Open a browser and navigate to http://YourIP. This will take you to a registration screen. Fill in the details and register yourself.

Step 6: Access Hue. Open a browser and navigate to http://YourIP:8000. The default user ID/password is hue/1111.
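If the page does not come up, you can quickly check from any machine that can reach the VM whether the service is listening. A small sketch, assuming curl is available:

$ curl -I http://YourIP:8000    # an HTTP status line (e.g. 200 or 302) means Hue is up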


Step 7: Enable Ambari by clicking the “Enable” button to the right of “Ambari” in the list. Ambari will give you insight into what is going on in the server. It is available at:

http://YourIP:8080


Step 8: Connect with PuTTY: give YourIP, port 22, connection type SSH. This will connect you to the Linux machine. Log in as root with password ‘hadoop’.
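If you prefer a plain SSH client over PuTTY (for example from Git Bash, or a Mac/Linux host), the equivalent is a one-liner:

$ ssh root@YourIP    # enter the password 'hadoop' when prompted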

Step 9: A default database named “vive” has already been created in Vertica. Start the admin tool to bring the database up.

$ cd /opt/vertica/bin

$ su dbadmin

$ ./admintools

Use the arrow keys and choose “Start Database”. Select the listed database (vive) by pressing the spacebar, which will print an ‘X’ next to VIVE. Choose OK. When asked for a password, give ‘vive’. Wait until you see “Database vive started successfully”. You can go back to the admintools menu and check the database status via “View Database Cluster State”.

Hit Exit to quit the admin tools.
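If you would rather skip the menus, admintools also accepts command-line options. A sketch of the same start/status sequence; the tool names below come from the Vertica admintools CLI, so verify them against ./admintools --help on your version:

$ ./admintools -t start_db -d vive -p vive    # start the vive database non-interactively

$ ./admintools -t view_cluster    # print the cluster/database state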

Step 10: You can start the demo by running the script available in the image.

$ cd /home/dbadmin/demo/vDemo

$ ./start_vDemo.sh

You can browse the demo database and query it using the web tool available at:

http://YourIP:8111/home/hortonworks/hortonworks.html
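You can also query the demo database straight from the SSH session using vsql, Vertica’s command-line client. A quick sketch, assuming the dbadmin password ‘vertica’ from Step 11:

$ /opt/vertica/bin/vsql -U dbadmin -w vertica -d vive -c "\dt"    # list the tables in the vive database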


Step 11: Connecting to Vertica from the host Windows machine.

Download DBeaver, a free database explorer tool.

Open a new connection and enter:

Host: YourIP
Port: 5433
Database: vive
User: dbadmin
Password: vertica
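Before setting up DBeaver, you can confirm that the same credentials work, for example from the PuTTY session. A sketch using vsql (any SQL client would do):

$ /opt/vertica/bin/vsql -h YourIP -p 5433 -U dbadmin -w vertica -d vive -c "SELECT version();"    # prints the Vertica version if the connection succeeds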


Step 12: To load data from SQL Server using SSIS

Download the Vertica ODBC driver from my.vertica.com. Register and download the Windows 64-bit ODBC driver, then install the exe.


Open an SSIS package, connect the source SQL Server database, and set up the transformation. You can create the destination table using DBeaver (a sketch follows below). Select ODBC Destination and create a new DSN, using YourIP as the server address. Map the columns and run the package.
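The destination table has to exist in Vertica before the package runs. A minimal sketch you could execute from DBeaver or vsql; the table and column names here are hypothetical, so match them to your SQL Server source:

$ /opt/vertica/bin/vsql -U dbadmin -w vertica -d vive <<'SQL'
-- hypothetical destination table; adjust the columns to your source
CREATE TABLE staging_customer (
    customer_id   INT,
    customer_name VARCHAR(100),
    city          VARCHAR(50)
);
SQL

After the package has run, a quick row count confirms the load:

$ /opt/vertica/bin/vsql -U dbadmin -w vertica -d vive -c "SELECT COUNT(*) FROM staging_customer;"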


Go back to DBeaver and select from the table to verify the load. Good luck.


Master Data Management Using SQL Server Master Data Services®


Data is the key asset of any business, and maintaining it properly is a key step towards business growth. Data grows over time, with business growth, and with corporate actions like mergers and acquisitions. Let’s have a close look at the types of data we can find in a business organization. It can be categorized as below:

  1. Unorganized Data: This can be emails, documentation, write-ups, etc., which are basically unstructured, or even bits and pieces of key data in Excel sheets stored in different places.
  2. Lists: This is nothing but ‘master data’, e.g. customer lists, products, items, item groups, locations, vendors, etc.
  3. Transaction Records: These are records of business activities like sales, purchases, account transactions, payments, etc.
  4. Metadata: This is data about data: configurations, database information, etc.

Of these, No. 2 (master data) has the longest lifetime, the highest impact, and the highest level of reuse.

Common Master Data headaches in an organization

  • Data Duplication: The same item is listed multiple times with slightly different spellings or in different formats (see the sketch after this list).
  • Data Integrity: Multiple lists of the same data with contradicting attribute values, e.g. the same customer with different addresses and no way to tell which one is the latest.
  • Data Inaccuracy: Data quality below the accepted standard.
  • Invalid Data: Missing zip codes, SSNs with fewer digits than expected, etc.
  • Privacy Issues: Customer data can be accessed by different applications within the organization and by partner organizations, with insufficient privacy measures to secure the data asset.
  • Data Inconsistency: Due to complex enterprise database designs, one attribute can exist in multiple places, and data updates do not sync everywhere promptly.
  • Lack of proper validation against business rules: In systems built over time, copies of master data are not validated against a centralized set of business rules.
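To make the first headache concrete, here is a small sketch of a duplicate check against a SQL Server source, run through sqlcmd; the server, database, and table names are hypothetical:

$ sqlcmd -S localhost -d SalesDB -Q "SELECT CustomerName, COUNT(*) AS copies FROM dbo.Customer GROUP BY CustomerName HAVING COUNT(*) > 1"    # lists every customer name that appears more than once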

Importance of MDM (Master Data Management)

MDM is a structured way of storing and maintaining master data in an organization. MDM offers the organization a single version of the truth. It brings proper discipline to data management and coordination between different systems. It simplifies the process of integrating systems in the event of corporate mergers and acquisitions. Moreover, it simplifies life for application development and data warehousing teams by taking on the burden of cleaning, managing, and maintaining the master data. That way it reduces overall effort and saves dollars.


MDM implementation phases

Implementing MDM is a tedious process, as it takes on the burden of other implementations and offers a clean enterprise master data backbone. It is not an isolated task: before building the system, one has to implement an organizational process and policy in order to prepare the organization to strictly follow the discipline of proper master data usage going forward. This will also help developers and other technical and business users understand the new master data repository and the methods to connect to it.

1. Understanding Data

All master data sources have to be identified. This can be done through a conventional system analysis process, and each source needs to be validated against a set of criteria that determine whether it can be considered master data. These criteria are based on factors like data behavior, life cycle, cardinality, lifetime, complexity, value, volatility, and reuse. The CRUD (create, read, update and delete) nature of the data will influence the selection of data dimensions as well as data modelling.

It is also important to identify the producers and consumers of each data set. The metadata has to be analyzed and documented.

2. Define Organizational Process

A data steward has to be appointed for each data area. This helps in getting answers during the implementation phase, and the steward can act as an authority going forward to approve access, make decisions on system changes or additions, and answer questions.

A data governance policy will be formulated, and a data governance committee formed under it. This committee coordinates the individual data stewards, updates the data governance policies from time to time, decides how long data is kept, and manages change authorization, auditing, and data access permission policies.

3. Build, test and maintain

Data attributes are decided based on the overall analysis, and the master data model is architected. Hierarchies and relationships within the master data are also identified. The appropriate technology is selected, the infrastructure is built, and the system is implemented with initial and incremental data loads in place. A dedicated test procedure is also established to ensure data quality. This is very important, as wrong data will affect different applications and rollback will be tricky.

The tool selected for implementing MDM should also support granular security for handling privacy regulations and data security, transactions, data versioning and auditing, business rules integration, etc.

 

MDM Approach

The most popular practice for MDM architecture is a hybrid approach, where some data is maintained as a single copy across the organization with changes handled centrally, while certain other data sets are maintained as multiple copies and synced through a central mechanism. The transactional hub architecture with Microsoft Master Data Services enables us to choose between repository, registry, and hybrid options.

SQL Server 2012 Master Data Services

Microsoft acquired Stratature’s +EDM, a master data management tool, and initially integrated it with SQL Server 2008 R2. The product has since been integrated with Microsoft’s office business platforms and applications. It interoperates with a wide range of popular tools like MS SQL Server, MS SharePoint, WCF, SSIS, BizTalk, ADO.NET, etc., which increases its strength and acceptability.

The SQL Server 2012 version improved its analytical capabilities, transactional capabilities, and integration with other tools and technologies.

SQL Server MDS Components

Configuration Manager: Tool for creating and configuring Master Data Services.

Master Data Manager: Web application to create models, business rules, etc.

MDS Model deployment tool: This is used to create packages to deploy model objects and data.

MDS Web Service: Developers can use this to build custom solutions for MDS.

MDS Add-in for Excel: This gives you the flexibility to work with the familiar Excel interface to manage data and to create new entities and attributes.

 

In summary, with MDS and a proper implementation, Microsoft offers better data quality, which in turn provides an ROI by reducing the time and effort spent on data quality fixes and workarounds. It also enables the business to correct data, which reduces impact time and empowers the business to make key decisions.

Principles of Mathematics by Russell – 1903


“The fact that all Mathematics is Symbolic Logic is one of the greatest discoveries of our age; and when this fact has been established, the remainder of the principles of mathematics consists in the analysis of Symbolic Logic itself” – Bertrand Russell

The great Principles of Mathematics is deservedly the most famous work ever written on the foundations of mathematics. Its aim is to deduce all the fundamental propositions of logic and mathematics from a small number of logical premises and primitive ideas, and so to prove that mathematics is a development of logic.

A good read for a computer programmer.