HP vertica simple query optimization steps.


If you consider the basic steps to improve performance , please follow the below steps.

  1. Make sure tables are partitioned in a balanced way . Too many partitions are not good.The partision column needs to be set as NOT NULL. you can do it at any point using an ALTER TABLE command.
  2. Partition can be done by issuing an ALTER TABLE command
  3. PURGE_TABLE() for any purging of pending changes.
  4. Initially go for a projection update . You can select a complete initial projection without a specific script. This will make sure all the columns are included with super projections.
  5. Next step is to go and create a projection for the specific query using database designer tool. This should be done against incremental projections and deploy the projections.

 

Query Results by pages – Query Caching Method


Pagination of query results for web application to show data by pages is a very common requirement across all types of application. Here is a solution to overcome certain common issues that we have today . The very popular solution in MySQL for this now a days is to use OFFSET and LIMIT with the select queries. The disadvantage  with using OFFSET and LIMIT is each time when user request for a page, database engine process the entire data, sort it and calculate the particular page user asks for and return. In order to avoid this repeated full data process, we are doing a full data process first time , cache it and reuse the data. We also manage the cache to expire it appropriately.

We are going to achieve this using a search stored procedure and cache index table. The cache index table keeps the active sessions of cache.

create table paginationHdr (sessionid varchar(32),queryid int,temptable varchar(100),whereclause varchar(1000))

We keep the sessionid here to identify users session. Queries can be written inside using switch case , so that the procedure can be used for multiple queries. The best way is to keep this in a separate table so that we do not need to modify the procedure each time we insert a query. temptable field is nothing but the cache table name for the particular call.

Another thing is to keep in mind is that, when a web session expires, make sure you do the following things.

  1. drop all cache tables created for the expiring session ( Select temptable from paginationHdr where sessionid ='<session>’;)
  2. delete all records from the cache index table (paginationHdr) with the expiring sessionid

The procedure must be called as –

CALL spSearch(11,’12345678901234567890123456789012′,0,1,’ AND 1=1 ‘,’nameColumn’);

This stored procedure is implemented in MySQL but you can translate it pretty much to any relational database like SQL Server etc.

Pagination

delimiter //
create procedure spSearch(p_queryid int, p_sessionid varchar (32),p_refresh bit,p_page int,p_whereclause varchar(1000),p_orderby varchar(100))
BEGIN
Set @v_pagesize = 3;
Set @v_basequery =null;
Set @s=null;

CASE p_queryid
WHEN 1 THEN
SET @v_basequery = ‘ SELECT idColumn,nameColumn, @curRank := @curRank + 1 AS rnk99, 0 AS pg99 FROM table1,(SELECT @curRank := 0) r WHERE active=1 ‘;

WHEN 2 THEN
SET @v_basequery = ‘ SELECT idColumn,nameColumn, @curRank := @curRank + 1 AS rnk99, 0 AS pg99 FROM table2,(SELECT @curRank := 0) r WHERE active=1 ‘;

END CASE;

Set @v_temptable=null;
SELECT temptable INTO @v_temptable FROM paginationHdr WHERE sessionid=p_sessionid AND 0=p_refresh AND queryid=p_queryid AND whereclause = p_whereclause;

IF @v_temptable is null THEN — This is for first time data pulling
Select CONCAT(‘TEMP’,SUBSTRING(MD5(RAND()) FROM 1 FOR 10),left(p_sessionid,5) ) into @v_temptable ;

LoopStart: LOOP
Set @tabletodelete=null;
Select temptable INTO @tabletodelete from paginationHdr Where sessionid=p_sessionid And queryid = p_queryid And whereclause=p_whereclause limit 1;

If @tabletodelete is null Then
leave LoopStart;
END IF;

SET @s = CONCAT(‘Drop table ‘ , @tabletodelete, ‘;’ );

PREPARE stmt FROM @s;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

Delete from paginationHdr Where sessionid=p_sessionid And queryid = p_queryid And whereclause=p_whereclause And temptable=@tabletodelete;
END Loop;

Insert into paginationHdr Values (p_sessionid,p_queryid,@v_temptable, p_whereclause) ;

SET @s = CONCAT(‘Create table ‘ , @v_temptable , @v_basequery ,p_whereclause,’ order by ‘,p_orderby,’;’ );

PREPARE stmt FROM @s;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

SET @s = CONCAT(‘Update ‘ , @v_temptable , ‘ Set pg99= CASE WHEN MOD(rnk99,’,@v_pagesize,’)=0 Then rnk99 DIV ‘,@v_pagesize,’ WHEN MOD(rnk99,’,@v_pagesize,’) > 0 THEN rnk99 DIV ‘,@v_pagesize,’ +1 END ;’ );

PREPARE stmt FROM @s;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END IF;

SET @s = CONCAT(‘Select * from ‘ , @v_temptable , ‘ Where pg99=’ ,p_page,’;’ );

PREPARE stmt FROM @s;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

END//

 

 

Using R function in Vertica


Benefits

R programming language is a powerful analytical language especially used by data scientists for statistical analytics. R is a single threaded application which makes it challenging for big data implementation.

HP Vertica offers a robust and ever growing set of advanced database analytics functionality. It is a standards based relational database with full support for SQL, JDBC and ODBC.

The major advantage of using integrating R into Vertica is to utilize the power of parallel processing by implementing multiple instance of R and running R algorithms in parallel with queries that chunk the data independently. Vertica threads communicate with R processes to compute the results. It uses optimized data conversion from Vertica tables to R data frames and all R processing is automatically processed between Vertica servers. Picture below sourced from Vertica.com

vertica

Also, Vertica R implementation provides better performance since computation runs on a server instead of a client and have an optimized data flow channel between DB and R without the need of parsing again. It uses Udx framework instead of ODBC. Since Vertica is column store and R is vector based it is very efficient to move data from a Vertica column in very large blocks to R vectors.

Tools needed.

Install R package for Vertica. Community edition can be downloaded from my.vertica.com. The version must match with Vertica upto 3 places. x.xxx. Installation will guide in case if there is any dependency pending.

Usage

User Defined Functions can be created in R and can be used in Vertica. UDF’s will be working in fenced mode in a process outside of the main Vertica process. We need to create scalar functions and Transform Functions using R language. NULL values in Vertica are translated to NA values in R and vice versa. Data types are mapped between systems as follows:

HP Vertica Data Type R Data Type
Boolean logical
Date/Time:
DATE, DATETIME, SMALLDATETIME, TIME, TIMESTAMP, TIMESTAMPZ, TIMETZ
numeric
Approximate Numeric:
DOUBLE PRECISION, FLOAT, REAL
numeric
Exact Numeric:
BIGINT, DECIMAL, INT, NUMERIC, NUMBER, MONEY
numeric
BINARY, VARBINARY character
CHAR, VARCHAR character

The columns selected in Vertica are passed to R data.frame. The UDF  are loaded into Vertica using CREATE LIBRARY syntax and by writing R code that has two main parts,

# A main function – which is the actual function to do the business

# A factory function – which tells Vertica engine about the function type, input and output to expect.

Example :

In the .R file

mul<- function(x)
{
        pr <- x[,1] * x[,2]
        pr
}

mulFactory <- function()
{
        list(name=mul,udxtype=c(“scalar”),intype=c(“float”,”float”), outtype=c(“float”), outtypecallback=mulReturnType)
}

In Vertica

dbadmin=> CREATE LIBRARY mulLib AS ‘/usr/home/dbadmin/mul.R’ LANGUAGE ‘R’;
CREATE LIBRARY
dbadmin=> CREATE FUNCTION mul AS NAME ‘mulFactory’ LIBRARY schema1.mulLib;
CREATE FUNCTION

Calling function in Vertica

Dbadmin=> select price, qty, mul(price,qty)as amount from sales;

 

Source: My.Vertica.com, Vertica.com