Using R function in Vertica


Benefits

R programming language is a powerful analytical language especially used by data scientists for statistical analytics. R is a single threaded application which makes it challenging for big data implementation.

HP Vertica offers a robust and ever growing set of advanced database analytics functionality. It is a standards based relational database with full support for SQL, JDBC and ODBC.

The major advantage of using integrating R into Vertica is to utilize the power of parallel processing by implementing multiple instance of R and running R algorithms in parallel with queries that chunk the data independently. Vertica threads communicate with R processes to compute the results. It uses optimized data conversion from Vertica tables to R data frames and all R processing is automatically processed between Vertica servers. Picture below sourced from Vertica.com

vertica

Also, Vertica R implementation provides better performance since computation runs on a server instead of a client and have an optimized data flow channel between DB and R without the need of parsing again. It uses Udx framework instead of ODBC. Since Vertica is column store and R is vector based it is very efficient to move data from a Vertica column in very large blocks to R vectors.

Tools needed.

Install R package for Vertica. Community edition can be downloaded from my.vertica.com. The version must match with Vertica upto 3 places. x.xxx. Installation will guide in case if there is any dependency pending.

Usage

User Defined Functions can be created in R and can be used in Vertica. UDF’s will be working in fenced mode in a process outside of the main Vertica process. We need to create scalar functions and Transform Functions using R language. NULL values in Vertica are translated to NA values in R and vice versa. Data types are mapped between systems as follows:

HP Vertica Data Type R Data Type
Boolean logical
Date/Time:
DATE, DATETIME, SMALLDATETIME, TIME, TIMESTAMP, TIMESTAMPZ, TIMETZ
numeric
Approximate Numeric:
DOUBLE PRECISION, FLOAT, REAL
numeric
Exact Numeric:
BIGINT, DECIMAL, INT, NUMERIC, NUMBER, MONEY
numeric
BINARY, VARBINARY character
CHAR, VARCHAR character

The columns selected in Vertica are passed to R data.frame. The UDF  are loaded into Vertica using CREATE LIBRARY syntax and by writing R code that has two main parts,

# A main function – which is the actual function to do the business

# A factory function – which tells Vertica engine about the function type, input and output to expect.

Example :

In the .R file

mul<- function(x)
{
        pr <- x[,1] * x[,2]
        pr
}

mulFactory <- function()
{
        list(name=mul,udxtype=c(“scalar”),intype=c(“float”,”float”), outtype=c(“float”), outtypecallback=mulReturnType)
}

In Vertica

dbadmin=> CREATE LIBRARY mulLib AS ‘/usr/home/dbadmin/mul.R’ LANGUAGE ‘R’;
CREATE LIBRARY
dbadmin=> CREATE FUNCTION mul AS NAME ‘mulFactory’ LIBRARY schema1.mulLib;
CREATE FUNCTION

Calling function in Vertica

Dbadmin=> select price, qty, mul(price,qty)as amount from sales;

 

Source: My.Vertica.com, Vertica.com

Comment please...