Benefits
R programming language is a powerful analytical language especially used by data scientists for statistical analytics. R is a single threaded application which makes it challenging for big data implementation.
HP Vertica offers a robust and ever growing set of advanced database analytics functionality. It is a standards based relational database with full support for SQL, JDBC and ODBC.
The major advantage of using integrating R into Vertica is to utilize the power of parallel processing by implementing multiple instance of R and running R algorithms in parallel with queries that chunk the data independently. Vertica threads communicate with R processes to compute the results. It uses optimized data conversion from Vertica tables to R data frames and all R processing is automatically processed between Vertica servers. Picture below sourced from Vertica.com
Also, Vertica R implementation provides better performance since computation runs on a server instead of a client and have an optimized data flow channel between DB and R without the need of parsing again. It uses Udx framework instead of ODBC. Since Vertica is column store and R is vector based it is very efficient to move data from a Vertica column in very large blocks to R vectors.
Tools needed.
Install R package for Vertica. Community edition can be downloaded from my.vertica.com. The version must match with Vertica upto 3 places. x.xxx. Installation will guide in case if there is any dependency pending.
Usage
User Defined Functions can be created in R and can be used in Vertica. UDF’s will be working in fenced mode in a process outside of the main Vertica process. We need to create scalar functions and Transform Functions using R language. NULL values in Vertica are translated to NA values in R and vice versa. Data types are mapped between systems as follows:
HP Vertica Data Type | R Data Type |
Boolean | logical |
Date/Time: DATE, DATETIME, SMALLDATETIME, TIME, TIMESTAMP, TIMESTAMPZ, TIMETZ |
numeric |
Approximate Numeric: DOUBLE PRECISION, FLOAT, REAL |
numeric |
Exact Numeric: BIGINT, DECIMAL, INT, NUMERIC, NUMBER, MONEY |
numeric |
BINARY, VARBINARY | character |
CHAR, VARCHAR | character |
The columns selected in Vertica are passed to R data.frame. The UDF are loaded into Vertica using CREATE LIBRARY syntax and by writing R code that has two main parts,
# A main function – which is the actual function to do the business
# A factory function – which tells Vertica engine about the function type, input and output to expect.
Example :
In the .R file
mul<- function(x)
{
pr <- x[,1] * x[,2]
pr
}
mulFactory <- function()
{
list(name=mul,udxtype=c(“scalar”),intype=c(“float”,”float”), outtype=c(“float”), outtypecallback=mulReturnType)
}
In Vertica
dbadmin=> CREATE LIBRARY mulLib AS ‘/usr/home/dbadmin/mul.R’ LANGUAGE ‘R’;
CREATE LIBRARY
dbadmin=> CREATE FUNCTION mul AS NAME ‘mulFactory’ LIBRARY schema1.mulLib;
CREATE FUNCTION
Calling function in Vertica
Dbadmin=> select price, qty, mul(price,qty)as amount from sales;
Source: My.Vertica.com, Vertica.com