LightDB-A Database R Client
This chapter contains the following information:
- About LightDB-A R
- Supported Platforms
- Prerequisites
- Installing the LightDB-A R Client
- Example Data Sets
- Using the LightDB-A R Client
- Limitations
- Function Summary
About LightDB-A R
The LightDB-A R Client (LightDB-AR) is an interactive in-database data analytics tool for LightDB-A Database. The client provides an R interface to tables and views, and requires no SQL knowledge to operate on these database objects.
You can use LightDB-AR with the LightDB-A PL/R procedural language to run an R function on data stored in LightDB-A Database. LightDB-AR parses the R function and creates a user-defined function (UDF) for execution in LightDB-A. LightDB-A runs the UDF in parallel on the segment hosts.
You can similarly use LightDB-AR with LightDB-A PL/Container 3 (Beta), to run an R function against LightDB-A data in a high-performance R sandbox runtime environment.
No analytic data is loaded into R when you use LightDB-AR, a key requirement when dealing with large data sets. Only the R function and minimal data is transferred between R and LightDB-A.
Supported Platforms
LightDB-AR supports the following component versions:
LightDB-AR Version | R Version | LightDB-A Version | PL/R Version | PL/Container Version |
---|---|---|---|---|
1,1,0, 1.0.0 | 3.6+ | 6.1+ | 3.0.2+ | 3.0.0 (Beta) |
Prerequisites
You can use LightDB-AR with LightDB-A Database and the PL/R and PL/Container procedural languages. Before you install and run LightDB-AR on a client system:
- Ensure that your LightDB-A Database installation is running version 6.1 or newer.
- Ensure that your client development system has connectivity to the LightDB-A Database coordinator host.
- Ensure that
R
version 3.6.0 or newer is installed on your client system, and that you set the$R_HOME
environment variable appropriately. - Determine the procedural language(s) you plan to use with LightDB-AR, and ensure that the language(s) is installed and configured in your LightDB-A Database cluster. Refer to PL/R Language and PL/Container Language (3.0 Beta) for language installation and configuration instructions.
Verify that you have registered the procedural language(s) in each database in which you plan to use LightDB-AR to read data from or write data to LightDB-A. For example, the following command lists the extensions and languages registered in the database named
testdb
:$ psql -h gpcoordinator -d testdb -c '\dx' List of installed extensions Name | Version | Schema | Description -------------+----------+------------+---------------------------------------------------------------- plcontainer | 1.0.0 | public | GPDB execution sandboxing for Python and R plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language plr | 8.3.0.16 | public | load R interpreter and run R script from within a database
Check for the
plr
and/orplcontainer
extensionName
.
Installing the LightDB-A R Client
LightDB-AR is an R package. You obtain the package from VMware Tanzu Network (or the LightDB-AR github
repository) and install the package within the R console.
Download the package from the LightDB-A Procedural Languages filegroup on VMware Tanzu Network. The naming format of the downloaded file is
greenplumR‑<version>‑gp6.tar.gz
.Note the file system location of the downloaded file.
Follow the instructions in Verifying the LightDB-A Database Software Download to verify the integrity of the LightDB-A Procedural Languages LightDB-AR software.
Install the dependent R packages:
ini
,shiny
, andRPostgreSQL
. For example, enter the R console and run the following, or equivalent, command:user@clientsys$ R > install.packages(c("ini", "shiny", "RPostgreSQL"))
You may be prompted to select a CRAN download mirror. And, depending on your client system configuration, you may be prompted to use or create a personal library.
After downloading, R builds and installs these and dependent packages.
Install the LightDB-AR R package:
> install.packages("/path/to/greenplumR‑<version>‑gp6.tar.gz", repos = NULL, type = "source")
Install the
Rdpack
documentation package:> install.packages("Rdpack")
Example Data Sets
LightDB-AR includes the abalone
and null.data
sample datasets. Refer to the reference pages for more information:
> help( abalone )
> help( null.data )
Using the LightDB-A R Client
You use LightDB-AR to perform in-database analytics. Typical operations that you may perform include:
- Loading the LightDB-AR package.
- Connecting to and disconnecting from LightDB-A Database.
- Examining database objects.
- Analyzing and manipulating data.
- Running R functions in LightDB-A Database.
Loading LightDB-AR
Use the R library()
function to load LightDB-AR:
user@clientsys$ R
> library("LightDB-AR")
Connecting to LightDB-A Database
The db.connect()
and db.connect.dsn()
LightDB-AR functions establish a connection to LightDB-A Database. The db.disconnect()
function closes a database connection.
The LightDB-AR connect and disconnect function signatures follow:
db.connect( host = "localhost", user = Sys.getenv("USER"),
dbname = user, password = "", port = 5432, conn.pkg = "RPostgreSQL",
default.schemas = NULL, verbose = TRUE, quick = FALSE )
db.connect.dsn( dsn.key, db.ini = "~/db.ini", default.schemas = NULL,
verbose = TRUE, quick = FALSE )
db.disconnect( conn.id = 1, verbose = TRUE, force = FALSE )
When you connect to LightDB-A Database, you provide the coordinator host, port, database name, user name, password, and other information via function arguments or a data source name (DSN) file. If you do not specify an argument or value, LightDB-AR uses the default.
The db.connect[.dsn]()
functions return an integer connection identifier. You specify this identifier when you operate on tables or views in the database. You also specify this identifier when you close the connection.
The db.disconnect()
function returns a logical that identifies whether or not the connection was successfully disconnected.
To list and display information about active LightDB-A connections, use the db.list()
function.
Example:
## connect to LightDB-A database named testdb on host gpcoordinator
> cid_to_testdb <- db.connect( host = "gpcoordinator", port=5432, dbname = "testdb" )
Loading required package: DBI
Created a connection to database with ID 1
[1] 1
> db.list()
Database Connection Info
## -------------------------------
[Connection ID 1]
Host : gpcoordinator
User : gpadmin
Database : testdb
DBMS : LightDB-A 6
> db.disconnect( cid_to_testdb )
Connection 1 is disconnected!
[1] TRUE
Examining Database Obects
The db.objects()
function lists the tables and views in the database identified by a specific connection identifier. The function signature is:
db.objects( search = NULL, conn.id = 1 )
If you choose, you can specify a filter string to narrow the returned results. For example, to list the tables and views in the public
schema in the database identified by the default connection identifier, invoke the function as follows:
> db.objects( search = "public." )
Analyzing and Manipulating Data
The fundamental data structure of R is the data.frame
. A data frame is a collection of variables represented as a list of vectors. LightDB-AR operates on db.data.frame
objects, and exposes functions to convert to and manipulate objects of this type:
-
as.db.data.frame()
- writes data in a file or adata.frame
into a LightDB-A table. You can also use the function to write the results of a query into a table, or to create a localdb.data.frame
. -
db.data.frame()
- creates a temporary R object that references a view or table in a LightDB-A database. No data is loaded into R when you use this function.
Example:
## create a db.data.frame from the abalone example data set;
## abalone is a data.frame
> abdf1 <- as.db.data.frame(abalone, conn.id = cid_to_testdb, verbose = FALSE)
## sort on the id column and preview the first 5 rows
> lk( sort( abdf1, INDICES=abdf1$id ), 5 )
id sex length diameter height whole shucked viscera shell rings
1 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
2 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
3 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
4 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
5 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
## write the data frame to a LightDB-A table named abalone_from_r;
## use most of the function defaults
> as.db.data.frame( abdf1, table.name = "public.abalone_from_r" )
The data contained in table "pg_temp_93"."gp_temp_5bdf4ec7_42f9_9f9799_a0d76231be8f" which is wrapped by abdf1 is c
opied into "public"."abalone_from_r" in database testdb on gpcoordinator !
Table : "public"."abalone_from_r"
Database : testdb
Host : gpcoordinator
Connection : 1
## list database objects, should display the newly created table
> db.objects( search = "public.")
[1] "public.abalone_from_r"
Running R Functions in LightDB-A Database
LightDB-AR supports two functions that allow you to run an R function, in-database, on every row of a LightDB-A Database table: db.gpapply() and db.gptapply(). You can use the LightDB-A PL/R or PL/Container procedural language as the vehicle in which to run the function.
The function signatures follow:
db.gpapply( X, MARGIN = NULL, FUN = NULL, output.name = NULL, output.signature = NULL,
clear.existing = FALSE, case.sensitive = FALSE, output.distributeOn = NULL,
debugger.mode = FALSE, runtime.id = "plc_r_shared", language = "plcontainer",
input.signature = NULL, ... )
db.gptapply( X, INDEX, FUN = NULL, output.name = NULL, output.signature = NULL,
clear.existing = FALSE, case.sensitive = FALSE,
output.distributeOn = NULL, debugger.mode = FALSE,
runtime.id = "plc_r_shared", language = "plcontainer",
input.signature = NULL, ... )
Use the second variant of the function when the table data is indexed.
By default, db.gp[t]apply()
passes a single data frame input argument to the R function FUN
. If you define FUN
to take a list of arguments, you must specify the function argument name to LightDB-A table column name mapping in input.signature
. You must specify this mapping in table column order.
Example 1:
Create a LightDB-A table named table1
in the database named testdb
. This table has a single integer-type field. Populate the table with some data:
user@clientsys$ psql -h gpcoordinator -d testdb
testdb=# CREATE TABLE table1( id int );
testdb=# INSERT INTO table1 SELECT generate_series(1,13);
testdb=# \q
Create an R function that increments an integer. Run the function on table1
in LightDB-A using the PL/R procedural language. Then write the new values to a table named table1_r_inc
:
user@clientsys$ R
> ## create a reference to table1
> t1 <- db.data.frame("public.table1")
> ## create an R function that increment an integer by 1
> fn.function_plus_one <- function(num)
{
return (num[[1]] + 1)
}
> ## create the function output signature
> .sig <- list( "num" = "int" )
> ## run the function in LightDB-A and print
> x <- db.gpapply( t1, output.name = NULL, FUN = fn.function_plus_one,
output.signature = .sig, clear.existing = TRUE, case.sensitive = TRUE, language = "plr" )
> print(x)
num
1 2
2 6
3 12
4 13
5 3
6 4
7 5
8 7
9 8
10 9
11 10
12 11
13 14
> ## run the function in LightDB-A and write to the output table
> db.gpapply(t1, output.name = "public.table1_r_inc", FUN = fn.function_plus_one,
output.signature = .sig, clear.existing = TRUE, case.sensitive = TRUE,
language = "plr" )
## list database objects, should display the newly created table
> db.objects( search = "public.")
[1] "public.abalone_from_r" "public.table1_r_inc"
Example 2:
Create a LightDB-A table named table2
in the database named testdb
. This table has two integer-type fields. Populate the table with some data:
user@clientsys$ psql -h gpcoordinator -d testdb
testdb=# CREATE TABLE table2( c1 int, c2 int );
testdb=# INSERT INTO table2 VALUES (1, 2);
testdb=#\q
Create an R function that takes two integer arguments, manipulates the arguments, and returns both. Run the function on the data in table2
in LightDB-A using the PL/R procedural language, writing the new values to a table named table2_r_upd
:
user@clientsys$ R
> ## create a reference to table2
> t2 <- db.data.frame("public.table2")
> ## create the R function
> fn.func_with_two_args <- function(a, b)
{
a <- a * 20
b <- a + 66
c <- list(a, b)
return (as.data.frame(c))
}
> ## create the function input signature, mapping function argument name to
> ## table column name
> input.sig <- list('a' = 'c1', 'b' = 'c2')
> ## create the function output signature
> return.sig <- list('a' = 'int', 'b' = 'int')
> ## run the function in LightDB-A and write to the output table
> db.gpapply(t2, output.name = "public.table2_r_upd", FUN = fn.func_with_two_args,
output.signature = return.sig, clear.existing = TRUE, case.sensitive = TRUE,
language = "plr", input.signature = input.sig )
View the contents of the LightDB-A table named table2_r_upd
:
user@clientsys$ psql -h gpcoordinator -d testdb
testdb=# SELECT * FROM table2_r_upd;
a | b
----+----
20 | 86
(1 row)
testdb=# \q
Limitations
LightDB-AR has the following limitations:
- The
db.gpapply()
anddb.gptapply()
functions currently operate only onconn.id = 1
. - The LightDB-AR function reference pages are not yet published. You can find
html
versions of the reference pages in your R libraryR/<platform>-library/3.6/LightDB-AR/html
directory. Open the00Index.html
file in the browser of your choice and select the function of interest. - The
LightDB-AR()
graphical user interface is currently unsupported.
Function Summary
LightDB-AR provides several functions. To obtain reference information for LightDB-AR functions while running the R console, invoke the help()
function. For example, to display the reference information for the LightDB-AR db.data.frame()
function:
> help( db.data.frame )
LightDB-AR functions include:
Category | Name | Description |
---|---|---|
Aggregate Functions | mean(), sum(), count(), max(), min(), sd(), var(), colMeans(), colSums(), colAgg(), db.array() | Functions that perform a calculation on multiple values and return a single value. |
Connectivity Functions | connection info | Extract connection information. |
db.connect() | Create a database connection. | |
db.connect.dsn() | Create a database connection using a DSN. | |
db.disconnect() | Disconnect a database connection. | |
db.list() | List database connections. | |
Database Object Functions | as.db.data.frame() | Create a db.data.frame from a file or data.frame, optionally writing the data to a LightDB-A table. |
db.data.frame() | Create a data frame that references a view or table in the database. | |
db.objects() | List the table and view objects in the database. | |
db.existsObject() | Identifies whether a table or view exists in the database. | |
Mathematical Functions | exp(), abs(), log(), log10(), sign(), sqrt(), factorial(), sin(), cos(), tan(), asin(), acos(), atan(), atan2() | Mathematical functions that take db.obj as an argument. |
LightDB-A Functions | db.gpapply() | Wrap an R function with a UDF and run it on every row of data in a LightDB-A table. |
db.gptapply() | Wrap an R function with a UDF and run it on every row of data grouped by an index in a LightDB-A table. |