LightDB-A Database R Client

This chapter contains the following information:

About LightDB-A R

The LightDB-A R Client (LightDB-AR) is an interactive in-database data analytics tool for LightDB-A Database. The client provides an R interface to tables and views, and requires no SQL knowledge to operate on these database objects.

You can use LightDB-AR with the LightDB-A PL/R procedural language to run an R function on data stored in LightDB-A Database. LightDB-AR parses the R function and creates a user-defined function (UDF) for execution in LightDB-A. LightDB-A runs the UDF in parallel on the segment hosts.

You can similarly use LightDB-AR with LightDB-A PL/Container 3 (Beta), to run an R function against LightDB-A data in a high-performance R sandbox runtime environment.

No analytic data is loaded into R when you use LightDB-AR, a key requirement when dealing with large data sets. Only the R function and minimal data is transferred between R and LightDB-A.

Supported Platforms

LightDB-AR supports the following component versions:

LightDB-AR Version R Version LightDB-A Version PL/R Version PL/Container Version
1,1,0, 1.0.0 3.6+ 6.1+ 3.0.2+ 3.0.0 (Beta)

Prerequisites

You can use LightDB-AR with LightDB-A Database and the PL/R and PL/Container procedural languages. Before you install and run LightDB-AR on a client system:

  • Ensure that your LightDB-A Database installation is running version 6.1 or newer.
  • Ensure that your client development system has connectivity to the LightDB-A Database coordinator host.
  • Ensure that R version 3.6.0 or newer is installed on your client system, and that you set the $R_HOME environment variable appropriately.
  • Determine the procedural language(s) you plan to use with LightDB-AR, and ensure that the language(s) is installed and configured in your LightDB-A Database cluster. Refer to PL/R Language and PL/Container Language (3.0 Beta) for language installation and configuration instructions.
  • Verify that you have registered the procedural language(s) in each database in which you plan to use LightDB-AR to read data from or write data to LightDB-A. For example, the following command lists the extensions and languages registered in the database named testdb:

    $ psql -h gpcoordinator -d testdb -c '\dx'
                                         List of installed extensions
        Name     | Version  |   Schema   |                          Description     
    
    -------------+----------+------------+----------------------------------------------------------------
     plcontainer | 1.0.0    | public     | GPDB execution sandboxing for Python and R
     plpgsql     | 1.0      | pg_catalog | PL/pgSQL procedural language
     plr         | 8.3.0.16 | public     | load R interpreter and run R script from within a database
    
    

    Check for the plr and/or plcontainer extension Name.

Installing the LightDB-A R Client

LightDB-AR is an R package. You obtain the package from VMware Tanzu Network (or the LightDB-AR github repository) and install the package within the R console.

  1. Download the package from the LightDB-A Procedural Languages filegroup on VMware Tanzu Network. The naming format of the downloaded file is greenplumR‑<version>‑gp6.tar.gz.

    Note the file system location of the downloaded file.

  2. Follow the instructions in Verifying the LightDB-A Database Software Download to verify the integrity of the LightDB-A Procedural Languages LightDB-AR software.

  3. Install the dependent R packages: ini, shiny, and RPostgreSQL. For example, enter the R console and run the following, or equivalent, command:

    user@clientsys$ R
    > install.packages(c("ini", "shiny", "RPostgreSQL"))
    

    You may be prompted to select a CRAN download mirror. And, depending on your client system configuration, you may be prompted to use or create a personal library.

    After downloading, R builds and installs these and dependent packages.

  4. Install the LightDB-AR R package:

    > install.packages("/path/to/greenplumR‑<version>‑gp6.tar.gz", repos = NULL, type = "source")
    
  5. Install the Rdpack documentation package:

    > install.packages("Rdpack")
    

Example Data Sets

LightDB-AR includes the abalone and null.data sample datasets. Refer to the reference pages for more information:

> help( abalone )
> help( null.data )

Using the LightDB-A R Client

You use LightDB-AR to perform in-database analytics. Typical operations that you may perform include:

  • Loading the LightDB-AR package.
  • Connecting to and disconnecting from LightDB-A Database.
  • Examining database objects.
  • Analyzing and manipulating data.
  • Running R functions in LightDB-A Database.

Loading LightDB-AR

Use the R library() function to load LightDB-AR:

user@clientsys$ R
> library("LightDB-AR")

Connecting to LightDB-A Database

The db.connect() and db.connect.dsn() LightDB-AR functions establish a connection to LightDB-A Database. The db.disconnect() function closes a database connection.

The LightDB-AR connect and disconnect function signatures follow:

db.connect( host = "localhost", user = Sys.getenv("USER"),
            dbname = user, password = "", port = 5432, conn.pkg = "RPostgreSQL",
            default.schemas = NULL, verbose = TRUE, quick = FALSE )

db.connect.dsn( dsn.key, db.ini = "~/db.ini", default.schemas = NULL,
                verbose = TRUE, quick = FALSE )

db.disconnect( conn.id = 1, verbose = TRUE, force = FALSE )

When you connect to LightDB-A Database, you provide the coordinator host, port, database name, user name, password, and other information via function arguments or a data source name (DSN) file. If you do not specify an argument or value, LightDB-AR uses the default.

The db.connect[.dsn]() functions return an integer connection identifier. You specify this identifier when you operate on tables or views in the database. You also specify this identifier when you close the connection.

The db.disconnect() function returns a logical that identifies whether or not the connection was successfully disconnected.

To list and display information about active LightDB-A connections, use the db.list() function.

Example:

## connect to LightDB-A database named testdb on host gpcoordinator
> cid_to_testdb <- db.connect( host = "gpcoordinator", port=5432, dbname = "testdb" )
Loading required package: DBI
Created a connection to database with ID 1 
[1] 1

> db.list()
Database Connection Info
## -------------------------------
[Connection ID 1]
Host     :    gpcoordinator
User     :    gpadmin
Database :    testdb
DBMS     :    LightDB-A 6

> db.disconnect( cid_to_testdb )
Connection 1 is disconnected!
[1] TRUE

Examining Database Obects

The db.objects() function lists the tables and views in the database identified by a specific connection identifier. The function signature is:

db.objects( search = NULL, conn.id = 1 )

If you choose, you can specify a filter string to narrow the returned results. For example, to list the tables and views in the public schema in the database identified by the default connection identifier, invoke the function as follows:

> db.objects( search = "public." )

Analyzing and Manipulating Data

The fundamental data structure of R is the data.frame. A data frame is a collection of variables represented as a list of vectors. LightDB-AR operates on db.data.frame objects, and exposes functions to convert to and manipulate objects of this type:

  • as.db.data.frame() - writes data in a file or a data.frame into a LightDB-A table. You can also use the function to write the results of a query into a table, or to create a local db.data.frame.
  • db.data.frame() - creates a temporary R object that references a view or table in a LightDB-A database. No data is loaded into R when you use this function.

Example:

## create a db.data.frame from the abalone example data set;
## abalone is a data.frame
> abdf1 <- as.db.data.frame(abalone, conn.id = cid_to_testdb, verbose = FALSE)

## sort on the id column and preview the first 5 rows
> lk( sort( abdf1, INDICES=abdf1$id ), 5 )
  id sex length diameter height  whole shucked viscera shell rings
1  1   M  0.455    0.365  0.095 0.5140  0.2245  0.1010 0.150    15
2  2   M  0.350    0.265  0.090 0.2255  0.0995  0.0485 0.070     7
3  3   F  0.530    0.420  0.135 0.6770  0.2565  0.1415 0.210     9
4  4   M  0.440    0.365  0.125 0.5160  0.2155  0.1140 0.155    10
5  5   I  0.330    0.255  0.080 0.2050  0.0895  0.0395 0.055     7

## write the data frame to a LightDB-A table named abalone_from_r;
## use most of the function defaults
> as.db.data.frame( abdf1, table.name = "public.abalone_from_r" ) 
The data contained in table "pg_temp_93"."gp_temp_5bdf4ec7_42f9_9f9799_a0d76231be8f" which is wrapped by abdf1 is c
opied into "public"."abalone_from_r" in database testdb on gpcoordinator !
Table       :    "public"."abalone_from_r"
Database    :    testdb
Host        :    gpcoordinator
Connection  :    1

## list database objects, should display the newly created table
> db.objects( search = "public.")
[1] "public.abalone_from_r"

Running R Functions in LightDB-A Database

LightDB-AR supports two functions that allow you to run an R function, in-database, on every row of a LightDB-A Database table: db.gpapply() and db.gptapply(). You can use the LightDB-A PL/R or PL/Container procedural language as the vehicle in which to run the function.

The function signatures follow:

db.gpapply( X, MARGIN = NULL, FUN = NULL, output.name = NULL, output.signature = NULL,
            clear.existing = FALSE, case.sensitive = FALSE, output.distributeOn = NULL,
            debugger.mode = FALSE, runtime.id = "plc_r_shared", language = "plcontainer",
            input.signature = NULL, ... )

db.gptapply( X, INDEX, FUN = NULL, output.name = NULL, output.signature = NULL,
             clear.existing = FALSE, case.sensitive = FALSE,
             output.distributeOn = NULL, debugger.mode = FALSE,
             runtime.id = "plc_r_shared", language = "plcontainer",
             input.signature = NULL, ... )

Use the second variant of the function when the table data is indexed.

By default, db.gp[t]apply() passes a single data frame input argument to the R function FUN. If you define FUN to take a list of arguments, you must specify the function argument name to LightDB-A table column name mapping in input.signature. You must specify this mapping in table column order.

Example 1:

Create a LightDB-A table named table1 in the database named testdb. This table has a single integer-type field. Populate the table with some data:

user@clientsys$ psql -h gpcoordinator -d testdb
testdb=# CREATE TABLE table1( id int );
testdb=# INSERT INTO table1 SELECT generate_series(1,13);
testdb=# \q

Create an R function that increments an integer. Run the function on table1 in LightDB-A using the PL/R procedural language. Then write the new values to a table named table1_r_inc:

user@clientsys$ R
> ## create a reference to table1
> t1 <- db.data.frame("public.table1")

> ## create an R function that increment an integer by 1
> fn.function_plus_one <- function(num)
{
    return (num[[1]] + 1)
}

> ## create the function output signature
> .sig <- list( "num" = "int" )

> ## run the function in LightDB-A and print
> x <- db.gpapply( t1, output.name = NULL, FUN = fn.function_plus_one,
  output.signature = .sig, clear.existing = TRUE, case.sensitive = TRUE, language = "plr" )
> print(x)
   num
1    2
2    6
3   12
4   13
5    3
6    4
7    5
8    7
9    8
10   9
11  10
12  11
13  14

> ## run the function in LightDB-A and write to the output table
> db.gpapply(t1, output.name = "public.table1_r_inc", FUN = fn.function_plus_one,
   output.signature = .sig, clear.existing = TRUE, case.sensitive = TRUE,
   language = "plr" )

## list database objects, should display the newly created table
> db.objects( search = "public.")
[1] "public.abalone_from_r"     "public.table1_r_inc"

Example 2:

Create a LightDB-A table named table2 in the database named testdb. This table has two integer-type fields. Populate the table with some data:

user@clientsys$ psql -h gpcoordinator -d testdb
testdb=# CREATE TABLE table2( c1 int, c2 int );
testdb=# INSERT INTO table2 VALUES (1, 2);
testdb=#\q

Create an R function that takes two integer arguments, manipulates the arguments, and returns both. Run the function on the data in table2 in LightDB-A using the PL/R procedural language, writing the new values to a table named table2_r_upd:

user@clientsys$ R
> ## create a reference to table2
> t2 <- db.data.frame("public.table2")

> ## create the R function
> fn.func_with_two_args <- function(a, b)
{
    a <- a * 20
    b <- a + 66
    c <- list(a, b)
    return (as.data.frame(c))
}

> ## create the function input signature, mapping function argument name to
> ## table column name
> input.sig <- list('a' = 'c1', 'b' = 'c2')

> ## create the function output signature
> return.sig <- list('a' = 'int', 'b' = 'int')

> ## run the function in LightDB-A and write to the output table
> db.gpapply(t2, output.name = "public.table2_r_upd", FUN = fn.func_with_two_args,
   output.signature = return.sig, clear.existing = TRUE, case.sensitive = TRUE,
   language = "plr", input.signature = input.sig )

View the contents of the LightDB-A table named table2_r_upd:

user@clientsys$ psql -h gpcoordinator -d testdb
testdb=# SELECT * FROM table2_r_upd;
 a  | b  
----+----
 20 | 86
(1 row)
testdb=# \q

Limitations

LightDB-AR has the following limitations:

  • The db.gpapply() and db.gptapply() functions currently operate only on conn.id = 1.
  • The LightDB-AR function reference pages are not yet published. You can find html versions of the reference pages in your R library R/<platform>-library/3.6/LightDB-AR/html directory. Open the 00Index.html file in the browser of your choice and select the function of interest.
  • The LightDB-AR() graphical user interface is currently unsupported.

Function Summary

LightDB-AR provides several functions. To obtain reference information for LightDB-AR functions while running the R console, invoke the help() function. For example, to display the reference information for the LightDB-AR db.data.frame() function:

> help( db.data.frame )

LightDB-AR functions include:

Table 2. LightDB-AR Functions
Category Name Description
Aggregate Functions mean(), sum(), count(), max(), min(), sd(), var(), colMeans(), colSums(), colAgg(), db.array() Functions that perform a calculation on multiple values and return a single value.
Connectivity Functions connection info Extract connection information.
db.connect() Create a database connection.
db.connect.dsn() Create a database connection using a DSN.
db.disconnect() Disconnect a database connection.
db.list() List database connections.
Database Object Functions as.db.data.frame() Create a db.data.frame from a file or data.frame, optionally writing the data to a LightDB-A table.
db.data.frame() Create a data frame that references a view or table in the database.
db.objects() List the table and view objects in the database.
db.existsObject() Identifies whether a table or view exists in the database.
Mathematical Functions exp(), abs(), log(), log10(), sign(), sqrt(), factorial(), sin(), cos(), tan(), asin(), acos(), atan(), atan2() Mathematical functions that take db.obj as an argument.
LightDB-A Functions db.gpapply() Wrap an R function with a UDF and run it on every row of data in a LightDB-A table.
db.gptapply() Wrap an R function with a UDF and run it on every row of data grouped by an index in a LightDB-A table.