2
0
Fork 0
R Package implementing the CVE (Conditional Variance Estimation) Method for SDR (Sufficient Dimension Reduction).
Go to file
Daniel Kapla d5cd6075a9 cleanup: removed old and legacy code 2019-12-16 17:40:30 +01:00
CVE - update: doc, 2019-12-16 17:34:35 +01:00
LaTeX fix: momentum bug, 2019-12-10 08:45:07 +01:00
README.md rewrote (C): to use new Gradient formula, 2019-12-05 17:35:29 +01:00

README.md

TODOs

Doc:

  • Stiefel (instead of Stiefl)
  • Return value description (@returs)
  • DESCRIPTION
    • Maintainer
    • Author
    • Volume
    • Description (from Paper) and Ref.
  • Ref paper in doc
  • Data set descriptions and augmentations.
  • Demonstration of the Logger function usage (Demo file or so, ...)
  • Update Paper (to new version / version consistent with current code!)
  • Reference Paper in DESCRIPTION file (in Description or specific tag)
  • Split cve and cve.call docs.
  • "Copy" form dr package (specifically dr.directions -> description)
  • Document C code.

Methods to be implemented:

  • simple
  • weighted
  • momentum
  • weighted with momentum

Performance:

  • Pure C implementation.
  • [NOT Feasible] Stochastic Version
  • [NOT Feasible] Gradient Approximations (using Algebraic Software for alternative Loss function formulations and gradient optimizations)
  • [NOT Sufficient] Alternative Kernels for reducing samples
  • (To Be further investigated) "Kronecker" optimization
  • Implement "Kronecker" optimization

Features (functions):

  • Initial V.init parameter (only ONE try, ignore number of attempts parameter)
  • basis.cve list of estimated Bs (with k supplied, only B)
  • directions.cve Projected X given k
  • predict.cve using mars for predicting responses given new data.
  • predict.dim.cve Cross-validation or aov (in stats package) or "elbow" estimation
  • plot.elbow
  • summary
  • Consider cor.test for dimension selection
  • Check for user interrupt (R_CheckUserInterrupt)

Changes:

  • New estimate.bandwidth implementation. (h = 2 * (tr(\Sigma) / p) * (6/5 * n^(-1 / (4 + k)))^2, \Sigma = 1/n * (X-mean)'(X-mean))

Errors:

  • CVE_C compare to CVE_legacy.
  • fix: predict.dim not found.
  • fix/check: error computation for break condition (in cve.c)

Development

Build and install.

To build the package the devtools package is used. This also provides roxygen2 which is used for documentation and automatic creation of the NAMESPACE file.

setwd("./CVE_R")  # Set path to the package root.
library(devtools) # Load required `devtools` package.
document()        # Create `.Rd` files and write `NAMESPACE`.

Next the package needs to be build, therefore (if pure R package, aka. C/C++, Fortran, ... code) just do the following.

R CMD build CVE_C; R CMD INSTALL CVE_0.2.tar.gz

Then we are ready for using the package. As well as building the NAMESPACE and *.Rd files using devtools (roxygen2) the following resembles an entire build pipeline including checks.

R -q -e 'library(devtools); setwd("CVE_C"); pkgbuild::compile_dll(); document(); pkgbuild::clean_dll()'
R CMD build CVE_C; R CMD check CVE_0.2.tar.gz;
R CMD INSTALL CVE_0.2.tar.gz

Build and install from within R.

An alternative approach is the following.

## Installing CVE (C implementation)
(setwd('~/Projects/CVE/CVE_C'))
# equiv to Rcpp::compileAttributes().
library(devtools)
pkgbuild::compile_dll() # required for packages with C/C++ code
document()              # See bug: https://github.com/stan-dev/rstantools/issues/52
pkgbuild::clean_dll()
(path <- build(vignettes = FALSE))
install.packages(path, repos = NULL, type = "source")
library(CVE)

Note: I only recommend this approach during development.

Package Structure

Demos

A demo is an .R file that lives in demo/. Demos are like examples but tend to be longer. Instead of focusing on a single function, they show how to weave together multiple functions to solve a problem.

You list and access demos with demo():

  • Show all available demos: demo().
  • Show all demos in a package: demo(package = "CVE").
  • Run a specific demo: demo("runtime_test", package = "CVE").
  • Find a demo: system.file("demo", "runtime_test.R", package = "CVE").

Each demo must be listed in demo/00Index in the following form: demo-name Demo description. The demo name is the name of the file without the extension, e.g. demo/runtime_test.R becomes runtime_test.

By default the demo ask for human input for each plot: "Hit to see next plot". This behavior can be overridden by adding devAskNewPage(ask = FALSE) to the demo file. You can add pauses by adding: readline("press any key to continue").

Note: Demos are not automatically tested by R CMD check. This means that they can easily break without your knowledge.

General Notes for Source Code analysis

Search in multiple files.

Using the Linux grep program with the parameters -rnw and specifying a include files filter like the following example.

grep --include=*\.{c,h,R} -rnw '.' -e "sweep"

searches in all C source and header files as well as R source files for the term sweep.

Recursive directory compare with colored structure (more or less).

diff -r CVE_R/ CVE_C/ | grep -E "^([<>]|[^<>].*)"

Parsing bash script parameters.

usage="$0 [-v|--verbose] [-n|--dry-run] [(-s|--stack-size) <size>] [-h|--help] [-- [p1, [p2, ...]]]"
verbose=false
help=false
dry_run=false
stack_size=0

while [ $# -gt 0 ]; do
    case "$1" in
        -v | --verbose )    verbose=true;      shift ;;
        -n | --dry-run )    dry_run=true;      shift ;;
        -s | --stack-size ) stack_size="$2";   shift; shift ;;
        -h | --help )       echo $usage;       exit ;; # On help print usage and exit.
        -- ) shift; break ;;            # Break param "parsing".
         * ) echo $usage >&2; exit 1 ;; # Print usage and exit with failure.
    esac
done

echo verbose=$verbose
echo dry_run=$dry_run
echo stack_size=$stack_size

Analysis

Logging (a cve run).

To log loss, error (estimated) the true error (error of current estimated B against the true B) or even the step size one can use the logger parameter. A logger is a function that gets the current environment of the CVE optimization methods (do not alter this environment, only read from it). This can be used to create logs like in the following example.

library(CVE)

# Setup histories.
(epochs <- 50)
(attempts <- 10)
loss.history       <- matrix(NA, epochs + 1, attempts)
error.history      <- matrix(NA, epochs + 1, attempts)
tau.history        <- matrix(NA, epochs + 1, attempts)
true.error.history <- matrix(NA, epochs + 1, attempts)

# Create a dataset
ds <- dataset("M1")
X <- ds$X
Y <- ds$Y
B <- ds$B # the true `B`
(k <- ncol(ds$B))

# True projection matrix.
P <- B %*% solve(t(B) %*% B) %*% t(B)
# Define the logger for the `cve()` method.
logger <- function(env) {
    # Note the `<<-` assignement!
    loss.history[env$epoch + 1, env$attempt] <<- env$loss
    error.history[env$epoch + 1, env$attempt] <<- env$error
    tau.history[env$epoch + 1, env$attempt] <<- env$tau
    # Compute true error by comparing to the true `B`
    B.est <- null(env$V) # Function provided by CVE
    P.est <- B.est %*% solve(t(B.est) %*% B.est) %*% t(B.est)
    true.error <- norm(P - P.est, 'F') / sqrt(2 * k)
    true.error.history[env$epoch + 1, env$attempt] <<- true.error
}
# Performa SDR
dr <- cve(Y ~ X, k = k, logger = logger, epochs = epochs, attempts = attempts)
# Plot history's
par(mfrow = c(2, 2))
matplot(loss.history,       type = 'l', log = 'y', xlab = 'iter',
        main = 'loss', ylab = expression(L(V[iter])))
matplot(error.history,      type = 'l', log = 'y', xlab = 'iter',
        main = 'error', ylab = 'error')
matplot(tau.history,        type = 'l', log = 'y', xlab = 'iter',
        main = 'tau', ylab = 'tau')
matplot(true.error.history, type = 'l', log = 'y', xlab = 'iter',
        main = 'true error', ylab = 'true error')

Reading log files.

The run-time tests (upcoming further tests) are creating log files saved in tmp/. These log files are CSV files (actually TSV) with a header storing the test results. Depending on the test the files may contain different data. As an example we use the run-time test logs which store in each line the dataset, the used method as well as the error (actual error of estimated B against real B) and the time. For reading and analyzing the data see the following example.

# Load log as `data.frame`
log <- read.csv('tmp/test0.log', sep = '\t')
# Create a error boxplot grouped by dataset.
boxplot(error ~ dataset, log)

# Overview
for (ds.name in paste0('M', seq(5))) {
    ds <- subset(log, dataset == ds.name, select = c('method', 'dataset', 'time', 'error'))
    print(summary(ds))
}

Environments and variable lookup.

In the following a view simple examples of how R searches for variables. In addition we manipulate function closures to alter the search path in variable lookup and outer scope variable manipulation.

droids <- "These aren't the droids you're looking for."

search <- function() {
    print(droids)
}

trooper.seeks <- function() {
    droids <- c("R2-D2", "C-3PO")
    search()
}

jedi.seeks <- function() {
    droids <- c("R2-D2", "C-3PO")
    environment(search) <- environment()
    search()
}

trooper.seeks()
# [1] "These aren't the droids you're looking for."
jedi.seeks()
# [1] "R2-D2", "C-3PO"

The next example illustrates how to write (without local copies) to variables outside the functions local environment.

counting <- function() {
    count <<- count + 1 # Note the `<<-` assignment.
}

(function() {
    environment(counting) <- environment()
    count <- 0

    for (i in 1:10) {
        counting()
    }

    return(count)
})()

(function () {
    closure <- new.env()
    environment(counting) <- closure
    assign("count", 0, envir = closure)

    for (i in 1:10) {
        counting()
    }

    return(closure$count)
})()

Another example for the usage of do.call where the evaluation of parameters is illustrated (example taken (and altered) from ?do.call).

## examples of where objects will be found.
A <- "A.Global"
f <- function(x) print(paste("f.new", x))
env <- new.env()
assign("A", "A.new", envir = env)
assign("f", f, envir = env)
f <- function(x) print(paste("f.Global", x))
f(A)                                          # f.Global A.Global
do.call("f", list(A))                         # f.Global A.Global
do.call("f", list(A), envir = env)            # f.new A.Global
do.call(f,   list(A), envir = env)            # f.Global A.Global
do.call("f", list(quote(A)), envir = env)     # f.new A.new
do.call(f,   list(quote(A)), envir = env)     # f.Global A.new
do.call("f", list(as.name("A")), envir = env) # f.new A.new
do.call("f", list(as.name("A")), envir = env) # f.new A.new

Performance benchmarks

In this section alternative implementations of simple algorithms are compared for there performance.

Computing the trace of a matrix multiplication.

library(microbenchmark)

A <- matrix(runif(120), 12, 10)

# Check correctnes and benckmark performance.
stopifnot(
    all.equal(
        sum(diag(t(A) %*% A)), sum(diag(crossprod(A, A)))
    ),
    all.equal(
        sum(diag(t(A) %*% A)), sum(A * A)
    )
)
microbenchmark(
    MM = sum(diag(t(A) %*% A)),
    cross = sum(diag(crossprod(A, A))),
    elem = sum(A * A)
)
# Unit: nanoseconds
#   expr  min     lq    mean median     uq   max neval
#     MM 4232 4570.0 5138.81   4737 4956.0 40308   100
#  cross 2523 2774.5 2974.93   2946 3114.5  5078   100
#   elem  582  762.5  973.02    834  964.0 12945   100
n <- 200
M <- matrix(runif(n^2), n, n)

dnorm2 <- function(x) exp(-0.5 * x^2) / sqrt(2 * pi)

stopifnot(
    all.equal(dnorm(M), dnorm2(M))
)
microbenchmark(
    dnorm = dnorm(M),
    dnorm2 = dnorm2(M),
    exp = exp(-0.5 * M^2) # without scaling -> irrelevant for usage
)
# Unit: microseconds
#   expr     min      lq     mean   median       uq      max neval
#  dnorm 841.503 843.811 920.7828 855.7505 912.4720 2405.587   100
# dnorm2 543.510 580.319 629.5321 597.8540 607.3795 2603.763   100
#    exp 502.083 535.943 577.2884 548.3745 561.3280 2113.220   100

Using crosspord()

p <- 12
q <- 10
V <- matrix(runif(p * q), p, q)

stopifnot(
    all.equal(V %*% t(V), tcrossprod(V)),
    all.equal(V %*% t(V), tcrossprod(V, V))
)
microbenchmark(
    V %*% t(V),
    tcrossprod(V),
    tcrossprod(V, V)
)
# Unit: microseconds
#              expr   min     lq    mean median     uq    max neval
#        V %*% t(V) 2.293 2.6335 2.94673 2.7375 2.9060 19.592   100
#     tcrossprod(V) 1.148 1.2475 1.86173 1.3440 1.4650 30.688   100
#  tcrossprod(V, V) 1.003 1.1575 1.28451 1.2400 1.3685  2.742   100

Recycling vs. Sweep

(n <- 200)
(p <- 12)
(q <- 10)
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
V <- matrix(rnorm(p * q), p, q)
vecS <- runif(n * (n - 1) / 2)

stopifnot(
    all.equal((X_diff %*% V) * rep(vecS, q),
              sweep(X_diff %*% V, 1, vecS, `*`)),
    all.equal((X_diff %*% V) * rep(vecS, q),
              (X_diff %*% V) * vecS)
)
microbenchmark(
    rep = (X_diff %*% V) * rep(vecS, q),
    sweep = sweep(X_diff %*% V, 1, vecS, `*`, check.margin = FALSE),
    recycle = (X_diff %*% V) * vecS
)
# Unit: microseconds
#     expr      min        lq     mean    median       uq      max neval
#      rep  851.723  988.3655 1575.639 1203.6385 1440.578 18999.23   100
#    sweep 1313.177 1522.4010 2355.269 1879.2605 2065.399 18783.24   100
#  recycle  719.001  786.1265 1157.285  881.8825 1163.202 19091.79   100

Scaled crossprod with matrix multiplication order.

(n <- 200)
(p <- 12)
(q <- 10)
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
V <- matrix(rnorm(p * q), p, q)
vecS <- runif(n * (n - 1) / 2)

ref <- crossprod(X_diff, X_diff * vecS) %*% V
stopifnot(
    all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS)),
    all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS))
)
microbenchmark(
    inner = crossprod(X_diff, X_diff * vecS) %*% V,
    outer = crossprod(X_diff, (X_diff %*% V) * vecS)
)
# Unit: microseconds
#   expr      min       lq     mean    median       uq       max neval
#  inner  789.065  867.939 1683.812  987.9375 1290.055 16800.265   100
#  outer 1141.479 1216.929 1404.702 1317.7315 1582.800  2531.766   100

Fast dist matrix computation (aka. row sum of squares).

library(microbenchmark)
library(CVE)

(n <- 200)
(N <- n * (n - 1) / 2)
(p <- 12)
M <- matrix(runif(N * p), N, p)

stopifnot(
    all.equal(rowSums(M^2), rowSums.c(M^2)),
    all.equal(rowSums(M^2), rowSquareSums.c(M))
)
microbenchmark(
    sums = rowSums(M^2),
    sums.c = rowSums.c(M^2),
    sqSums.c = rowSquareSums.c(M)
)
# Unit: microseconds
#      expr     min       lq      mean    median       uq      max neval
#      sums 666.311 1051.036 1612.3100 1139.0065 1547.657 13940.97   100
#    sums.c 342.647  672.453 1009.9109  740.6255 1224.715 13765.90   100
#  sqSums.c 115.325  142.128  175.6242  153.4645  169.678   759.87   100

Using Rprof() for performance.

The standard method for profiling where an algorithm is spending its time is with Rprof().

path <- '../tmp/R.prof' # path to profiling file
Rprof(path)
cve.res <- cve.call(X, Y, k = k)
Rprof(NULL)
(prof <- summaryRprof(path)) # Summarise results

Note: consider to run gc() before measuring, aka cleaning up by explicitly calling the garbage collector.