# TODOs

Doc:

- [x] Stiefel (instead of Stiefl)
- [x] Return value description (`@return`)
- [x] DESCRIPTION
- [x] Maintainer
- [x] Author
- [x] Volume
- [x] Description (from paper) and Ref.
- [x] Ref paper in doc
- [ ] Data set descriptions and augmentations.
- [x] Demonstration of the `logger` function usage (demo file or similar)

Methods to be implemented:

- [x] simple
- [x] weighted
- [x] momentum
- [x] weighted with momentum

Performance:

- [x] Pure C implementation.
- [NOT Feasible] Stochastic version
- [NOT Feasible] Gradient approximations (using algebraic software for alternative loss function formulations and gradient optimizations)
- [NOT Sufficient] Alternative kernels for reducing samples
- [ ] (To be further investigated) "Kronecker" optimization

Features (functions):

- [x] Initial `V.init` parameter (only ONE try, ignoring the `attempts` parameter)
- [x] `basis.cve` list of estimated `B`s (with `k` supplied, only `B`)
- [x] `directions.cve` projected `X` given `k`
- [ ] `predict.cve` using `mars` for predicting responses given new data.
- [ ] `predict.dim.cve` cross-validation or `aov` (in the stats package) or "elbow" estimation
- [x] `plot.elbow`
- [x] `summary`

Changes:

- [-] New `estimate.bandwidth` implementation.
    (`h = 2 * (tr(Sigma) / p) * (6/5 * n^(-1 / (4 + k)))^2` with `Sigma = 1/n * (X - mean)' (X - mean)`)

# Package Structure

## Demos

A demo is an `.R` file that lives in `demo/`. Demos are like examples but tend to be longer. Instead of focusing on a single function, they show how to weave together multiple functions to solve a problem.

You list and access demos with `demo()`:

* Show all available demos: `demo()`.
* Show all demos in a package: `demo(package = "CVE")`.
* Run a specific demo: `demo("runtime_test", package = "CVE")`.
* Find a demo: `system.file("demo", "runtime_test.R", package = "CVE")`.

Each demo must be listed in `demo/00Index` in the following form: `demo-name  Demo description`. The demo name is the name of the file without the extension, e.g. `demo/runtime_test.R` becomes `runtime_test`.

By default a demo asks for human input before each plot ("Hit <Return> to see next plot"). This behaviour can be overridden by adding `devAskNewPage(ask = FALSE)` to the demo file. You can add pauses by adding `readline("press any key to continue")`.

**Note**: Demos are not automatically tested by `R CMD check`. This means that they can easily break without your knowledge.

# General Notes for Source Code Analysis

## Search in multiple files.

Using the Linux `grep` program with the parameters `-rnw` and an include-file filter, the following example

```bash
grep --include=*\.{c,h,R} -rnw '.' -e "sweep"
```

searches all `C` source and header files as well as `R` source files for the term _sweep_.

## Recursive directory comparison with colored structure (more or less).

```bash
diff -r CVE_R/ CVE_C/ | grep -E "^([<>]|[^<>].*)"
```

## Parsing `bash` script parameters.

```bash
usage="$0 [-v|--verbose] [-n|--dry-run] [(-s|--stack-size) <size>] [-h|--help] [-- [p1, [p2, ...]]]"

verbose=false
help=false
dry_run=false
stack_size=0

while [ $# -gt 0 ]; do
    case "$1" in
        -v | --verbose )    verbose=true;    shift ;;
        -n | --dry-run )    dry_run=true;    shift ;;
        -s | --stack-size ) stack_size="$2"; shift; shift ;;
        -h | --help )       echo $usage; exit ;;        # On help print usage and exit.
        -- )                shift; break ;;             # Break out of parameter "parsing".
        * )                 echo $usage >&2; exit 1 ;;  # Print usage and exit with failure.
    esac
done

echo verbose=$verbose
echo dry_run=$dry_run
echo stack_size=$stack_size
```
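For illustration, a hypothetical invocation of the above script (saved, say, as `parse_args.sh`; the file name and arguments are made up) and the values it echoes:

```bash
# Hypothetical call; script name and arguments are only for illustration.
./parse_args.sh -v --stack-size 1024 -- input_a.csv input_b.csv
# verbose=true
# dry_run=false
# stack_size=1024
# The parameters after `--` (input_a.csv input_b.csv) remain in "$@" for further processing.
```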
# Development

## Build and install.

To build the package the `devtools` package is used. This also provides `roxygen2`, which is used for documentation and automatic creation of the `NAMESPACE` file.

```R
setwd("./CVE_R")    # Set path to the package root.
library(devtools)   # Load required `devtools` package.
document()          # Create `.Rd` files and write `NAMESPACE`.
```

Next the package needs to be built. For a pure `R` package (i.e. without `C/C++`, `Fortran`, ... code) just do the following.

```bash
R CMD build CVE_R
R CMD INSTALL CVE_0.1.tar.gz
```

Then we are ready to use the package.

```R
library(CVE)
help(package = "CVE")
```

## Build and install from within `R`.

An alternative approach is the following.

```R
setwd('./CVE_R')
getwd()

library(devtools)
document()
# No vignettes to build but "inst/doc/" is required!
(path <- build(vignettes = FALSE))
install.packages(path, repos = NULL, type = "source")
```

**Note: I only recommend this approach during development.**

# Analysing

## Logging (a `cve` run).

To log the `loss`, the (estimated) `error`, the true error (error of the current estimated `B` against the true `B`) or even the step size, one can use the `logger` parameter. A `logger` is a function that gets the current `environment` of the CVE optimization method (__do not alter this environment, only read from it__). This can be used to create logs like in the following example.

```R
library(CVE)

# Setup histories.
(epochs <- 50)
(attempts <- 10)
loss.history <- matrix(NA, epochs + 1, attempts)
error.history <- matrix(NA, epochs + 1, attempts)
tau.history <- matrix(NA, epochs + 1, attempts)
true.error.history <- matrix(NA, epochs + 1, attempts)

# Create a dataset
ds <- dataset("M1")
X <- ds$X
Y <- ds$Y
B <- ds$B # the true `B`
(k <- ncol(ds$B))

# True projection matrix.
P <- B %*% solve(t(B) %*% B) %*% t(B)

# Define the logger for the `cve()` method.
logger <- function(env) {
    # Note the `<<-` assignment!
    loss.history[env$epoch + 1, env$attempt] <<- env$loss
    error.history[env$epoch + 1, env$attempt] <<- env$error
    tau.history[env$epoch + 1, env$attempt] <<- env$tau

    # Compute true error by comparing to the true `B`
    B.est <- null(env$V) # Function provided by CVE
    P.est <- B.est %*% solve(t(B.est) %*% B.est) %*% t(B.est)
    true.error <- norm(P - P.est, 'F') / sqrt(2 * k)
    true.error.history[env$epoch + 1, env$attempt] <<- true.error
}

# Perform SDR
dr <- cve(Y ~ X, k = k, logger = logger, epochs = epochs, attempts = attempts)

# Plot histories
par(mfrow = c(2, 2))
matplot(loss.history, type = 'l', log = 'y', xlab = 'iter',
        main = 'loss', ylab = expression(L(V[iter])))
matplot(error.history, type = 'l', log = 'y', xlab = 'iter',
        main = 'error', ylab = 'error')
matplot(tau.history, type = 'l', log = 'y', xlab = 'iter',
        main = 'tau', ylab = 'tau')
matplot(true.error.history, type = 'l', log = 'y', xlab = 'iter',
        main = 'true error', ylab = 'true error')
```
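The logger does not have to accumulate histories; a plain progress printer works as well. The following minimal sketch only reads the same environment fields (`attempt`, `epoch`, `loss`, `error`) used above; the function name `console.logger` is just for illustration.

```R
# Minimal console logger (sketch); relies on the same environment fields
# (`attempt`, `epoch`, `loss`, `error`) as the history logger above.
console.logger <- function(env) {
    cat(sprintf("attempt %2d, epoch %3d: loss = %.6f, error = %.6f\n",
                env$attempt, env$epoch, env$loss, env$error))
}

dr <- cve(Y ~ X, k = k, logger = console.logger,
          epochs = epochs, attempts = attempts)
```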
## Reading log files.

The runtime tests (and upcoming further tests) create log files saved in `tmp/`. These log files are `CSV` files (actually `TSV`) with a header, storing the test results. Depending on the test the files may contain different data. As an example we use the runtime test logs, which store in each line the `dataset`, the used `method` as well as the `error` (actual error of the estimated `B` against the real `B`) and the `time`. For reading and analysing the data see the following example.

```R
# Load log as `data.frame`
log <- read.csv('tmp/test0.log', sep = '\t')

# Create an error boxplot grouped by dataset.
boxplot(error ~ dataset, log)

# Overview
for (ds.name in paste0('M', seq(5))) {
    ds <- subset(log, dataset == ds.name,
                 select = c('method', 'dataset', 'time', 'error'))
    print(summary(ds))
}
```

## Environments and variable lookup.

In the following a few simple examples of how `R` searches for variables. In addition we manipulate function closures to alter the search path in variable lookup and outer-scope variable manipulation.

```R
droids <- "These aren't the droids you're looking for."

search <- function() {
    print(droids)
}

trooper.seeks <- function() {
    droids <- c("R2-D2", "C-3PO")
    search()
}

jedi.seeks <- function() {
    droids <- c("R2-D2", "C-3PO")
    environment(search) <- environment()
    search()
}

trooper.seeks()
# [1] "These aren't the droids you're looking for."
jedi.seeks()
# [1] "R2-D2" "C-3PO"
```

The next example illustrates how to write (without local copies) to variables outside the function's local environment.

```R
counting <- function() {
    count <<- count + 1 # Note the `<<-` assignment.
}

(function() {
    environment(counting) <- environment()
    count <- 0
    for (i in 1:10) {
        counting()
    }
    return(count)
})()

(function() {
    closure <- new.env()
    environment(counting) <- closure
    assign("count", 0, envir = closure)
    for (i in 1:10) {
        counting()
    }
    return(closure$count)
})()
```

Another example for the usage of `do.call`, where the evaluation of parameters is illustrated (example taken (and altered) from `?do.call`).

```R
## examples of where objects will be found.
A <- "A.Global"
f <- function(x) print(paste("f.new", x))
env <- new.env()
assign("A", "A.new", envir = env)
assign("f", f, envir = env)
f <- function(x) print(paste("f.Global", x))

f(A)                                          # f.Global A.Global
do.call("f", list(A))                         # f.Global A.Global
do.call("f", list(A), envir = env)            # f.new A.Global
do.call(f, list(A), envir = env)              # f.Global A.Global
do.call("f", list(quote(A)), envir = env)     # f.new A.new
do.call(f, list(quote(A)), envir = env)       # f.Global A.new
do.call("f", list(as.name("A")), envir = env) # f.new A.new
```

# Performance benchmarks

In this section alternative implementations of simple algorithms are compared for their performance.

### Computing the trace of a matrix multiplication.

```R
library(microbenchmark)

A <- matrix(runif(120), 12, 10)

# Check correctness and benchmark performance.
stopifnot(
    all.equal(
        sum(diag(t(A) %*% A)),
        sum(diag(crossprod(A, A)))
    ),
    all.equal(
        sum(diag(t(A) %*% A)),
        sum(A * A)
    )
)

microbenchmark(
    MM = sum(diag(t(A) %*% A)),
    cross = sum(diag(crossprod(A, A))),
    elem = sum(A * A)
)
# Unit: nanoseconds
#   expr  min     lq    mean median     uq   max neval
#     MM 4232 4570.0 5138.81   4737 4956.0 40308   100
#  cross 2523 2774.5 2974.93   2946 3114.5  5078   100
#   elem  582  762.5  973.02    834  964.0 12945   100
```

```R
n <- 200
M <- matrix(runif(n^2), n, n)

dnorm2 <- function(x) exp(-0.5 * x^2) / sqrt(2 * pi)

stopifnot(
    all.equal(dnorm(M), dnorm2(M))
)

microbenchmark(
    dnorm  = dnorm(M),
    dnorm2 = dnorm2(M),
    exp    = exp(-0.5 * M^2) # without scaling -> irrelevant for usage
)
# Unit: microseconds
#    expr     min      lq     mean   median       uq      max neval
#   dnorm 841.503 843.811 920.7828 855.7505 912.4720 2405.587   100
#  dnorm2 543.510 580.319 629.5321 597.8540 607.3795 2603.763   100
#     exp 502.083 535.943 577.2884 548.3745 561.3280 2113.220   100
```

### Using `crossprod()`

```R
p <- 12
q <- 10
V <- matrix(runif(p * q), p, q)

stopifnot(
    all.equal(V %*% t(V), tcrossprod(V)),
    all.equal(V %*% t(V), tcrossprod(V, V))
)

microbenchmark(
    V %*% t(V),
    tcrossprod(V),
    tcrossprod(V, V)
)
# Unit: microseconds
#              expr   min     lq    mean median     uq    max neval
#        V %*% t(V) 2.293 2.6335 2.94673 2.7375 2.9060 19.592   100
#     tcrossprod(V) 1.148 1.2475 1.86173 1.3440 1.4650 30.688   100
#  tcrossprod(V, V) 1.003 1.1575 1.28451 1.2400 1.3685  2.742   100
```

### Recycling vs. Sweep

```R
(n <- 200)
(p <- 12)
(q <- 10)
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
V <- matrix(rnorm(p * q), p, q)
vecS <- runif(n * (n - 1) / 2)

stopifnot(
    all.equal((X_diff %*% V) * rep(vecS, q),
              sweep(X_diff %*% V, 1, vecS, `*`)),
    all.equal((X_diff %*% V) * rep(vecS, q),
              (X_diff %*% V) * vecS)
)

microbenchmark(
    rep     = (X_diff %*% V) * rep(vecS, q),
    sweep   = sweep(X_diff %*% V, 1, vecS, `*`, check.margin = FALSE),
    recycle = (X_diff %*% V) * vecS
)
# Unit: microseconds
#     expr      min        lq     mean    median       uq      max neval
#      rep  851.723  988.3655 1575.639 1203.6385 1440.578 18999.23   100
#    sweep 1313.177 1522.4010 2355.269 1879.2605 2065.399 18783.24   100
#  recycle  719.001  786.1265 1157.285  881.8825 1163.202 19091.79   100
```

### Scaled `crossprod` with matmul order.

```R
(n <- 200)
(p <- 12)
(q <- 10)
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
V <- matrix(rnorm(p * q), p, q)
vecS <- runif(n * (n - 1) / 2)

ref <- crossprod(X_diff, X_diff * vecS) %*% V
stopifnot(
    all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS))
)

microbenchmark(
    inner = crossprod(X_diff, X_diff * vecS) %*% V,
    outer = crossprod(X_diff, (X_diff %*% V) * vecS)
)
# Unit: microseconds
#   expr      min       lq     mean    median       uq       max neval
#  inner  789.065  867.939 1683.812  987.9375 1290.055 16800.265   100
#  outer 1141.479 1216.929 1404.702 1317.7315 1582.800  2531.766   100
```

### Fast dist matrix computation (aka. row sum of squares).

```R
library(microbenchmark)
library(CVE)

(n <- 200)
(N <- n * (n - 1) / 2)
(p <- 12)
M <- matrix(runif(N * p), N, p)

stopifnot(
    all.equal(rowSums(M^2), rowSums.c(M^2)),
    all.equal(rowSums(M^2), rowSquareSums.c(M))
)

microbenchmark(
    sums     = rowSums(M^2),
    sums.c   = rowSums.c(M^2),
    sqSums.c = rowSquareSums.c(M)
)
# Unit: microseconds
#      expr     min       lq      mean    median       uq      max neval
#      sums 666.311 1051.036 1612.3100 1139.0065 1547.657 13940.97   100
#    sums.c 342.647  672.453 1009.9109  740.6255 1224.715 13765.90   100
#  sqSums.c 115.325  142.128  175.6242  153.4645  169.678   759.87   100
```

## Using `Rprof()` for performance.
The standard method for profiling where an algorithm is spending its time is `Rprof()`.

```R
path <- '../tmp/R.prof' # path to profiling file
Rprof(path)
cve.res <- cve.call(X, Y, k = k)
Rprof(NULL)
(prof <- summaryRprof(path)) # Summarise results
```

**Note: consider running `gc()` before measuring**, i.e. cleaning up by explicitly calling the garbage collector.
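Putting both together, a minimal sketch (reusing `path`, `X`, `Y` and `k` from above) that also inspects the tables returned by `summaryRprof()`:

```R
gc()                # explicitly collect garbage first, so cleanup does not distort the profile
Rprof(path)         # start the sampling profiler, writing samples to `path`
cve.res <- cve.call(X, Y, k = k)
Rprof(NULL)         # stop profiling

prof <- summaryRprof(path)
head(prof$by.self)  # functions ranked by time spent in the function body itself
head(prof$by.total) # functions ranked by total time, including callees
```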