450 lines
15 KiB
Markdown
450 lines
15 KiB
Markdown
# TODOs
|
|
Doc:
|
|
- [x] Stiefel (instead of Stiefl)
|
|
- [x] Return value description (`@returs`)
|
|
- [x] DESCRIPTION
|
|
- [x] Maintainer
|
|
- [x] Author
|
|
- [x] Volume
|
|
- [x] Description (from Paper) and Ref.
|
|
- [x] Ref paper in doc
|
|
- [ ] Data set descriptions and augmentations.
|
|
- [x] Demonstration of the `Logger` function usage (Demo file or so, ...)
|
|
- [ ] Update Paper (to new version / version consistent with current code!)
|
|
- [ ] Reference Paper in DESCRIPTION file (in Description or specific tag)
|
|
- [ ] Split `cve` and `cve.call` docs.
|
|
- [ ] "Copy" form `dr` package (specifically `dr.directions` -> description)
|
|
- [ ] Document `C` code.
|
|
|
|
Methods to be implemented:
|
|
- [x] simple
|
|
- [x] weighted
|
|
- [x] momentum
|
|
- [x] weighted with momentum
|
|
|
|
Performance:
|
|
- [x] Pure C implementation.
|
|
- [NOT Feasible] Stochastic Version
|
|
- [NOT Feasible] Gradient Approximations (using Algebraic Software for alternative Loss function formulations and gradient optimizations)
|
|
- [NOT Sufficient] Alternative Kernels for reducing samples
|
|
- [ ] (To Be further investigated) "Kronecker" optimization
|
|
|
|
Features (functions):
|
|
- [x] Initial `V.init` parameter (only ONE try, ignore number of `attempts` parameter)
|
|
- [x] `basis.cve` list of estimated `B`s (with `k` supplied, only `B`)
|
|
- [x] `directions.cve` Projected `X` given `k`
|
|
- [x] `predict.cve` using `mars` for predicting responses given new data.
|
|
- [x] `predict.dim.cve` Cross-validation or `aov` (in stats package) or "elbow" estimation
|
|
- [x] `plot.elbow`
|
|
- [x] `summary`
|
|
- [ ] Consider `cor.test` for dimension selection
|
|
- [x] Check for user interrupt (`R_CheckUserInterrupt`)
|
|
|
|
Changes:
|
|
- [x] New `estimate.bandwidth` implementation.
|
|
(h = 2 * (tr(\Sigma) / p) * (6/5 * n^(-1 / (4 + k)))^2,
|
|
\Sigma = 1/n * (X-mean)'(X-mean))
|
|
|
|
Errors:
|
|
- [x] `CVE_C` compare to `CVE_legacy`.
|
|
- [x] fix: `predict.dim` not found.
|
|
|
|
# Development
|
|
## Build and install.
|
|
To build the package the `devtools` package is used. This also provides `roxygen2` which is used for documentation and automatic creation of the `NAMESPACE` file.
|
|
```R
|
|
setwd("./CVE_R") # Set path to the package root.
|
|
library(devtools) # Load required `devtools` package.
|
|
document() # Create `.Rd` files and write `NAMESPACE`.
|
|
```
|
|
Next the package needs to be build, therefore (if pure `R` package, aka. `C/C++`, `Fortran`, ... code) just do the following.
|
|
```bash
|
|
R CMD build CVE_C; R CMD INSTALL CVE_0.2.tar.gz
|
|
```
|
|
Then we are ready for using the package.
|
|
As well as building the `NAMESPACE` and `*.Rd` files using `devtools` (`roxygen2`) the following resembles an entire build pipeline including checks.
|
|
```bash
|
|
R -q -e 'library(devtools); setwd("CVE_C"); pkgbuild::compile_dll(); document(); pkgbuild::clean_dll()'
|
|
R CMD build CVE_C; R CMD check CVE_0.2.tar.gz;
|
|
R CMD INSTALL CVE_0.2.tar.gz
|
|
```
|
|
## Build and install from within `R`.
|
|
An alternative approach is the following.
|
|
```R
|
|
## Installing CVE (C implementation)
|
|
(setwd('~/Projects/CVE/CVE_C'))
|
|
# equiv to Rcpp::compileAttributes().
|
|
library(devtools)
|
|
pkgbuild::compile_dll() # required for packages with C/C++ code
|
|
document() # See bug: https://github.com/stan-dev/rstantools/issues/52
|
|
pkgbuild::clean_dll()
|
|
(path <- build(vignettes = FALSE))
|
|
install.packages(path, repos = NULL, type = "source")
|
|
library(CVE)
|
|
```
|
|
**Note: I only recommend this approach during development.**
|
|
|
|
# Package Structure
|
|
|
|
## Demos
|
|
A demo is an `.R` file that lives in `demo/`. Demos are like examples but tend to
|
|
be longer. Instead of focusing on a single function, they show how to weave
|
|
together multiple functions to solve a problem.
|
|
|
|
You list and access demos with `demo()`:
|
|
* Show all available demos: `demo()`.
|
|
* Show all demos in a package: `demo(package = "CVE")`.
|
|
* Run a specific demo: `demo("runtime_test", package = "CVE")`.
|
|
* Find a demo: `system.file("demo", "runtime_test.R", package = "CVE")`.
|
|
|
|
Each demo must be listed in `demo/00Index` in the following form:
|
|
`demo-name Demo description`.
|
|
The demo name is the name of the file without the extension,
|
|
e.g. `demo/runtime_test.R` becomes `runtime_test`.
|
|
|
|
By default the demo ask for human input for each plot: "Hit to see next plot".
|
|
This behavior can be overridden by adding `devAskNewPage(ask = FALSE)` to
|
|
the demo file. You can add pauses by adding:
|
|
`readline("press any key to continue")`.
|
|
|
|
**Note**: Demos are not automatically tested by `R CMD check`. This means that they
|
|
can easily break without your knowledge.
|
|
|
|
|
|
# General Notes for Source Code analysis
|
|
## Search in multiple files.
|
|
Using the Linux `grep` program with the parameters `-rnw` and specifying a include files filter like the following example.
|
|
```bash
|
|
grep --include=*\.{c,h,R} -rnw '.' -e "sweep"
|
|
```
|
|
searches in all `C` source and header files as well as `R` source files for the term _sweep_.
|
|
|
|
## Recursive directory compare with colored structure (more or less).
|
|
```bash
|
|
diff -r CVE_R/ CVE_C/ | grep -E "^([<>]|[^<>].*)"
|
|
```
|
|
|
|
## Parsing `bash` script parameters.
|
|
```bash
|
|
usage="$0 [-v|--verbose] [-n|--dry-run] [(-s|--stack-size) <size>] [-h|--help] [-- [p1, [p2, ...]]]"
|
|
verbose=false
|
|
help=false
|
|
dry_run=false
|
|
stack_size=0
|
|
|
|
while [ $# -gt 0 ]; do
|
|
case "$1" in
|
|
-v | --verbose ) verbose=true; shift ;;
|
|
-n | --dry-run ) dry_run=true; shift ;;
|
|
-s | --stack-size ) stack_size="$2"; shift; shift ;;
|
|
-h | --help ) echo $usage; exit ;; # On help print usage and exit.
|
|
-- ) shift; break ;; # Break param "parsing".
|
|
* ) echo $usage >&2; exit 1 ;; # Print usage and exit with failure.
|
|
esac
|
|
done
|
|
|
|
echo verbose=$verbose
|
|
echo dry_run=$dry_run
|
|
echo stack_size=$stack_size
|
|
```
|
|
|
|
# Analysis
|
|
## Logging (a `cve` run).
|
|
To log `loss`, `error` (estimated) the true error (error of current estimated `B` against the true `B`) or even the step size one can use the `logger` parameter. A `logger` is a function that gets the current `environment` of the CVE optimization methods (__do not alter this environment, only read from it__). This can be used to create logs like in the following example.
|
|
```R
|
|
library(CVE)
|
|
|
|
# Setup histories.
|
|
(epochs <- 50)
|
|
(attempts <- 10)
|
|
loss.history <- matrix(NA, epochs + 1, attempts)
|
|
error.history <- matrix(NA, epochs + 1, attempts)
|
|
tau.history <- matrix(NA, epochs + 1, attempts)
|
|
true.error.history <- matrix(NA, epochs + 1, attempts)
|
|
|
|
# Create a dataset
|
|
ds <- dataset("M1")
|
|
X <- ds$X
|
|
Y <- ds$Y
|
|
B <- ds$B # the true `B`
|
|
(k <- ncol(ds$B))
|
|
|
|
# True projection matrix.
|
|
P <- B %*% solve(t(B) %*% B) %*% t(B)
|
|
# Define the logger for the `cve()` method.
|
|
logger <- function(env) {
|
|
# Note the `<<-` assignement!
|
|
loss.history[env$epoch + 1, env$attempt] <<- env$loss
|
|
error.history[env$epoch + 1, env$attempt] <<- env$error
|
|
tau.history[env$epoch + 1, env$attempt] <<- env$tau
|
|
# Compute true error by comparing to the true `B`
|
|
B.est <- null(env$V) # Function provided by CVE
|
|
P.est <- B.est %*% solve(t(B.est) %*% B.est) %*% t(B.est)
|
|
true.error <- norm(P - P.est, 'F') / sqrt(2 * k)
|
|
true.error.history[env$epoch + 1, env$attempt] <<- true.error
|
|
}
|
|
# Performa SDR
|
|
dr <- cve(Y ~ X, k = k, logger = logger, epochs = epochs, attempts = attempts)
|
|
# Plot history's
|
|
par(mfrow = c(2, 2))
|
|
matplot(loss.history, type = 'l', log = 'y', xlab = 'iter',
|
|
main = 'loss', ylab = expression(L(V[iter])))
|
|
matplot(error.history, type = 'l', log = 'y', xlab = 'iter',
|
|
main = 'error', ylab = 'error')
|
|
matplot(tau.history, type = 'l', log = 'y', xlab = 'iter',
|
|
main = 'tau', ylab = 'tau')
|
|
matplot(true.error.history, type = 'l', log = 'y', xlab = 'iter',
|
|
main = 'true error', ylab = 'true error')
|
|
```
|
|
|
|
## Reading log files.
|
|
The run-time tests (upcoming further tests) are creating log files saved in `tmp/`. These log files are `CSV` files (actually `TSV`) with a header storing the test results. Depending on the test the files may contain different data. As an example we use the run-time test logs which store in each line the `dataset`, the used `method` as well as the `error` (actual error of estimated `B` against real `B`) and the `time`. For reading and analyzing the data see the following example.
|
|
```R
|
|
# Load log as `data.frame`
|
|
log <- read.csv('tmp/test0.log', sep = '\t')
|
|
# Create a error boxplot grouped by dataset.
|
|
boxplot(error ~ dataset, log)
|
|
|
|
# Overview
|
|
for (ds.name in paste0('M', seq(5))) {
|
|
ds <- subset(log, dataset == ds.name, select = c('method', 'dataset', 'time', 'error'))
|
|
print(summary(ds))
|
|
}
|
|
```
|
|
|
|
## Environments and variable lookup.
|
|
In the following a view simple examples of how `R` searches for variables.
|
|
In addition we manipulate function closures to alter the search path in variable lookup and outer scope variable manipulation.
|
|
```R
|
|
droids <- "These aren't the droids you're looking for."
|
|
|
|
search <- function() {
|
|
print(droids)
|
|
}
|
|
|
|
trooper.seeks <- function() {
|
|
droids <- c("R2-D2", "C-3PO")
|
|
search()
|
|
}
|
|
|
|
jedi.seeks <- function() {
|
|
droids <- c("R2-D2", "C-3PO")
|
|
environment(search) <- environment()
|
|
search()
|
|
}
|
|
|
|
trooper.seeks()
|
|
# [1] "These aren't the droids you're looking for."
|
|
jedi.seeks()
|
|
# [1] "R2-D2", "C-3PO"
|
|
```
|
|
|
|
The next example illustrates how to write (without local copies) to variables outside the functions local environment.
|
|
```R
|
|
counting <- function() {
|
|
count <<- count + 1 # Note the `<<-` assignment.
|
|
}
|
|
|
|
(function() {
|
|
environment(counting) <- environment()
|
|
count <- 0
|
|
|
|
for (i in 1:10) {
|
|
counting()
|
|
}
|
|
|
|
return(count)
|
|
})()
|
|
|
|
(function () {
|
|
closure <- new.env()
|
|
environment(counting) <- closure
|
|
assign("count", 0, envir = closure)
|
|
|
|
for (i in 1:10) {
|
|
counting()
|
|
}
|
|
|
|
return(closure$count)
|
|
})()
|
|
```
|
|
|
|
Another example for the usage of `do.call` where the evaluation of parameters is illustrated (example taken (and altered) from `?do.call`).
|
|
```R
|
|
## examples of where objects will be found.
|
|
A <- "A.Global"
|
|
f <- function(x) print(paste("f.new", x))
|
|
env <- new.env()
|
|
assign("A", "A.new", envir = env)
|
|
assign("f", f, envir = env)
|
|
f <- function(x) print(paste("f.Global", x))
|
|
f(A) # f.Global A.Global
|
|
do.call("f", list(A)) # f.Global A.Global
|
|
do.call("f", list(A), envir = env) # f.new A.Global
|
|
do.call(f, list(A), envir = env) # f.Global A.Global
|
|
do.call("f", list(quote(A)), envir = env) # f.new A.new
|
|
do.call(f, list(quote(A)), envir = env) # f.Global A.new
|
|
do.call("f", list(as.name("A")), envir = env) # f.new A.new
|
|
do.call("f", list(as.name("A")), envir = env) # f.new A.new
|
|
```
|
|
|
|
# Performance benchmarks
|
|
In this section alternative implementations of simple algorithms are compared for there performance.
|
|
|
|
### Computing the trace of a matrix multiplication.
|
|
```R
|
|
library(microbenchmark)
|
|
|
|
A <- matrix(runif(120), 12, 10)
|
|
|
|
# Check correctnes and benckmark performance.
|
|
stopifnot(
|
|
all.equal(
|
|
sum(diag(t(A) %*% A)), sum(diag(crossprod(A, A)))
|
|
),
|
|
all.equal(
|
|
sum(diag(t(A) %*% A)), sum(A * A)
|
|
)
|
|
)
|
|
microbenchmark(
|
|
MM = sum(diag(t(A) %*% A)),
|
|
cross = sum(diag(crossprod(A, A))),
|
|
elem = sum(A * A)
|
|
)
|
|
# Unit: nanoseconds
|
|
# expr min lq mean median uq max neval
|
|
# MM 4232 4570.0 5138.81 4737 4956.0 40308 100
|
|
# cross 2523 2774.5 2974.93 2946 3114.5 5078 100
|
|
# elem 582 762.5 973.02 834 964.0 12945 100
|
|
```
|
|
|
|
```R
|
|
n <- 200
|
|
M <- matrix(runif(n^2), n, n)
|
|
|
|
dnorm2 <- function(x) exp(-0.5 * x^2) / sqrt(2 * pi)
|
|
|
|
stopifnot(
|
|
all.equal(dnorm(M), dnorm2(M))
|
|
)
|
|
microbenchmark(
|
|
dnorm = dnorm(M),
|
|
dnorm2 = dnorm2(M),
|
|
exp = exp(-0.5 * M^2) # without scaling -> irrelevant for usage
|
|
)
|
|
# Unit: microseconds
|
|
# expr min lq mean median uq max neval
|
|
# dnorm 841.503 843.811 920.7828 855.7505 912.4720 2405.587 100
|
|
# dnorm2 543.510 580.319 629.5321 597.8540 607.3795 2603.763 100
|
|
# exp 502.083 535.943 577.2884 548.3745 561.3280 2113.220 100
|
|
```
|
|
|
|
### Using `crosspord()`
|
|
```R
|
|
p <- 12
|
|
q <- 10
|
|
V <- matrix(runif(p * q), p, q)
|
|
|
|
stopifnot(
|
|
all.equal(V %*% t(V), tcrossprod(V)),
|
|
all.equal(V %*% t(V), tcrossprod(V, V))
|
|
)
|
|
microbenchmark(
|
|
V %*% t(V),
|
|
tcrossprod(V),
|
|
tcrossprod(V, V)
|
|
)
|
|
# Unit: microseconds
|
|
# expr min lq mean median uq max neval
|
|
# V %*% t(V) 2.293 2.6335 2.94673 2.7375 2.9060 19.592 100
|
|
# tcrossprod(V) 1.148 1.2475 1.86173 1.3440 1.4650 30.688 100
|
|
# tcrossprod(V, V) 1.003 1.1575 1.28451 1.2400 1.3685 2.742 100
|
|
```
|
|
|
|
### Recycling vs. Sweep
|
|
```R
|
|
(n <- 200)
|
|
(p <- 12)
|
|
(q <- 10)
|
|
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
|
|
V <- matrix(rnorm(p * q), p, q)
|
|
vecS <- runif(n * (n - 1) / 2)
|
|
|
|
stopifnot(
|
|
all.equal((X_diff %*% V) * rep(vecS, q),
|
|
sweep(X_diff %*% V, 1, vecS, `*`)),
|
|
all.equal((X_diff %*% V) * rep(vecS, q),
|
|
(X_diff %*% V) * vecS)
|
|
)
|
|
microbenchmark(
|
|
rep = (X_diff %*% V) * rep(vecS, q),
|
|
sweep = sweep(X_diff %*% V, 1, vecS, `*`, check.margin = FALSE),
|
|
recycle = (X_diff %*% V) * vecS
|
|
)
|
|
# Unit: microseconds
|
|
# expr min lq mean median uq max neval
|
|
# rep 851.723 988.3655 1575.639 1203.6385 1440.578 18999.23 100
|
|
# sweep 1313.177 1522.4010 2355.269 1879.2605 2065.399 18783.24 100
|
|
# recycle 719.001 786.1265 1157.285 881.8825 1163.202 19091.79 100
|
|
```
|
|
### Scaled `crossprod` with matrix multiplication order.
|
|
```R
|
|
(n <- 200)
|
|
(p <- 12)
|
|
(q <- 10)
|
|
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
|
|
V <- matrix(rnorm(p * q), p, q)
|
|
vecS <- runif(n * (n - 1) / 2)
|
|
|
|
ref <- crossprod(X_diff, X_diff * vecS) %*% V
|
|
stopifnot(
|
|
all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS)),
|
|
all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS))
|
|
)
|
|
microbenchmark(
|
|
inner = crossprod(X_diff, X_diff * vecS) %*% V,
|
|
outer = crossprod(X_diff, (X_diff %*% V) * vecS)
|
|
)
|
|
# Unit: microseconds
|
|
# expr min lq mean median uq max neval
|
|
# inner 789.065 867.939 1683.812 987.9375 1290.055 16800.265 100
|
|
# outer 1141.479 1216.929 1404.702 1317.7315 1582.800 2531.766 100
|
|
```
|
|
|
|
### Fast dist matrix computation (aka. row sum of squares).
|
|
```R
|
|
library(microbenchmark)
|
|
library(CVE)
|
|
|
|
(n <- 200)
|
|
(N <- n * (n - 1) / 2)
|
|
(p <- 12)
|
|
M <- matrix(runif(N * p), N, p)
|
|
|
|
stopifnot(
|
|
all.equal(rowSums(M^2), rowSums.c(M^2)),
|
|
all.equal(rowSums(M^2), rowSquareSums.c(M))
|
|
)
|
|
microbenchmark(
|
|
sums = rowSums(M^2),
|
|
sums.c = rowSums.c(M^2),
|
|
sqSums.c = rowSquareSums.c(M)
|
|
)
|
|
# Unit: microseconds
|
|
# expr min lq mean median uq max neval
|
|
# sums 666.311 1051.036 1612.3100 1139.0065 1547.657 13940.97 100
|
|
# sums.c 342.647 672.453 1009.9109 740.6255 1224.715 13765.90 100
|
|
# sqSums.c 115.325 142.128 175.6242 153.4645 169.678 759.87 100
|
|
```
|
|
|
|
## Using `Rprof()` for performance.
|
|
The standard method for profiling where an algorithm is spending its time is with `Rprof()`.
|
|
```R
|
|
path <- '../tmp/R.prof' # path to profiling file
|
|
Rprof(path)
|
|
cve.res <- cve.call(X, Y, k = k)
|
|
Rprof(NULL)
|
|
(prof <- summaryRprof(path)) # Summarise results
|
|
```
|
|
**Note: consider to run `gc()` before measuring**, aka cleaning up by explicitly calling the garbage collector.
|