TODOs
Doc:
- Stiefel (instead of Stiefl)
- Return value description (@return)
- DESCRIPTION
- Maintainer
- Author
- Volume
- Description (from Paper) and Ref.
- Ref paper in doc
- Data set descriptions and augmentations.
- Demonstration of the Logger function usage (demo file or so, ...)
Methods to be implemented:
- simple
- weighted
- momentum
- weighted with momentum
Performance:
- Pure C implementation.
- [NOT Feasible] Stochastic Version
- [NOT Feasible] Gradient Approximations (using Algebraic Software for alternative Loss function formulations and gradient optimizations)
- [NOT Sufficient] Alternative Kernels for reducing samples
- (To Be further investigated) "Kronecker" optimization
Features (functions):
- Initial V.init parameter (only ONE try, ignore number of attempts parameter)
- basis.cve: list of estimated Bs (with k supplied, only B)
- directions.cve: projected X given k
- predict.cve: using mars for predicting responses given new data.
- predict.dim.cve: cross-validation or aov (in stats package) or "elbow" estimation
- plot.elbow
- summary
Changes:
- [-] New estimate.bandwidth implementation. (h = 2 * (tr(\Sigma) / p) * (6/5 * n^(-1 / (4 + k)))^2, \Sigma = 1/n * (X-mean)'(X-mean))
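The bandwidth formula in the last item can be written out in a few lines of R. This is only a minimal sketch of the formula as stated above, not the package's estimate.bandwidth; the function name bandwidth_sketch and the simulated X are assumptions made for illustration. It uses the identity tr(X_c' X_c) = sum(X_c^2) to avoid forming \Sigma explicitly.

```r
# Hypothetical helper, written directly from the formula in the TODO list
# (NOT the package's `estimate.bandwidth`).
bandwidth_sketch <- function(X, k) {
    n <- nrow(X)
    p <- ncol(X)
    # Center the columns of X, i.e. compute (X - mean).
    X.centered <- scale(X, center = TRUE, scale = FALSE)
    # tr(Sigma) = tr(X_c' X_c) / n = sum(X_c^2) / n, no need to build Sigma.
    tr.Sigma <- sum(X.centered^2) / n
    2 * (tr.Sigma / p) * (6 / 5 * n^(-1 / (4 + k)))^2
}

# Simulated data, only to have something to call the sketch with.
set.seed(42)
X <- matrix(rnorm(100 * 5), 100, 5)
h <- bandwidth_sketch(X, k = 2)
print(h)
```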
Package Structure
Demos
A demo is an .R file that lives in demo/. Demos are like examples but tend to
be longer. Instead of focussing on a single function, they show how to weave
together multiple functions to solve a problem.
You list and access demos with demo():
- Show all available demos: demo().
- Show all demos in a package: demo(package = "CVE").
- Run a specific demo: demo("runtime_test", package = "CVE").
- Find a demo: system.file("demo", "runtime_test.R", package = "CVE").
Each demo must be listed in demo/00Index in the following form:
demo-name Demo description.
The demo name is the name of the file without the extension,
e.g. demo/runtime_test.R becomes runtime_test.
By default the demo asks for human input for each plot: "Hit <Return> to see next plot:".
This behaviour can be overridden by adding devAskNewPage(ask = FALSE) to
the demo file. You can add pauses by adding:
readline("press any key to continue").
Note: Demos are not automatically tested by R CMD check. This means that they
can easily break without your knowledge.
General Notes for Source Code analysis
Search in multiple files.
Use the Linux grep program with the parameters -rnw and an include filter for the file types, as in the following example.
grep --include=*\.{c,h,R} -rnw '.' -e "sweep"
This searches all C source and header files as well as R source files for the term sweep.
Recursive directory comparison with colored structure (more or less).
diff -r CVE_R/ CVE_C/ | grep -E "^([<>]|[^<>].*)"
Parsing bash script parameters.
usage="$0 [-v|--verbose] [-n|--dry-run] [(-s|--stack-size) <size>] [-h|--help] [-- [p1, [p2, ...]]]"
verbose=false
help=false
dry_run=false
stack_size=0
while [ $# -gt 0 ]; do
case "$1" in
-v | --verbose ) verbose=true; shift ;;
-n | --dry-run ) dry_run=true; shift ;;
-s | --stack-size ) stack_size="$2"; shift; shift ;;
-h | --help ) echo "$usage"; exit ;; # On help print usage and exit.
-- ) shift; break ;; # Break param "parsing".
* ) echo "$usage" >&2; exit 1 ;; # Print usage and exit with failure.
esac
done
echo verbose=$verbose
echo dry_run=$dry_run
echo stack_size=$stack_size
Development
Build and install.
To build the package the devtools package is used. This also provides roxygen2, which is used for documentation and automatic creation of the NAMESPACE file.
setwd("./CVE_R") # Set path to the package root.
library(devtools) # Load required `devtools` package.
document() # Create `.Rd` files and write `NAMESPACE`.
Next the package needs to be built. For a pure R package (i.e. one without C/C++, Fortran, ... code) just do the following.
R CMD build CVE_R
R CMD INSTALL CVE_0.1.tar.gz
Then we are ready for using the package.
library(CVE)
help(package = "CVE")
Build and install from within R.
An alternative approach is the following.
setwd('./CVE_R')
getwd()
library(devtools)
document()
# No vignettes to build but "inst/doc/" is required!
(path <- build(vignettes = FALSE))
install.packages(path, repos = NULL, type = "source")
Note: I only recommend this approach during development.
Analysing
Logging (a cve run).
To log the loss, the error (estimated), the true error (error of the current estimated B against the true B) or even the step size, one can use the logger parameter. A logger is a function that receives the current environment of the CVE optimization method (do not alter this environment, only read from it). This can be used to create logs as in the following example.
library(CVE)
# Setup histories.
(epochs <- 50)
(attempts <- 10)
loss.history <- matrix(NA, epochs + 1, attempts)
error.history <- matrix(NA, epochs + 1, attempts)
tau.history <- matrix(NA, epochs + 1, attempts)
true.error.history <- matrix(NA, epochs + 1, attempts)
# Create a dataset
ds <- dataset("M1")
X <- ds$X
Y <- ds$Y
B <- ds$B # the true `B`
(k <- ncol(ds$B))
# True projection matrix.
P <- B %*% solve(t(B) %*% B) %*% t(B)
# Define the logger for the `cve()` method.
logger <- function(env) {
# Note the `<<-` assignment!
loss.history[env$epoch + 1, env$attempt] <<- env$loss
error.history[env$epoch + 1, env$attempt] <<- env$error
tau.history[env$epoch + 1, env$attempt] <<- env$tau
# Compute true error by comparing to the true `B`
B.est <- null(env$V) # Function provided by CVE
P.est <- B.est %*% solve(t(B.est) %*% B.est) %*% t(B.est)
true.error <- norm(P - P.est, 'F') / sqrt(2 * k)
true.error.history[env$epoch + 1, env$attempt] <<- true.error
}
# Perform SDR
dr <- cve(Y ~ X, k = k, logger = logger, epochs = epochs, attempts = attempts)
# Plot histories
par(mfrow = c(2, 2))
matplot(loss.history, type = 'l', log = 'y', xlab = 'iter',
main = 'loss', ylab = expression(L(V[iter])))
matplot(error.history, type = 'l', log = 'y', xlab = 'iter',
main = 'error', ylab = 'error')
matplot(tau.history, type = 'l', log = 'y', xlab = 'iter',
main = 'tau', ylab = 'tau')
matplot(true.error.history, type = 'l', log = 'y', xlab = 'iter',
main = 'true error', ylab = 'true error')
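The callback mechanism itself can be tried without the package. The following is a minimal sketch, assuming a made-up optimizer toy_optimize (gradient descent on x^2) that stands in for cve(); only the pattern of passing environment() to the logger and writing to a history with `<<-` mirrors the example above.

```r
# Hypothetical stand-in for an optimizer that accepts a `logger` callback.
toy_optimize <- function(epochs, logger = NULL) {
    x <- 10                           # current iterate
    for (epoch in 0:epochs) {
        loss <- x^2                   # toy loss at the current iterate
        if (is.function(logger)) {
            logger(environment())     # hand over the local environment
        }
        x <- x - 0.25 * 2 * x         # gradient step with step size 0.25
    }
    x
}

epochs <- 5
loss.history <- rep(NA_real_, epochs + 1)

# The logger only reads from `env` and writes to the outer history.
logger <- function(env) {
    loss.history[env$epoch + 1] <<- env$loss
}

x.final <- toy_optimize(epochs, logger)
print(loss.history)
```

Passing the environment instead of individual values keeps the optimizer's signature stable: the logger decides which local variables it reads.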
Reading log files.
The runtime tests (and upcoming further tests) create log files saved in tmp/. These log files are CSV files (actually TSV) with a header, storing the test results. Depending on the test the files may contain different data. As an example we use the runtime test logs, which store in each line the dataset, the method used, as well as the error (actual error of the estimated B against the real B) and the time. For reading and analysing the data see the following example.
# Load log as `data.frame`
log <- read.csv('tmp/test0.log', sep = '\t')
# Create an error boxplot grouped by dataset.
boxplot(error ~ dataset, log)
# Overview
for (ds.name in paste0('M', seq(5))) {
ds <- subset(log, dataset == ds.name, select = c('method', 'dataset', 'time', 'error'))
print(summary(ds))
}
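A per-method overview can be produced in a similar way with aggregate(). The sketch below is self-contained: it writes a tiny log with made-up values to a temporary file (these are not real test results), assuming the column layout described above.

```r
# Made-up log content, only for illustration (NOT real test results).
path <- tempfile(fileext = ".log")
writeLines(c(
    "dataset\tmethod\terror\ttime",
    "M1\tsimple\t0.20\t1.5",
    "M1\tweighted\t0.10\t2.5",
    "M2\tsimple\t0.40\t1.0",
    "M2\tweighted\t0.30\t3.0"
), path)

# `read.delim` defaults to `sep = "\t"` and `header = TRUE`.
runtime.log <- read.delim(path)

# Mean error and mean time per method.
agg <- aggregate(cbind(error, time) ~ method, data = runtime.log, FUN = mean)
print(agg)
```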
Environments and variable lookup.
In the following, a few simple examples of how R searches for variables.
In addition we manipulate function closures to alter the search path in variable lookup and outer scope variable manipulation.
droids <- "These aren't the droids you're looking for."
search <- function() {
print(droids)
}
trooper.seeks <- function() {
droids <- c("R2-D2", "C-3PO")
search()
}
jedi.seeks <- function() {
droids <- c("R2-D2", "C-3PO")
environment(search) <- environment()
search()
}
trooper.seeks()
# [1] "These aren't the droids you're looking for."
jedi.seeks()
# [1] "R2-D2" "C-3PO"
The next example illustrates how to write (without local copies) to variables outside the function's local environment.
counting <- function() {
count <<- count + 1 # Note the `<<-` assignment.
}
(function() {
environment(counting) <- environment()
count <- 0
for (i in 1:10) {
counting()
}
return(count)
})()
(function () {
closure <- new.env()
environment(counting) <- closure
assign("count", 0, envir = closure)
for (i in 1:10) {
counting()
}
return(closure$count)
})()
Another example for the usage of do.call where the evaluation of parameters is illustrated (example taken (and altered) from ?do.call).
## examples of where objects will be found.
A <- "A.Global"
f <- function(x) print(paste("f.new", x))
env <- new.env()
assign("A", "A.new", envir = env)
assign("f", f, envir = env)
f <- function(x) print(paste("f.Global", x))
f(A) # f.Global A.Global
do.call("f", list(A)) # f.Global A.Global
do.call("f", list(A), envir = env) # f.new A.Global
do.call(f, list(A), envir = env) # f.Global A.Global
do.call("f", list(quote(A)), envir = env) # f.new A.new
do.call(f, list(quote(A)), envir = env) # f.Global A.new
do.call("f", list(as.name("A")), envir = env) # f.new A.new
Performance benchmarks
In this section alternative implementations of simple algorithms are compared for their performance.
Computing the trace of a matrix multiplication.
library(microbenchmark)
A <- matrix(runif(120), 12, 10)
# Check correctness and benchmark performance.
stopifnot(
all.equal(
sum(diag(t(A) %*% A)), sum(diag(crossprod(A, A)))
),
all.equal(
sum(diag(t(A) %*% A)), sum(A * A)
)
)
microbenchmark(
MM = sum(diag(t(A) %*% A)),
cross = sum(diag(crossprod(A, A))),
elem = sum(A * A)
)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# MM 4232 4570.0 5138.81 4737 4956.0 40308 100
# cross 2523 2774.5 2974.93 2946 3114.5 5078 100
# elem 582 762.5 973.02 834 964.0 12945 100
n <- 200
M <- matrix(runif(n^2), n, n)
dnorm2 <- function(x) exp(-0.5 * x^2) / sqrt(2 * pi)
stopifnot(
all.equal(dnorm(M), dnorm2(M))
)
microbenchmark(
dnorm = dnorm(M),
dnorm2 = dnorm2(M),
exp = exp(-0.5 * M^2) # without scaling -> irrelevant for usage
)
# Unit: microseconds
# expr min lq mean median uq max neval
# dnorm 841.503 843.811 920.7828 855.7505 912.4720 2405.587 100
# dnorm2 543.510 580.319 629.5321 597.8540 607.3795 2603.763 100
# exp 502.083 535.943 577.2884 548.3745 561.3280 2113.220 100
Using crossprod()
p <- 12
q <- 10
V <- matrix(runif(p * q), p, q)
stopifnot(
all.equal(V %*% t(V), tcrossprod(V)),
all.equal(V %*% t(V), tcrossprod(V, V))
)
microbenchmark(
V %*% t(V),
tcrossprod(V),
tcrossprod(V, V)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# V %*% t(V) 2.293 2.6335 2.94673 2.7375 2.9060 19.592 100
# tcrossprod(V) 1.148 1.2475 1.86173 1.3440 1.4650 30.688 100
# tcrossprod(V, V) 1.003 1.1575 1.28451 1.2400 1.3685 2.742 100
Recycling vs. Sweep
(n <- 200)
(p <- 12)
(q <- 10)
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
V <- matrix(rnorm(p * q), p, q)
vecS <- runif(n * (n - 1) / 2)
stopifnot(
all.equal((X_diff %*% V) * rep(vecS, q),
sweep(X_diff %*% V, 1, vecS, `*`)),
all.equal((X_diff %*% V) * rep(vecS, q),
(X_diff %*% V) * vecS)
)
microbenchmark(
rep = (X_diff %*% V) * rep(vecS, q),
sweep = sweep(X_diff %*% V, 1, vecS, `*`, check.margin = FALSE),
recycle = (X_diff %*% V) * vecS
)
# Unit: microseconds
# expr min lq mean median uq max neval
# rep 851.723 988.3655 1575.639 1203.6385 1440.578 18999.23 100
# sweep 1313.177 1522.4010 2355.269 1879.2605 2065.399 18783.24 100
# recycle 719.001 786.1265 1157.285 881.8825 1163.202 19091.79 100
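Why plain recycling is a valid replacement for sweep() here can be seen on a tiny example. R stores matrices column by column, so a vector whose length equals the number of rows is recycled once per column, which scales row i by the i-th vector entry. The sketch below uses small made-up values (M, s) purely for illustration.

```r
M <- matrix(1:6, nrow = 3, ncol = 2)   # columns: (1, 2, 3) and (4, 5, 6)
s <- c(10, 100, 1000)                  # one scale factor per row

# `s` is recycled along each column, so row i is multiplied by s[i].
scaled <- M * s
print(scaled)

# The same row scaling written with `sweep` and as a matrix product.
by.sweep <- sweep(M, 1, s, `*`)
by.diag <- diag(s) %*% M
```

This only works when the vector length matches the number of rows; for scaling columns the vector would have to be expanded (e.g. with rep(..., each = nrow(M))) or applied via sweep() with margin 2.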
Scaled crossprod with matmul order.
(n <- 200)
(p <- 12)
(q <- 10)
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
V <- matrix(rnorm(p * q), p, q)
vecS <- runif(n * (n - 1) / 2)
ref <- crossprod(X_diff, X_diff * vecS) %*% V
stopifnot(
all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS))
)
microbenchmark(
inner = crossprod(X_diff, X_diff * vecS) %*% V,
outer = crossprod(X_diff, (X_diff %*% V) * vecS)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# inner 789.065 867.939 1683.812 987.9375 1290.055 16800.265 100
# outer 1141.479 1216.929 1404.702 1317.7315 1582.800 2531.766 100
Fast dist matrix computation (aka. row sum of squares).
library(microbenchmark)
library(CVE)
(n <- 200)
(N <- n * (n - 1) / 2)
(p <- 12)
M <- matrix(runif(N * p), N, p)
stopifnot(
all.equal(rowSums(M^2), rowSums.c(M^2)),
all.equal(rowSums(M^2), rowSquareSums.c(M))
)
microbenchmark(
sums = rowSums(M^2),
sums.c = rowSums.c(M^2),
sqSums.c = rowSquareSums.c(M)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# sums 666.311 1051.036 1612.3100 1139.0065 1547.657 13940.97 100
# sums.c 342.647 672.453 1009.9109 740.6255 1224.715 13765.90 100
# sqSums.c 115.325 142.128 175.6242 153.4645 169.678 759.87 100
Using Rprof() for performance.
The standard method for profiling where an algorithm is spending its time is Rprof().
path <- '../tmp/R.prof' # path to profiling file
Rprof(path)
cve.res <- cve.call(X, Y, k = k)
Rprof(NULL)
(prof <- summaryRprof(path)) # Summarise results
Note: consider running gc() before measuring, i.e. cleaning up by explicitly calling the garbage collector.