Daniel Kapla
300fc11f3f
wip: datasets, notes, ... |
||
---|---|---|
CVE_C | ||
CVE_R | ||
CVE_legacy | ||
LaTeX | ||
benchmark | ||
README.md | ||
runtime_test.R | ||
test.R |
README.md
TODOs
Doc:
- Stiefel (instead of Stiefl)
- Return value description (
@returs
) - DESCRIPTION
- Maintainer
- Author
- Volume
- Description (from Paper) and Ref.
- Ref paper in doc
- Data set descriptions and augmentations.
- Demonstration of the
Logger
function usage (Demo file or so, ...) - Update Paper (to new version / version consistent with current code!)
- Reference Paper in DESCRIPTION file (in Description or specific tag)
- Split
cve
andcve.call
docs. - "Copy" form
dr
package (specificallydr.directions
-> description) - Document
C
code.
Methods to be implemented:
- simple
- weighted
- momentum
- weighted with momentum
Performance:
- Pure C implementation.
- [NOT Feasible] Stochastic Version
- [NOT Feasible] Gradient Approximations (using Algebraic Software for alternative Loss function formulations and gradient optimizations)
- [NOT Sufficient] Alternative Kernels for reducing samples
- (To Be further investigated) "Kronecker" optimization
- Implement "Kronecker" optimization
Features (functions):
- Initial
V.init
parameter (only ONE try, ignore number ofattempts
parameter) basis.cve
list of estimatedB
s (withk
supplied, onlyB
)directions.cve
ProjectedX
givenk
predict.cve
usingmars
for predicting responses given new data.predict.dim.cve
Cross-validation oraov
(in stats package) or "elbow" estimationplot.elbow
summary
- Consider
cor.test
for dimension selection - Check for user interrupt (
R_CheckUserInterrupt
)
Changes:
- New
estimate.bandwidth
implementation. (h = 2 * (tr(\Sigma) / p) * (6/5 * n^(-1 / (4 + k)))^2, \Sigma = 1/n * (X-mean)'(X-mean))
Errors:
CVE_C
compare toCVE_legacy
.- fix:
predict.dim
not found. - fix/check: error computation for break condition (in
cve.c
)
Development
Build and install.
To build the package the devtools
package is used. This also provides roxygen2
which is used for documentation and automatic creation of the NAMESPACE
file.
setwd("./CVE_R") # Set path to the package root.
library(devtools) # Load required `devtools` package.
document() # Create `.Rd` files and write `NAMESPACE`.
Next the package needs to be build, therefore (if pure R
package, aka. C/C++
, Fortran
, ... code) just do the following.
R CMD build CVE_C; R CMD INSTALL CVE_0.2.tar.gz
Then we are ready for using the package.
As well as building the NAMESPACE
and *.Rd
files using devtools
(roxygen2
) the following resembles an entire build pipeline including checks.
R -q -e 'library(devtools); setwd("CVE_C"); pkgbuild::compile_dll(); document(); pkgbuild::clean_dll()'
R CMD build CVE_C; R CMD check CVE_0.2.tar.gz;
R CMD INSTALL CVE_0.2.tar.gz
Build and install from within R
.
An alternative approach is the following.
## Installing CVE (C implementation)
(setwd('~/Projects/CVE/CVE_C'))
# equiv to Rcpp::compileAttributes().
library(devtools)
pkgbuild::compile_dll() # required for packages with C/C++ code
document() # See bug: https://github.com/stan-dev/rstantools/issues/52
pkgbuild::clean_dll()
(path <- build(vignettes = FALSE))
install.packages(path, repos = NULL, type = "source")
library(CVE)
Note: I only recommend this approach during development.
Package Structure
Demos
A demo is an .R
file that lives in demo/
. Demos are like examples but tend to
be longer. Instead of focusing on a single function, they show how to weave
together multiple functions to solve a problem.
You list and access demos with demo()
:
- Show all available demos:
demo()
. - Show all demos in a package:
demo(package = "CVE")
. - Run a specific demo:
demo("runtime_test", package = "CVE")
. - Find a demo:
system.file("demo", "runtime_test.R", package = "CVE")
.
Each demo must be listed in demo/00Index
in the following form:
demo-name Demo description
.
The demo name is the name of the file without the extension,
e.g. demo/runtime_test.R
becomes runtime_test
.
By default the demo ask for human input for each plot: "Hit to see next plot".
This behavior can be overridden by adding devAskNewPage(ask = FALSE)
to
the demo file. You can add pauses by adding:
readline("press any key to continue")
.
Note: Demos are not automatically tested by R CMD check
. This means that they
can easily break without your knowledge.
General Notes for Source Code analysis
Search in multiple files.
Using the Linux grep
program with the parameters -rnw
and specifying a include files filter like the following example.
grep --include=*\.{c,h,R} -rnw '.' -e "sweep"
searches in all C
source and header files as well as R
source files for the term sweep.
Recursive directory compare with colored structure (more or less).
diff -r CVE_R/ CVE_C/ | grep -E "^([<>]|[^<>].*)"
Parsing bash
script parameters.
usage="$0 [-v|--verbose] [-n|--dry-run] [(-s|--stack-size) <size>] [-h|--help] [-- [p1, [p2, ...]]]"
verbose=false
help=false
dry_run=false
stack_size=0
while [ $# -gt 0 ]; do
case "$1" in
-v | --verbose ) verbose=true; shift ;;
-n | --dry-run ) dry_run=true; shift ;;
-s | --stack-size ) stack_size="$2"; shift; shift ;;
-h | --help ) echo $usage; exit ;; # On help print usage and exit.
-- ) shift; break ;; # Break param "parsing".
* ) echo $usage >&2; exit 1 ;; # Print usage and exit with failure.
esac
done
echo verbose=$verbose
echo dry_run=$dry_run
echo stack_size=$stack_size
Analysis
Logging (a cve
run).
To log loss
, error
(estimated) the true error (error of current estimated B
against the true B
) or even the step size one can use the logger
parameter. A logger
is a function that gets the current environment
of the CVE optimization methods (do not alter this environment, only read from it). This can be used to create logs like in the following example.
library(CVE)
# Setup histories.
(epochs <- 50)
(attempts <- 10)
loss.history <- matrix(NA, epochs + 1, attempts)
error.history <- matrix(NA, epochs + 1, attempts)
tau.history <- matrix(NA, epochs + 1, attempts)
true.error.history <- matrix(NA, epochs + 1, attempts)
# Create a dataset
ds <- dataset("M1")
X <- ds$X
Y <- ds$Y
B <- ds$B # the true `B`
(k <- ncol(ds$B))
# True projection matrix.
P <- B %*% solve(t(B) %*% B) %*% t(B)
# Define the logger for the `cve()` method.
logger <- function(env) {
# Note the `<<-` assignement!
loss.history[env$epoch + 1, env$attempt] <<- env$loss
error.history[env$epoch + 1, env$attempt] <<- env$error
tau.history[env$epoch + 1, env$attempt] <<- env$tau
# Compute true error by comparing to the true `B`
B.est <- null(env$V) # Function provided by CVE
P.est <- B.est %*% solve(t(B.est) %*% B.est) %*% t(B.est)
true.error <- norm(P - P.est, 'F') / sqrt(2 * k)
true.error.history[env$epoch + 1, env$attempt] <<- true.error
}
# Performa SDR
dr <- cve(Y ~ X, k = k, logger = logger, epochs = epochs, attempts = attempts)
# Plot history's
par(mfrow = c(2, 2))
matplot(loss.history, type = 'l', log = 'y', xlab = 'iter',
main = 'loss', ylab = expression(L(V[iter])))
matplot(error.history, type = 'l', log = 'y', xlab = 'iter',
main = 'error', ylab = 'error')
matplot(tau.history, type = 'l', log = 'y', xlab = 'iter',
main = 'tau', ylab = 'tau')
matplot(true.error.history, type = 'l', log = 'y', xlab = 'iter',
main = 'true error', ylab = 'true error')
Reading log files.
The run-time tests (upcoming further tests) are creating log files saved in tmp/
. These log files are CSV
files (actually TSV
) with a header storing the test results. Depending on the test the files may contain different data. As an example we use the run-time test logs which store in each line the dataset
, the used method
as well as the error
(actual error of estimated B
against real B
) and the time
. For reading and analyzing the data see the following example.
# Load log as `data.frame`
log <- read.csv('tmp/test0.log', sep = '\t')
# Create a error boxplot grouped by dataset.
boxplot(error ~ dataset, log)
# Overview
for (ds.name in paste0('M', seq(5))) {
ds <- subset(log, dataset == ds.name, select = c('method', 'dataset', 'time', 'error'))
print(summary(ds))
}
Environments and variable lookup.
In the following a view simple examples of how R
searches for variables.
In addition we manipulate function closures to alter the search path in variable lookup and outer scope variable manipulation.
droids <- "These aren't the droids you're looking for."
search <- function() {
print(droids)
}
trooper.seeks <- function() {
droids <- c("R2-D2", "C-3PO")
search()
}
jedi.seeks <- function() {
droids <- c("R2-D2", "C-3PO")
environment(search) <- environment()
search()
}
trooper.seeks()
# [1] "These aren't the droids you're looking for."
jedi.seeks()
# [1] "R2-D2", "C-3PO"
The next example illustrates how to write (without local copies) to variables outside the functions local environment.
counting <- function() {
count <<- count + 1 # Note the `<<-` assignment.
}
(function() {
environment(counting) <- environment()
count <- 0
for (i in 1:10) {
counting()
}
return(count)
})()
(function () {
closure <- new.env()
environment(counting) <- closure
assign("count", 0, envir = closure)
for (i in 1:10) {
counting()
}
return(closure$count)
})()
Another example for the usage of do.call
where the evaluation of parameters is illustrated (example taken (and altered) from ?do.call
).
## examples of where objects will be found.
A <- "A.Global"
f <- function(x) print(paste("f.new", x))
env <- new.env()
assign("A", "A.new", envir = env)
assign("f", f, envir = env)
f <- function(x) print(paste("f.Global", x))
f(A) # f.Global A.Global
do.call("f", list(A)) # f.Global A.Global
do.call("f", list(A), envir = env) # f.new A.Global
do.call(f, list(A), envir = env) # f.Global A.Global
do.call("f", list(quote(A)), envir = env) # f.new A.new
do.call(f, list(quote(A)), envir = env) # f.Global A.new
do.call("f", list(as.name("A")), envir = env) # f.new A.new
do.call("f", list(as.name("A")), envir = env) # f.new A.new
Performance benchmarks
In this section alternative implementations of simple algorithms are compared for there performance.
Computing the trace of a matrix multiplication.
library(microbenchmark)
A <- matrix(runif(120), 12, 10)
# Check correctnes and benckmark performance.
stopifnot(
all.equal(
sum(diag(t(A) %*% A)), sum(diag(crossprod(A, A)))
),
all.equal(
sum(diag(t(A) %*% A)), sum(A * A)
)
)
microbenchmark(
MM = sum(diag(t(A) %*% A)),
cross = sum(diag(crossprod(A, A))),
elem = sum(A * A)
)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# MM 4232 4570.0 5138.81 4737 4956.0 40308 100
# cross 2523 2774.5 2974.93 2946 3114.5 5078 100
# elem 582 762.5 973.02 834 964.0 12945 100
n <- 200
M <- matrix(runif(n^2), n, n)
dnorm2 <- function(x) exp(-0.5 * x^2) / sqrt(2 * pi)
stopifnot(
all.equal(dnorm(M), dnorm2(M))
)
microbenchmark(
dnorm = dnorm(M),
dnorm2 = dnorm2(M),
exp = exp(-0.5 * M^2) # without scaling -> irrelevant for usage
)
# Unit: microseconds
# expr min lq mean median uq max neval
# dnorm 841.503 843.811 920.7828 855.7505 912.4720 2405.587 100
# dnorm2 543.510 580.319 629.5321 597.8540 607.3795 2603.763 100
# exp 502.083 535.943 577.2884 548.3745 561.3280 2113.220 100
Using crosspord()
p <- 12
q <- 10
V <- matrix(runif(p * q), p, q)
stopifnot(
all.equal(V %*% t(V), tcrossprod(V)),
all.equal(V %*% t(V), tcrossprod(V, V))
)
microbenchmark(
V %*% t(V),
tcrossprod(V),
tcrossprod(V, V)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# V %*% t(V) 2.293 2.6335 2.94673 2.7375 2.9060 19.592 100
# tcrossprod(V) 1.148 1.2475 1.86173 1.3440 1.4650 30.688 100
# tcrossprod(V, V) 1.003 1.1575 1.28451 1.2400 1.3685 2.742 100
Recycling vs. Sweep
(n <- 200)
(p <- 12)
(q <- 10)
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
V <- matrix(rnorm(p * q), p, q)
vecS <- runif(n * (n - 1) / 2)
stopifnot(
all.equal((X_diff %*% V) * rep(vecS, q),
sweep(X_diff %*% V, 1, vecS, `*`)),
all.equal((X_diff %*% V) * rep(vecS, q),
(X_diff %*% V) * vecS)
)
microbenchmark(
rep = (X_diff %*% V) * rep(vecS, q),
sweep = sweep(X_diff %*% V, 1, vecS, `*`, check.margin = FALSE),
recycle = (X_diff %*% V) * vecS
)
# Unit: microseconds
# expr min lq mean median uq max neval
# rep 851.723 988.3655 1575.639 1203.6385 1440.578 18999.23 100
# sweep 1313.177 1522.4010 2355.269 1879.2605 2065.399 18783.24 100
# recycle 719.001 786.1265 1157.285 881.8825 1163.202 19091.79 100
Scaled crossprod
with matrix multiplication order.
(n <- 200)
(p <- 12)
(q <- 10)
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
V <- matrix(rnorm(p * q), p, q)
vecS <- runif(n * (n - 1) / 2)
ref <- crossprod(X_diff, X_diff * vecS) %*% V
stopifnot(
all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS)),
all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS))
)
microbenchmark(
inner = crossprod(X_diff, X_diff * vecS) %*% V,
outer = crossprod(X_diff, (X_diff %*% V) * vecS)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# inner 789.065 867.939 1683.812 987.9375 1290.055 16800.265 100
# outer 1141.479 1216.929 1404.702 1317.7315 1582.800 2531.766 100
Fast dist matrix computation (aka. row sum of squares).
library(microbenchmark)
library(CVE)
(n <- 200)
(N <- n * (n - 1) / 2)
(p <- 12)
M <- matrix(runif(N * p), N, p)
stopifnot(
all.equal(rowSums(M^2), rowSums.c(M^2)),
all.equal(rowSums(M^2), rowSquareSums.c(M))
)
microbenchmark(
sums = rowSums(M^2),
sums.c = rowSums.c(M^2),
sqSums.c = rowSquareSums.c(M)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# sums 666.311 1051.036 1612.3100 1139.0065 1547.657 13940.97 100
# sums.c 342.647 672.453 1009.9109 740.6255 1224.715 13765.90 100
# sqSums.c 115.325 142.128 175.6242 153.4645 169.678 759.87 100
Using Rprof()
for performance.
The standard method for profiling where an algorithm is spending its time is with Rprof()
.
path <- '../tmp/R.prof' # path to profiling file
Rprof(path)
cve.res <- cve.call(X, Y, k = k)
Rprof(NULL)
(prof <- summaryRprof(path)) # Summarise results
Note: consider to run gc()
before measuring, aka cleaning up by explicitly calling the garbage collector.