parent
d5cd6075a9
commit
c437c60e84
473
README.md
473
README.md
|
@ -1,451 +1,34 @@
|
||||||
# TODOs
|
# Installation Instructions
|
||||||
Doc:
|
The package depends on the `mda` package (see: [mda @ cran.r-project](https://cran.r-project.org/web/packages/mda/index.html). Therefore the first step is to install the `mda` package
|
||||||
- [x] Stiefel (instead of Stiefl)
|
|
||||||
- [x] Return value description (`@returs`)
|
|
||||||
- [x] DESCRIPTION
|
|
||||||
- [x] Maintainer
|
|
||||||
- [x] Author
|
|
||||||
- [x] Volume
|
|
||||||
- [x] Description (from Paper) and Ref.
|
|
||||||
- [x] Ref paper in doc
|
|
||||||
- [ ] Data set descriptions and augmentations.
|
|
||||||
- [x] Demonstration of the `Logger` function usage (Demo file or so, ...)
|
|
||||||
- [ ] Update Paper (to new version / version consistent with current code!)
|
|
||||||
- [ ] Reference Paper in DESCRIPTION file (in Description or specific tag)
|
|
||||||
- [ ] Split `cve` and `cve.call` docs.
|
|
||||||
- [ ] "Copy" form `dr` package (specifically `dr.directions` -> description)
|
|
||||||
- [ ] Document `C` code.
|
|
||||||
|
|
||||||
Methods to be implemented:
|
|
||||||
- [x] simple
|
|
||||||
- [x] weighted
|
|
||||||
- [x] momentum
|
|
||||||
- [x] weighted with momentum
|
|
||||||
|
|
||||||
Performance:
|
|
||||||
- [x] Pure C implementation.
|
|
||||||
- [NOT Feasible] Stochastic Version
|
|
||||||
- [NOT Feasible] Gradient Approximations (using Algebraic Software for alternative Loss function formulations and gradient optimizations)
|
|
||||||
- [NOT Sufficient] Alternative Kernels for reducing samples
|
|
||||||
- [x] (To Be further investigated) "Kronecker" optimization
|
|
||||||
- [ ] Implement "Kronecker" optimization
|
|
||||||
|
|
||||||
Features (functions):
|
|
||||||
- [x] Initial `V.init` parameter (only ONE try, ignore number of `attempts` parameter)
|
|
||||||
- [x] `basis.cve` list of estimated `B`s (with `k` supplied, only `B`)
|
|
||||||
- [x] `directions.cve` Projected `X` given `k`
|
|
||||||
- [x] `predict.cve` using `mars` for predicting responses given new data.
|
|
||||||
- [x] `predict.dim.cve` Cross-validation or `aov` (in stats package) or "elbow" estimation
|
|
||||||
- [x] `plot.elbow`
|
|
||||||
- [x] `summary`
|
|
||||||
- [ ] Consider `cor.test` for dimension selection
|
|
||||||
- [x] Check for user interrupt (`R_CheckUserInterrupt`)
|
|
||||||
|
|
||||||
Changes:
|
|
||||||
- [x] New `estimate.bandwidth` implementation.
|
|
||||||
(h = 2 * (tr(\Sigma) / p) * (6/5 * n^(-1 / (4 + k)))^2,
|
|
||||||
\Sigma = 1/n * (X-mean)'(X-mean))
|
|
||||||
|
|
||||||
Errors:
|
|
||||||
- [x] `CVE_C` compare to `CVE_legacy`.
|
|
||||||
- [x] fix: `predict.dim` not found.
|
|
||||||
- [ ] fix/check: error computation for break condition (in `cve.c`)
|
|
||||||
|
|
||||||
# Development
|
|
||||||
## Build and install.
|
|
||||||
To build the package the `devtools` package is used. This also provides `roxygen2` which is used for documentation and automatic creation of the `NAMESPACE` file.
|
|
||||||
```R
|
```R
|
||||||
setwd("./CVE_R") # Set path to the package root.
|
install.packages("mda")
|
||||||
library(devtools) # Load required `devtools` package.
|
|
||||||
document() # Create `.Rd` files and write `NAMESPACE`.
|
|
||||||
```
|
```
|
||||||
Next the package needs to be build, therefore (if pure `R` package, aka. `C/C++`, `Fortran`, ... code) just do the following.
|
A release will be available in a few days.
|
||||||
|
|
||||||
|
- Linux: Not available jet (one or two days, state 17.12.2019)
|
||||||
|
- Windows: Not available jet (one or two days, state 17.12.2019)
|
||||||
|
- MacOS: Not available jet (unknown)
|
||||||
|
|
||||||
|
Open `R` and then the following:
|
||||||
|
```R
|
||||||
|
# addapt to download file.
|
||||||
|
install.packages("path/to/cve_0.2.<end>", repos = NULL)
|
||||||
|
library(CVE) # Test installation.
|
||||||
|
```
|
||||||
|
Please consult the man-pages `?install.package` and `?library` for further information.
|
||||||
|
|
||||||
|
## Installing Source
|
||||||
|
Cloning the `CVE` repository and using `R`'s build and install routines from a terminal.
|
||||||
```bash
|
```bash
|
||||||
R CMD build CVE_C; R CMD INSTALL CVE_0.2.tar.gz
|
git clone https://git.art-ist.cc/daniel/CVE.git # Clone repository
|
||||||
```
|
cd CVE # Go into the repository
|
||||||
Then we are ready for using the package.
|
R CMD build CVE # Build package tarbal
|
||||||
As well as building the `NAMESPACE` and `*.Rd` files using `devtools` (`roxygen2`) the following resembles an entire build pipeline including checks.
|
R CMD INSTALL CVE_0.2.tar.gz # Install package
|
||||||
```bash
|
|
||||||
R -q -e 'library(devtools); setwd("CVE_C"); pkgbuild::compile_dll(); document(); pkgbuild::clean_dll()'
|
|
||||||
R CMD build CVE_C; R CMD check CVE_0.2.tar.gz;
|
|
||||||
R CMD INSTALL CVE_0.2.tar.gz
|
|
||||||
```
|
|
||||||
## Build and install from within `R`.
|
|
||||||
An alternative approach is the following.
|
|
||||||
```R
|
|
||||||
## Installing CVE (C implementation)
|
|
||||||
(setwd('~/Projects/CVE/CVE_C'))
|
|
||||||
# equiv to Rcpp::compileAttributes().
|
|
||||||
library(devtools)
|
|
||||||
pkgbuild::compile_dll() # required for packages with C/C++ code
|
|
||||||
document() # See bug: https://github.com/stan-dev/rstantools/issues/52
|
|
||||||
pkgbuild::clean_dll()
|
|
||||||
(path <- build(vignettes = FALSE))
|
|
||||||
install.packages(path, repos = NULL, type = "source")
|
|
||||||
library(CVE)
|
|
||||||
```
|
|
||||||
**Note: I only recommend this approach during development.**
|
|
||||||
|
|
||||||
# Package Structure
|
|
||||||
|
|
||||||
## Demos
|
|
||||||
A demo is an `.R` file that lives in `demo/`. Demos are like examples but tend to
|
|
||||||
be longer. Instead of focusing on a single function, they show how to weave
|
|
||||||
together multiple functions to solve a problem.
|
|
||||||
|
|
||||||
You list and access demos with `demo()`:
|
|
||||||
* Show all available demos: `demo()`.
|
|
||||||
* Show all demos in a package: `demo(package = "CVE")`.
|
|
||||||
* Run a specific demo: `demo("runtime_test", package = "CVE")`.
|
|
||||||
* Find a demo: `system.file("demo", "runtime_test.R", package = "CVE")`.
|
|
||||||
|
|
||||||
Each demo must be listed in `demo/00Index` in the following form:
|
|
||||||
`demo-name Demo description`.
|
|
||||||
The demo name is the name of the file without the extension,
|
|
||||||
e.g. `demo/runtime_test.R` becomes `runtime_test`.
|
|
||||||
|
|
||||||
By default the demo ask for human input for each plot: "Hit to see next plot".
|
|
||||||
This behavior can be overridden by adding `devAskNewPage(ask = FALSE)` to
|
|
||||||
the demo file. You can add pauses by adding:
|
|
||||||
`readline("press any key to continue")`.
|
|
||||||
|
|
||||||
**Note**: Demos are not automatically tested by `R CMD check`. This means that they
|
|
||||||
can easily break without your knowledge.
|
|
||||||
|
|
||||||
|
|
||||||
# General Notes for Source Code analysis
|
|
||||||
## Search in multiple files.
|
|
||||||
Using the Linux `grep` program with the parameters `-rnw` and specifying a include files filter like the following example.
|
|
||||||
```bash
|
|
||||||
grep --include=*\.{c,h,R} -rnw '.' -e "sweep"
|
|
||||||
```
|
|
||||||
searches in all `C` source and header files as well as `R` source files for the term _sweep_.
|
|
||||||
|
|
||||||
## Recursive directory compare with colored structure (more or less).
|
|
||||||
```bash
|
|
||||||
diff -r CVE_R/ CVE_C/ | grep -E "^([<>]|[^<>].*)"
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Parsing `bash` script parameters.
|
### Windows / macOS
|
||||||
```bash
|
Installing from source (for any package which contains compiled code, in our case `C`) on Windows and MacOS requires additional tools.
|
||||||
usage="$0 [-v|--verbose] [-n|--dry-run] [(-s|--stack-size) <size>] [-h|--help] [-- [p1, [p2, ...]]]"
|
See _R Installation and Administration_ from [r-project manuals](https://cran.r-project.org/manuals.html).
|
||||||
verbose=false
|
|
||||||
help=false
|
|
||||||
dry_run=false
|
|
||||||
stack_size=0
|
|
||||||
|
|
||||||
while [ $# -gt 0 ]; do
|
# Repository Structure
|
||||||
case "$1" in
|
The repository is structured in two directories, the `CVE/` directory which is the `R` package root and `simulations/` where all simulation scripts can be found (and `README.md` which is this).
|
||||||
-v | --verbose ) verbose=true; shift ;;
|
|
||||||
-n | --dry-run ) dry_run=true; shift ;;
|
|
||||||
-s | --stack-size ) stack_size="$2"; shift; shift ;;
|
|
||||||
-h | --help ) echo $usage; exit ;; # On help print usage and exit.
|
|
||||||
-- ) shift; break ;; # Break param "parsing".
|
|
||||||
* ) echo $usage >&2; exit 1 ;; # Print usage and exit with failure.
|
|
||||||
esac
|
|
||||||
done
|
|
||||||
|
|
||||||
echo verbose=$verbose
|
|
||||||
echo dry_run=$dry_run
|
|
||||||
echo stack_size=$stack_size
|
|
||||||
```
|
|
||||||
|
|
||||||
# Analysis
|
|
||||||
## Logging (a `cve` run).
|
|
||||||
To log `loss`, `error` (estimated) the true error (error of current estimated `B` against the true `B`) or even the step size one can use the `logger` parameter. A `logger` is a function that gets the current `environment` of the CVE optimization methods (__do not alter this environment, only read from it__). This can be used to create logs like in the following example.
|
|
||||||
```R
|
|
||||||
library(CVE)
|
|
||||||
|
|
||||||
# Setup histories.
|
|
||||||
(epochs <- 50)
|
|
||||||
(attempts <- 10)
|
|
||||||
loss.history <- matrix(NA, epochs + 1, attempts)
|
|
||||||
error.history <- matrix(NA, epochs + 1, attempts)
|
|
||||||
tau.history <- matrix(NA, epochs + 1, attempts)
|
|
||||||
true.error.history <- matrix(NA, epochs + 1, attempts)
|
|
||||||
|
|
||||||
# Create a dataset
|
|
||||||
ds <- dataset("M1")
|
|
||||||
X <- ds$X
|
|
||||||
Y <- ds$Y
|
|
||||||
B <- ds$B # the true `B`
|
|
||||||
(k <- ncol(ds$B))
|
|
||||||
|
|
||||||
# True projection matrix.
|
|
||||||
P <- B %*% solve(t(B) %*% B) %*% t(B)
|
|
||||||
# Define the logger for the `cve()` method.
|
|
||||||
logger <- function(env) {
|
|
||||||
# Note the `<<-` assignement!
|
|
||||||
loss.history[env$epoch + 1, env$attempt] <<- env$loss
|
|
||||||
error.history[env$epoch + 1, env$attempt] <<- env$error
|
|
||||||
tau.history[env$epoch + 1, env$attempt] <<- env$tau
|
|
||||||
# Compute true error by comparing to the true `B`
|
|
||||||
B.est <- null(env$V) # Function provided by CVE
|
|
||||||
P.est <- B.est %*% solve(t(B.est) %*% B.est) %*% t(B.est)
|
|
||||||
true.error <- norm(P - P.est, 'F') / sqrt(2 * k)
|
|
||||||
true.error.history[env$epoch + 1, env$attempt] <<- true.error
|
|
||||||
}
|
|
||||||
# Performa SDR
|
|
||||||
dr <- cve(Y ~ X, k = k, logger = logger, epochs = epochs, attempts = attempts)
|
|
||||||
# Plot history's
|
|
||||||
par(mfrow = c(2, 2))
|
|
||||||
matplot(loss.history, type = 'l', log = 'y', xlab = 'iter',
|
|
||||||
main = 'loss', ylab = expression(L(V[iter])))
|
|
||||||
matplot(error.history, type = 'l', log = 'y', xlab = 'iter',
|
|
||||||
main = 'error', ylab = 'error')
|
|
||||||
matplot(tau.history, type = 'l', log = 'y', xlab = 'iter',
|
|
||||||
main = 'tau', ylab = 'tau')
|
|
||||||
matplot(true.error.history, type = 'l', log = 'y', xlab = 'iter',
|
|
||||||
main = 'true error', ylab = 'true error')
|
|
||||||
```
|
|
||||||
|
|
||||||
## Reading log files.
|
|
||||||
The run-time tests (upcoming further tests) are creating log files saved in `tmp/`. These log files are `CSV` files (actually `TSV`) with a header storing the test results. Depending on the test the files may contain different data. As an example we use the run-time test logs which store in each line the `dataset`, the used `method` as well as the `error` (actual error of estimated `B` against real `B`) and the `time`. For reading and analyzing the data see the following example.
|
|
||||||
```R
|
|
||||||
# Load log as `data.frame`
|
|
||||||
log <- read.csv('tmp/test0.log', sep = '\t')
|
|
||||||
# Create a error boxplot grouped by dataset.
|
|
||||||
boxplot(error ~ dataset, log)
|
|
||||||
|
|
||||||
# Overview
|
|
||||||
for (ds.name in paste0('M', seq(5))) {
|
|
||||||
ds <- subset(log, dataset == ds.name, select = c('method', 'dataset', 'time', 'error'))
|
|
||||||
print(summary(ds))
|
|
||||||
}
|
|
||||||
```
|
|
||||||
|
|
||||||
## Environments and variable lookup.
|
|
||||||
In the following a view simple examples of how `R` searches for variables.
|
|
||||||
In addition we manipulate function closures to alter the search path in variable lookup and outer scope variable manipulation.
|
|
||||||
```R
|
|
||||||
droids <- "These aren't the droids you're looking for."
|
|
||||||
|
|
||||||
search <- function() {
|
|
||||||
print(droids)
|
|
||||||
}
|
|
||||||
|
|
||||||
trooper.seeks <- function() {
|
|
||||||
droids <- c("R2-D2", "C-3PO")
|
|
||||||
search()
|
|
||||||
}
|
|
||||||
|
|
||||||
jedi.seeks <- function() {
|
|
||||||
droids <- c("R2-D2", "C-3PO")
|
|
||||||
environment(search) <- environment()
|
|
||||||
search()
|
|
||||||
}
|
|
||||||
|
|
||||||
trooper.seeks()
|
|
||||||
# [1] "These aren't the droids you're looking for."
|
|
||||||
jedi.seeks()
|
|
||||||
# [1] "R2-D2", "C-3PO"
|
|
||||||
```
|
|
||||||
|
|
||||||
The next example illustrates how to write (without local copies) to variables outside the functions local environment.
|
|
||||||
```R
|
|
||||||
counting <- function() {
|
|
||||||
count <<- count + 1 # Note the `<<-` assignment.
|
|
||||||
}
|
|
||||||
|
|
||||||
(function() {
|
|
||||||
environment(counting) <- environment()
|
|
||||||
count <- 0
|
|
||||||
|
|
||||||
for (i in 1:10) {
|
|
||||||
counting()
|
|
||||||
}
|
|
||||||
|
|
||||||
return(count)
|
|
||||||
})()
|
|
||||||
|
|
||||||
(function () {
|
|
||||||
closure <- new.env()
|
|
||||||
environment(counting) <- closure
|
|
||||||
assign("count", 0, envir = closure)
|
|
||||||
|
|
||||||
for (i in 1:10) {
|
|
||||||
counting()
|
|
||||||
}
|
|
||||||
|
|
||||||
return(closure$count)
|
|
||||||
})()
|
|
||||||
```
|
|
||||||
|
|
||||||
Another example for the usage of `do.call` where the evaluation of parameters is illustrated (example taken (and altered) from `?do.call`).
|
|
||||||
```R
|
|
||||||
## examples of where objects will be found.
|
|
||||||
A <- "A.Global"
|
|
||||||
f <- function(x) print(paste("f.new", x))
|
|
||||||
env <- new.env()
|
|
||||||
assign("A", "A.new", envir = env)
|
|
||||||
assign("f", f, envir = env)
|
|
||||||
f <- function(x) print(paste("f.Global", x))
|
|
||||||
f(A) # f.Global A.Global
|
|
||||||
do.call("f", list(A)) # f.Global A.Global
|
|
||||||
do.call("f", list(A), envir = env) # f.new A.Global
|
|
||||||
do.call(f, list(A), envir = env) # f.Global A.Global
|
|
||||||
do.call("f", list(quote(A)), envir = env) # f.new A.new
|
|
||||||
do.call(f, list(quote(A)), envir = env) # f.Global A.new
|
|
||||||
do.call("f", list(as.name("A")), envir = env) # f.new A.new
|
|
||||||
do.call("f", list(as.name("A")), envir = env) # f.new A.new
|
|
||||||
```
|
|
||||||
|
|
||||||
# Performance benchmarks
|
|
||||||
In this section alternative implementations of simple algorithms are compared for there performance.
|
|
||||||
|
|
||||||
### Computing the trace of a matrix multiplication.
|
|
||||||
```R
|
|
||||||
library(microbenchmark)
|
|
||||||
|
|
||||||
A <- matrix(runif(120), 12, 10)
|
|
||||||
|
|
||||||
# Check correctnes and benckmark performance.
|
|
||||||
stopifnot(
|
|
||||||
all.equal(
|
|
||||||
sum(diag(t(A) %*% A)), sum(diag(crossprod(A, A)))
|
|
||||||
),
|
|
||||||
all.equal(
|
|
||||||
sum(diag(t(A) %*% A)), sum(A * A)
|
|
||||||
)
|
|
||||||
)
|
|
||||||
microbenchmark(
|
|
||||||
MM = sum(diag(t(A) %*% A)),
|
|
||||||
cross = sum(diag(crossprod(A, A))),
|
|
||||||
elem = sum(A * A)
|
|
||||||
)
|
|
||||||
# Unit: nanoseconds
|
|
||||||
# expr min lq mean median uq max neval
|
|
||||||
# MM 4232 4570.0 5138.81 4737 4956.0 40308 100
|
|
||||||
# cross 2523 2774.5 2974.93 2946 3114.5 5078 100
|
|
||||||
# elem 582 762.5 973.02 834 964.0 12945 100
|
|
||||||
```
|
|
||||||
|
|
||||||
```R
|
|
||||||
n <- 200
|
|
||||||
M <- matrix(runif(n^2), n, n)
|
|
||||||
|
|
||||||
dnorm2 <- function(x) exp(-0.5 * x^2) / sqrt(2 * pi)
|
|
||||||
|
|
||||||
stopifnot(
|
|
||||||
all.equal(dnorm(M), dnorm2(M))
|
|
||||||
)
|
|
||||||
microbenchmark(
|
|
||||||
dnorm = dnorm(M),
|
|
||||||
dnorm2 = dnorm2(M),
|
|
||||||
exp = exp(-0.5 * M^2) # without scaling -> irrelevant for usage
|
|
||||||
)
|
|
||||||
# Unit: microseconds
|
|
||||||
# expr min lq mean median uq max neval
|
|
||||||
# dnorm 841.503 843.811 920.7828 855.7505 912.4720 2405.587 100
|
|
||||||
# dnorm2 543.510 580.319 629.5321 597.8540 607.3795 2603.763 100
|
|
||||||
# exp 502.083 535.943 577.2884 548.3745 561.3280 2113.220 100
|
|
||||||
```
|
|
||||||
|
|
||||||
### Using `crosspord()`
|
|
||||||
```R
|
|
||||||
p <- 12
|
|
||||||
q <- 10
|
|
||||||
V <- matrix(runif(p * q), p, q)
|
|
||||||
|
|
||||||
stopifnot(
|
|
||||||
all.equal(V %*% t(V), tcrossprod(V)),
|
|
||||||
all.equal(V %*% t(V), tcrossprod(V, V))
|
|
||||||
)
|
|
||||||
microbenchmark(
|
|
||||||
V %*% t(V),
|
|
||||||
tcrossprod(V),
|
|
||||||
tcrossprod(V, V)
|
|
||||||
)
|
|
||||||
# Unit: microseconds
|
|
||||||
# expr min lq mean median uq max neval
|
|
||||||
# V %*% t(V) 2.293 2.6335 2.94673 2.7375 2.9060 19.592 100
|
|
||||||
# tcrossprod(V) 1.148 1.2475 1.86173 1.3440 1.4650 30.688 100
|
|
||||||
# tcrossprod(V, V) 1.003 1.1575 1.28451 1.2400 1.3685 2.742 100
|
|
||||||
```
|
|
||||||
|
|
||||||
### Recycling vs. Sweep
|
|
||||||
```R
|
|
||||||
(n <- 200)
|
|
||||||
(p <- 12)
|
|
||||||
(q <- 10)
|
|
||||||
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
|
|
||||||
V <- matrix(rnorm(p * q), p, q)
|
|
||||||
vecS <- runif(n * (n - 1) / 2)
|
|
||||||
|
|
||||||
stopifnot(
|
|
||||||
all.equal((X_diff %*% V) * rep(vecS, q),
|
|
||||||
sweep(X_diff %*% V, 1, vecS, `*`)),
|
|
||||||
all.equal((X_diff %*% V) * rep(vecS, q),
|
|
||||||
(X_diff %*% V) * vecS)
|
|
||||||
)
|
|
||||||
microbenchmark(
|
|
||||||
rep = (X_diff %*% V) * rep(vecS, q),
|
|
||||||
sweep = sweep(X_diff %*% V, 1, vecS, `*`, check.margin = FALSE),
|
|
||||||
recycle = (X_diff %*% V) * vecS
|
|
||||||
)
|
|
||||||
# Unit: microseconds
|
|
||||||
# expr min lq mean median uq max neval
|
|
||||||
# rep 851.723 988.3655 1575.639 1203.6385 1440.578 18999.23 100
|
|
||||||
# sweep 1313.177 1522.4010 2355.269 1879.2605 2065.399 18783.24 100
|
|
||||||
# recycle 719.001 786.1265 1157.285 881.8825 1163.202 19091.79 100
|
|
||||||
```
|
|
||||||
### Scaled `crossprod` with matrix multiplication order.
|
|
||||||
```R
|
|
||||||
(n <- 200)
|
|
||||||
(p <- 12)
|
|
||||||
(q <- 10)
|
|
||||||
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
|
|
||||||
V <- matrix(rnorm(p * q), p, q)
|
|
||||||
vecS <- runif(n * (n - 1) / 2)
|
|
||||||
|
|
||||||
ref <- crossprod(X_diff, X_diff * vecS) %*% V
|
|
||||||
stopifnot(
|
|
||||||
all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS)),
|
|
||||||
all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS))
|
|
||||||
)
|
|
||||||
microbenchmark(
|
|
||||||
inner = crossprod(X_diff, X_diff * vecS) %*% V,
|
|
||||||
outer = crossprod(X_diff, (X_diff %*% V) * vecS)
|
|
||||||
)
|
|
||||||
# Unit: microseconds
|
|
||||||
# expr min lq mean median uq max neval
|
|
||||||
# inner 789.065 867.939 1683.812 987.9375 1290.055 16800.265 100
|
|
||||||
# outer 1141.479 1216.929 1404.702 1317.7315 1582.800 2531.766 100
|
|
||||||
```
|
|
||||||
|
|
||||||
### Fast dist matrix computation (aka. row sum of squares).
|
|
||||||
```R
|
|
||||||
library(microbenchmark)
|
|
||||||
library(CVE)
|
|
||||||
|
|
||||||
(n <- 200)
|
|
||||||
(N <- n * (n - 1) / 2)
|
|
||||||
(p <- 12)
|
|
||||||
M <- matrix(runif(N * p), N, p)
|
|
||||||
|
|
||||||
stopifnot(
|
|
||||||
all.equal(rowSums(M^2), rowSums.c(M^2)),
|
|
||||||
all.equal(rowSums(M^2), rowSquareSums.c(M))
|
|
||||||
)
|
|
||||||
microbenchmark(
|
|
||||||
sums = rowSums(M^2),
|
|
||||||
sums.c = rowSums.c(M^2),
|
|
||||||
sqSums.c = rowSquareSums.c(M)
|
|
||||||
)
|
|
||||||
# Unit: microseconds
|
|
||||||
# expr min lq mean median uq max neval
|
|
||||||
# sums 666.311 1051.036 1612.3100 1139.0065 1547.657 13940.97 100
|
|
||||||
# sums.c 342.647 672.453 1009.9109 740.6255 1224.715 13765.90 100
|
|
||||||
# sqSums.c 115.325 142.128 175.6242 153.4645 169.678 759.87 100
|
|
||||||
```
|
|
||||||
|
|
||||||
## Using `Rprof()` for performance.
|
|
||||||
The standard method for profiling where an algorithm is spending its time is with `Rprof()`.
|
|
||||||
```R
|
|
||||||
path <- '../tmp/R.prof' # path to profiling file
|
|
||||||
Rprof(path)
|
|
||||||
cve.res <- cve.call(X, Y, k = k)
|
|
||||||
Rprof(NULL)
|
|
||||||
(prof <- summaryRprof(path)) # Summarise results
|
|
||||||
```
|
|
||||||
**Note: consider to run `gc()` before measuring**, aka cleaning up by explicitly calling the garbage collector.
|
|
||||||
|
|
|
@ -0,0 +1,80 @@
|
||||||
|
library(ISLR) # for Hitters dataset
|
||||||
|
library(MAVE)
|
||||||
|
library(CVE)
|
||||||
|
|
||||||
|
# Set global parameters.
|
||||||
|
seed <- 21
|
||||||
|
max.dim <- 5L
|
||||||
|
attempts <- 25L
|
||||||
|
max.iter <- 100L
|
||||||
|
# momentum <- 0.0
|
||||||
|
|
||||||
|
# Prepair data for analysis.
|
||||||
|
cols <- c("Salary", "AtBat", "Hits", "HmRun", "Runs", "RBI",
|
||||||
|
"Walks", "Years", "CAtBat","CHits", "CHmRun","CRuns",
|
||||||
|
"CRBI", "CWalks", "PutOuts", "Assists", "Errors")
|
||||||
|
outliers <- c(92, # Gary Pettis
|
||||||
|
120, # John Moses
|
||||||
|
173, # Milt Thompson
|
||||||
|
189, # Rick Burleson
|
||||||
|
220, # Scott Fletcher
|
||||||
|
230, # Tom Foley
|
||||||
|
241) # Terry Puhl
|
||||||
|
# Subselect as a matrix without outliers as well as reordered
|
||||||
|
# and filtered columns.
|
||||||
|
ds <- na.omit(Hitters[, cols])[-outliers, ]
|
||||||
|
ds$Salary <- log(ds$Salary)
|
||||||
|
ds <- scale(ds, center = TRUE, scale = TRUE)
|
||||||
|
# Split into data and responce.
|
||||||
|
X <- as.matrix(ds[, colnames(ds) != "Salary"])
|
||||||
|
Y <- as.matrix(ds[, "Salary"])
|
||||||
|
|
||||||
|
|
||||||
|
pdf("results/hitters.pdf", width = 15, height = 5)
|
||||||
|
layout(matrix(c(1, 2), nrow = 1))
|
||||||
|
cat("Seed: ", seed, "\n\n")
|
||||||
|
################################################################################
|
||||||
|
### meanMAVE ###
|
||||||
|
################################################################################
|
||||||
|
set.seed(seed)
|
||||||
|
|
||||||
|
dr <- mave(Y ~ X, method = "meanMAVE", max.dim = max.dim)
|
||||||
|
dim <- which.min(mave.dim(dr)$cv)
|
||||||
|
B <- coef(dr, 2L)
|
||||||
|
# train linear model on reduced data
|
||||||
|
data <- as.data.frame(cbind(Y, X %*% B))
|
||||||
|
colnames(data) <- c("Y", "dir1", "dir2")
|
||||||
|
model <- lm(Y ~ dir1 + I(dir1^2) + dir2, data = data)
|
||||||
|
# Print and log.
|
||||||
|
cat("### meanMAVE\n\nest.dim: ", dim, '\n\n')
|
||||||
|
summary(model)
|
||||||
|
plot(x = data[, "dir1"], y = data[, "Y"],
|
||||||
|
main = "meanMave", xlab = "dir1", ylab = "Y")
|
||||||
|
plot(x = data[, "dir2"], y = data[, "Y"],
|
||||||
|
main = "meanMave", xlab = "dir2", ylab = "Y")
|
||||||
|
|
||||||
|
################################################################################
|
||||||
|
### CVE ###
|
||||||
|
################################################################################
|
||||||
|
set.seed(seed)
|
||||||
|
|
||||||
|
dr <- cve(Y ~ X, max.dim = max.dim, max.iter = max.iter, attempts = attempts)
|
||||||
|
dim <- predict_dim(dr, method = "cv")$k
|
||||||
|
B <- coef(dr, 2L)
|
||||||
|
# Determine 1st direction (cause CVE does not care for column order).
|
||||||
|
if (which.max(abs(t(coef(dr, 1L)) %*% B)) == 2) {
|
||||||
|
B <- B[, c(2, 1)] # switch directions
|
||||||
|
}
|
||||||
|
# train linear model on reduced data
|
||||||
|
data <- as.data.frame(cbind(Y, X %*% B))
|
||||||
|
colnames(data) <- c("Y", "dir1", "dir2")
|
||||||
|
model <- lm(Y ~ dir1 + I(dir1^2) + dir2, data = data)
|
||||||
|
# Print and log.
|
||||||
|
cat("### CVE\n\nest.dim: ", dim, '\n\n')
|
||||||
|
summary(model)
|
||||||
|
plot(x = data[, "dir1"], y = data[, "Y"],
|
||||||
|
main = "CVE", xlab = "dir1", ylab = "Y")
|
||||||
|
plot(x = data[, "dir2"], y = data[, "Y"],
|
||||||
|
main = "CVE", xlab = "dir2", ylab = "Y")
|
||||||
|
|
||||||
|
dev.off()
|
|
@ -0,0 +1,108 @@
|
||||||
|
library(ISLR) # for Hitters dataset
|
||||||
|
library(MAVE)
|
||||||
|
library(CVE)
|
||||||
|
|
||||||
|
# Set global parameters.
|
||||||
|
seed <- 21
|
||||||
|
max.dim <- 5L
|
||||||
|
attempts <- 25L
|
||||||
|
max.iter <- 100L
|
||||||
|
momentum <- 0.0
|
||||||
|
name <- "Hitters"
|
||||||
|
set.seed(seed)
|
||||||
|
|
||||||
|
# Prepair data for analysis.
|
||||||
|
cols <- c("Salary", "AtBat", "Hits", "HmRun", "Runs", "RBI",
|
||||||
|
"Walks", "Years", "CAtBat","CHits", "CHmRun","CRuns",
|
||||||
|
"CRBI", "CWalks", "PutOuts", "Assists", "Errors")
|
||||||
|
outliers <- c(92, # Gary Pettis
|
||||||
|
120, # John Moses
|
||||||
|
173, # Milt Thompson
|
||||||
|
189, # Rick Burleson
|
||||||
|
220, # Scott Fletcher
|
||||||
|
230, # Tom Foley
|
||||||
|
241) # Terry Puhl
|
||||||
|
# Subselect as a matrix without outliers as well as reordered
|
||||||
|
# and filtered columns.
|
||||||
|
ds <- na.omit(Hitters[, cols])[-outliers, ]
|
||||||
|
ds$Salary <- log(ds$Salary)
|
||||||
|
ds <- scale(ds, center = TRUE, scale = TRUE)
|
||||||
|
# Split into data and responce.
|
||||||
|
X <- as.matrix(ds[, colnames(ds) != "Salary"])
|
||||||
|
Y <- as.matrix(ds[, "Salary"])
|
||||||
|
|
||||||
|
path <- file.path(getwd(), 'results', 'hitters_logger.pdf')
|
||||||
|
pdf(path, width = 8.27, height = 11.7) # width, height unit is inces -> A4
|
||||||
|
layout(matrix(c(1, 1,
|
||||||
|
2, 3,
|
||||||
|
4, 5), nrow = 3, byrow = TRUE))
|
||||||
|
|
||||||
|
|
||||||
|
# Setup histories.
|
||||||
|
loss.history <- matrix(NA, max.iter + 1, attempts)
|
||||||
|
error.history <- matrix(NA, max.iter + 1, attempts)
|
||||||
|
tau.history <- matrix(NA, max.iter + 1, attempts)
|
||||||
|
grad.norm.history <- matrix(NA, max.iter + 1, attempts)
|
||||||
|
# Define logger for `cve()` method.
|
||||||
|
logger <- function(attempt, iter, data) {
|
||||||
|
# Note the `<<-` assignement!
|
||||||
|
loss.history[iter + 1, attempt] <<- data$loss
|
||||||
|
error.history[iter + 1, attempt] <<- data$err
|
||||||
|
tau.history[iter + 1, attempt] <<- data$tau
|
||||||
|
grad.norm.history[iter + 1, attempt] <<- norm(data$G, 'F')
|
||||||
|
}
|
||||||
|
|
||||||
|
dr <- cve(Y ~ X, k = 2L, max.iter = max.iter, attempts = attempts,
|
||||||
|
logger = logger)
|
||||||
|
B <- coef(dr, 2L)
|
||||||
|
loss <- dr$res[['2']]$loss
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
textplot <- function(...) {
|
||||||
|
text <- unlist(list(...))
|
||||||
|
if (length(text) > 20) {
|
||||||
|
text <- c(text[1:17],
|
||||||
|
' ...... (skipped, text too long) ......',
|
||||||
|
text[c(-1, 0) + length(text)])
|
||||||
|
}
|
||||||
|
|
||||||
|
plot(NA, xlim = c(0, 1), ylim = c(0, 1),
|
||||||
|
bty = 'n', xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')
|
||||||
|
|
||||||
|
for (i in seq_along(text)) {
|
||||||
|
text(0, 1 - (i / 20),
|
||||||
|
text[[i]], pos = 4)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
# Write metadata.
|
||||||
|
textplot(
|
||||||
|
paste0("Seed value: ", seed),
|
||||||
|
"",
|
||||||
|
paste0("Dataset Name: ", name),
|
||||||
|
paste0("dim(X) = (", nrow(X), ", ", ncol(X), ")"),
|
||||||
|
paste0("dim(B) = (", nrow(B), ", ", ncol(B), ")"),
|
||||||
|
"",
|
||||||
|
paste0("CVE method: ", dr$method),
|
||||||
|
paste0("Max Iterations: ", max.iter),
|
||||||
|
paste0("Attempts: ", attempts),
|
||||||
|
paste0("Momentum: ", momentum),
|
||||||
|
"CVE call:",
|
||||||
|
paste0(" > ", format(dr$call)),
|
||||||
|
"",
|
||||||
|
paste0("True Error: ", NA),
|
||||||
|
paste0("loss: ", round(loss, 3))
|
||||||
|
)
|
||||||
|
# Plot history's
|
||||||
|
matplot(loss.history, type = 'l', log = 'y', xlab = 'i (iteration)',
|
||||||
|
main = paste('loss', name),
|
||||||
|
ylab = expression(L(V[i])))
|
||||||
|
matplot(grad.norm.history, type = 'l', log = 'y', xlab = 'i (iteration)',
|
||||||
|
main = paste('gradient norm', name),
|
||||||
|
ylab = expression(group('|', paste(nabla, L(V)), '|')[F]))
|
||||||
|
matplot(error.history, type = 'l', log = 'y', xlab = 'i (iteration)',
|
||||||
|
main = paste('error', name),
|
||||||
|
ylab = expression(group('|', V[i-1]*V[i-1]^T - V[i]*V[i]^T, '|')[F]))
|
||||||
|
matplot(tau.history, type = 'l', log = 'y', xlab = 'i (iteration)',
|
||||||
|
main = paste('learning rate', name),
|
||||||
|
ylab = expression(tau[i]))
|
|
@ -0,0 +1,61 @@
|
||||||
|
# Compairson of differend dimension reduction (dr) methods over all datasets
|
||||||
|
# with default parametrs.
|
||||||
|
#
|
||||||
|
# Results are written to a CSV file with a dataset column and its simulation nr.
|
||||||
|
#
|
||||||
|
# Note:
|
||||||
|
# All methods are called via interface functions defined in
|
||||||
|
# 'simulation_interfaces.R'. In addition this file provides a few convenience
|
||||||
|
# functions, namely: 'progress_logger' and 'subspace_dist'.
|
||||||
|
#
|
||||||
|
# Dependencies (list of packages loaded):
|
||||||
|
# - MASS
|
||||||
|
# - dr
|
||||||
|
# - MAVE
|
||||||
|
# - CVE
|
||||||
|
source('simulation_interfaces.R') # depends on MASS, dr, MAVE and CVE
|
||||||
|
|
||||||
|
# Number of simulations
|
||||||
|
NR.SIM <- 100L
|
||||||
|
DS <- 1L:7L # datasets run from 1 to 7.
|
||||||
|
|
||||||
|
# Methods for comparison, see 'simulation_interfaces.R' for their definitions.
|
||||||
|
methods <- c('CVE', 'wCVE', 'rCVE', # CVE variants
|
||||||
|
'meanOPG', 'rOPG', # OPG variants
|
||||||
|
'meanMAVE', 'rmave', # MAVE variants
|
||||||
|
'phdy', 'sir', 'save') # Methods from dr package
|
||||||
|
|
||||||
|
# Build result data matrix (repeat each dataset simulation times).
|
||||||
|
result <- matrix(NA, nrow = length(DS) * NR.SIM,
|
||||||
|
ncol = length(methods) + 1) # +1 for dataset column.
|
||||||
|
result[, 1L] <- rep(paste0("M", DS), NR.SIM)
|
||||||
|
colnames(result) <- c("dataset", methods)
|
||||||
|
|
||||||
|
# Create a progress logger for length many reports.
|
||||||
|
logger <- progress_logger(length(DS) * NR.SIM * length(methods))
|
||||||
|
for (sim in seq_len(nrow(result))) {
|
||||||
|
# Create new dataset.
|
||||||
|
with(dataset(result[sim, 1L]), { X <<- X; Y <<- Y; B <<- B })
|
||||||
|
|
||||||
|
# And each reference method.
|
||||||
|
for (method in methods) {
|
||||||
|
# Report simulation progress to user.
|
||||||
|
logger(method)
|
||||||
|
|
||||||
|
# Try/Catch to avoid simulation stop on error, this should not
|
||||||
|
# accure for CVE but may be the case for others like OPG.
|
||||||
|
tryCatch({
|
||||||
|
# Call dimension reduction method.
|
||||||
|
dr <- eval(call(method, X = X, Y = Y, k = ncol(B)))
|
||||||
|
# Compute distance of estimated `B` to true `B`.
|
||||||
|
result[sim, method] <- subspace_dist(B, coef(dr, ncol(B)))
|
||||||
|
}, error = function(e) {}) # No-Operation error handler!
|
||||||
|
}
|
||||||
|
# explicit call to the garbage collector.
|
||||||
|
gc()
|
||||||
|
}
|
||||||
|
|
||||||
|
path <- paste0(getwd(),
|
||||||
|
format(Sys.time(), '/results/method_compair_%Y-%m-%dT%H%M%S'))
|
||||||
|
# Write entire simulation results into a single file.
|
||||||
|
write.csv(result, file = paste0(path, ".csv"), row.names = FALSE)
|
|
@ -0,0 +1,73 @@
|
||||||
|
# Specialized comparison for dataset M2 with others than the default parameters.
|
||||||
|
#
|
||||||
|
# Results are written to a CSV file with a dataset column and its simulation nr.
|
||||||
|
#
|
||||||
|
# Note:
|
||||||
|
# All methods are called via interface functions defined in
|
||||||
|
# 'simulation_interfaces.R'. In addition this file provides a few convenience
|
||||||
|
# functions, namely: 'progress_logger' and 'subspace_dist'.
|
||||||
|
#
|
||||||
|
# Dependencies (list of packages loaded):
|
||||||
|
# - MASS
|
||||||
|
# - dr
|
||||||
|
# - MAVE
|
||||||
|
# - CVE
|
||||||
|
source('simulation_interfaces.R') # depends on MASS, dr, MAVE and CVE
|
||||||
|
|
||||||
|
# Number of simulations
|
||||||
|
NR.SIM <- 100L
|
||||||
|
|
||||||
|
# Dataset param grid
|
||||||
|
grid <- expand.grid(
|
||||||
|
pmix = c(0.3, 0.4, 0.5),
|
||||||
|
lambda = c(0, 0.5, 1, 1.5)
|
||||||
|
)
|
||||||
|
|
||||||
|
# Methods for comparison, see 'simulation_interfaces.R' for their definitions.
|
||||||
|
methods <- c('CVE', 'wCVE', 'rCVE', 'meanOPG', 'meanMAVE')
|
||||||
|
|
||||||
|
# Build result data matrix (repeat each dataset simulation times).
|
||||||
|
result <- matrix(NA, nrow = nrow(grid) * NR.SIM,
|
||||||
|
ncol = ncol(grid) + length(methods))
|
||||||
|
colnames(result) <- c(colnames(grid), methods)
|
||||||
|
|
||||||
|
# Create a progress logger for length many reports.
|
||||||
|
logger <- progress_logger(nrow(grid) * NR.SIM * length(methods))
|
||||||
|
count <- 0
|
||||||
|
for (sim in seq_len(NR.SIM)) {
|
||||||
|
for (i in seq_len(nrow(grid))) {
|
||||||
|
# Create new dataset.
|
||||||
|
pmix <- grid[i, "pmix"]
|
||||||
|
lambda <- grid[i, "lambda"]
|
||||||
|
|
||||||
|
with(dataset("M2", pmix = pmix, lambda = lambda), {
|
||||||
|
X <<- X; Y <<- Y; B <<- B
|
||||||
|
})
|
||||||
|
|
||||||
|
count <- count + 1
|
||||||
|
result[count, "pmix"] <- pmix
|
||||||
|
result[count, "lambda"] <- lambda
|
||||||
|
|
||||||
|
# And each reference method.
|
||||||
|
for (method in methods) {
|
||||||
|
# Report simulation progress to user.
|
||||||
|
logger(method)
|
||||||
|
|
||||||
|
# Try/Catch to avoid simulation stop on error, this should not
|
||||||
|
# accure for CVE but may be the case for others like OPG.
|
||||||
|
tryCatch({
|
||||||
|
# Call dimension reduction method.
|
||||||
|
dr <- eval(call(method, X = X, Y = Y, k = ncol(B)))
|
||||||
|
# Compute distance of estimated `B` to true `B`.
|
||||||
|
result[count, method] <- subspace_dist(B, coef(dr, ncol(B)))
|
||||||
|
}, error = function(e) {}) # No-Operation error handler!
|
||||||
|
}
|
||||||
|
}
|
||||||
|
# explicit call to the garbage collector.
|
||||||
|
gc()
|
||||||
|
}
|
||||||
|
|
||||||
|
path <- paste0(getwd(),
|
||||||
|
format(Sys.time(), '/results/method_compair_M2_%Y-%m-%dT%H%M%S'))
|
||||||
|
# Write entire simulation results into a single file.
|
||||||
|
write.csv(result, file = paste0(path, ".csv"), row.names = FALSE)
|
|
@ -0,0 +1,69 @@
|
||||||
|
### Method Compair plots.
|
||||||
|
for (file in list.files('results', pattern = 'method_compair_[0-9T-]+\\.csv')) {
|
||||||
|
# Read simulation result data set.
|
||||||
|
ds <- read.csv(file.path('results', file), stringsAsFactors = FALSE)
|
||||||
|
# Get dataset groups
|
||||||
|
datasets <- unlist(unique(ds["dataset"]))
|
||||||
|
# All seperate dataset plots into one file.
|
||||||
|
pdf(file.path('results', sub('.csv', '.pdf', file, fixed = TRUE)))
|
||||||
|
for (ds.name in datasets) {
|
||||||
|
boxplot(ds[ds["dataset"] == ds.name, colnames(ds) != "dataset"],
|
||||||
|
main = ds.name, ylab = 'Error', las = 2)
|
||||||
|
}
|
||||||
|
dev.off()
|
||||||
|
# Plot each dataset seperately.
|
||||||
|
for (ds.name in datasets) {
|
||||||
|
pdf(file.path('results', sub('.csv',
|
||||||
|
paste0('_', ds.name, '.pdf'),
|
||||||
|
file, fixed = TRUE)))
|
||||||
|
boxplot(ds[ds["dataset"] == ds.name, colnames(ds) != "dataset"],
|
||||||
|
main = ds.name, ylab = 'Error', las = 2)
|
||||||
|
dev.off()
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
### Method Compair plots for parameterized M2.
|
||||||
|
for (file in list.files('results', 'method_compair_M2_[0-9T-]+\\.csv')) {
|
||||||
|
# Read simulation results.
|
||||||
|
ds <- read.csv(file.path('results', file), stringsAsFactors = FALSE)
|
||||||
|
# Extract M2 dataset parameters.
|
||||||
|
pmix <- unique(ds$pmix)
|
||||||
|
lambda <- unique(ds$lambda)
|
||||||
|
# setup pdf file and plot layout.
|
||||||
|
pdf(file.path('results', sub('.csv', '.pdf', file, fixed = TRUE)),
|
||||||
|
width = 15, height = 10)
|
||||||
|
layout(matrix(seq(length(pmix) * length(lambda)),
|
||||||
|
ncol = length(lambda), byrow = TRUE))
|
||||||
|
for (p in pmix) {
|
||||||
|
for (l in lambda) {
|
||||||
|
boxplot(ds[ds$pmix == p & ds$lambda == l,
|
||||||
|
!(colnames(ds) %in% c("pmix", "lambda"))],
|
||||||
|
main = substitute(expression(paste(
|
||||||
|
lambda, ' = ', l, ', ', p[mix], ' = ', q
|
||||||
|
)), list(l = l, q = p)))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
dev.off()
|
||||||
|
}
|
||||||
|
|
||||||
|
### Analyse predict dimension simulation.
|
||||||
|
for (file in list.files('results', 'predict_dim_compair_[0-9T-]+\\.csv')) {
|
||||||
|
# Read simulation result data set.
|
||||||
|
ds <- read.csv(file.path('results', file), stringsAsFactors = FALSE)
|
||||||
|
# Over all datasets (including M2 with default parameters)
|
||||||
|
true.dim <- ds$true.dim
|
||||||
|
dataset <- ds$dataset
|
||||||
|
df <- ds[!(colnames(ds) %in% c("true.dim", "dataset"))]
|
||||||
|
accuracy <- data.frame(
|
||||||
|
all = colMeans(df == true.dim),
|
||||||
|
M1 = colMeans(df[dataset == "M1", ] == true.dim[dataset == "M1"]),
|
||||||
|
M2 = colMeans(df[dataset == "M2", ] == true.dim[dataset == "M2"]),
|
||||||
|
M3 = colMeans(df[dataset == "M3", ] == true.dim[dataset == "M3"]),
|
||||||
|
M4 = colMeans(df[dataset == "M4", ] == true.dim[dataset == "M4"]),
|
||||||
|
M5 = colMeans(df[dataset == "M5", ] == true.dim[dataset == "M5"]),
|
||||||
|
M6 = colMeans(df[dataset == "M6", ] == true.dim[dataset == "M6"]),
|
||||||
|
M7 = colMeans(df[dataset == "M7", ] == true.dim[dataset == "M7"])
|
||||||
|
)
|
||||||
|
cat("Accuracy of simulation:", file, '\n')
|
||||||
|
print(accuracy)
|
||||||
|
}
|
|
@ -0,0 +1,54 @@
|
||||||
|
# Comparison of dimension prediction methods including maves cross-validation
|
||||||
|
# with meanMAVE as basis for reference.
|
||||||
|
#
|
||||||
|
# Results are written to a CSV file with a dataset column and its simulation nr.
|
||||||
|
#
|
||||||
|
# Note:
|
||||||
|
# All methods are called via interface functions defined in
|
||||||
|
# 'simulation_interfaces.R'. In addition this file provides a few convenience
|
||||||
|
# functions.
|
||||||
|
source('simulation_interfaces.R')
|
||||||
|
|
||||||
|
# Number of simulations
|
||||||
|
NR.SIM <- 100L
|
||||||
|
NR.DS <- 7L
|
||||||
|
max.dim <- 5L
|
||||||
|
|
||||||
|
# List of dimension prediction methods.
|
||||||
|
methods <- c("cv", "elbow", "wilcoxon")
|
||||||
|
|
||||||
|
# Build result data matrix (repeat each dataset simulation times).
|
||||||
|
result <- matrix(NA, nrow = NR.DS * NR.SIM,
|
||||||
|
ncol = length(methods) + 2) # +2 for dataset and meanMAVE.
|
||||||
|
result[, 1L] <- rep(paste0("M", seq(NR.DS)), NR.SIM)
|
||||||
|
colnames(result) <- c("dataset", "true.dim", "mave.cv", methods)
|
||||||
|
|
||||||
|
# Create a progress logger for length many reports.
|
||||||
|
logger <- progress_logger(NR.DS * NR.SIM)
|
||||||
|
for (sim in seq_len(nrow(result))) {
|
||||||
|
# Create new dataset.
|
||||||
|
with(dataset(result[sim, 1L]), { X <<- X; Y <<- Y; B <<- B })
|
||||||
|
# Write true dr dimension.
|
||||||
|
result[sim, "true.dim"] <- ncol(B)
|
||||||
|
|
||||||
|
# Report simulation progress to user.
|
||||||
|
logger('')
|
||||||
|
|
||||||
|
# Call CVE and let each dimension prediction method estimate.
|
||||||
|
dr <- cve.call(X = X, Y = Y, max.dim = max.dim)
|
||||||
|
for (method in methods) {
|
||||||
|
result[sim, method] <- predict_dim(dr, method = method)$k
|
||||||
|
}
|
||||||
|
|
||||||
|
# The same using meanMAVE and MAVE's 'mave.dim'.
|
||||||
|
dr <- meanMAVE(X = X, Y = Y, max.dim = max.dim)
|
||||||
|
result[sim, "mave.cv"] <- which.min(mave.dim(dr)$cv)
|
||||||
|
|
||||||
|
# explicit call to the garbage collector.
|
||||||
|
gc()
|
||||||
|
}
|
||||||
|
|
||||||
|
path <- paste0(getwd(),
|
||||||
|
format(Sys.time(), '/results/predict_dim_compair_%Y-%m-%dT%H%M%S'))
|
||||||
|
# Write entire simulation results into a single file.
|
||||||
|
write.csv(result, file = paste0(path, ".csv"), row.names = FALSE)
|
|
@ -0,0 +1,128 @@
|
||||||
|
# Usage:
|
||||||
|
# ~$ Rscript runtime_test.R
|
||||||
|
|
||||||
|
textplot <- function(...) {
|
||||||
|
text <- unlist(list(...))
|
||||||
|
if (length(text) > 20) {
|
||||||
|
text <- c(text[1:17],
|
||||||
|
' ...... (skipped, text too long) ......',
|
||||||
|
text[c(-1, 0) + length(text)])
|
||||||
|
}
|
||||||
|
|
||||||
|
plot(NA, xlim = c(0, 1), ylim = c(0, 1),
|
||||||
|
bty = 'n', xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')
|
||||||
|
|
||||||
|
for (i in seq_along(text)) {
|
||||||
|
text(0, 1 - (i / 20),
|
||||||
|
text[[i]], pos = 4)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
# library(CVEpureR) # load CVE's pure R implementation
|
||||||
|
library(CVE) # load CVE
|
||||||
|
|
||||||
|
#' Writes log information to console. (to not get bored^^)
|
||||||
|
tell.user <- function(name, start, i, length) {
|
||||||
|
cat("\rRunning Test (", name, "):",
|
||||||
|
i, "/", length,
|
||||||
|
" - elapsed:", format(Sys.time() - start), "\033[K")
|
||||||
|
}
|
||||||
|
#' Computes "distance" of spanned subspaces.
|
||||||
|
#' @param B1 Semi-orthonormal basis matrix
|
||||||
|
#' @param B2 Semi-orthonormal basis matrix
|
||||||
|
#' @return Frobenius norm of subspace projection matrix diff.
|
||||||
|
subspace.dist <- function(B1, B2){
|
||||||
|
P1 <- tcrossprod(B1, B1)
|
||||||
|
P2 <- tcrossprod(B2, B2)
|
||||||
|
return(norm(P1 - P2, type = 'F'))
|
||||||
|
}
|
||||||
|
|
||||||
|
# Set random seed
|
||||||
|
set.seed(437)
|
||||||
|
|
||||||
|
# Number of simulations
|
||||||
|
SIM.NR <- 50L
|
||||||
|
# maximal number of iterations in curvilinear search algorithm
|
||||||
|
MAXIT <- 50L
|
||||||
|
# number of arbitrary starting values for curvilinear optimization
|
||||||
|
ATTEMPTS <- 10L
|
||||||
|
# set names of datasets
|
||||||
|
ds.names <- paste0("M", seq(7))
|
||||||
|
# Set used CVE method
|
||||||
|
methods <- c("simple", "weighted") # c("legacy", "simple", "linesearch", "sgd")
|
||||||
|
|
||||||
|
# Setup error and time tracking variables
|
||||||
|
error <- matrix(NA, SIM.NR, length(methods) * length(ds.names))
|
||||||
|
time <- matrix(NA, SIM.NR, ncol(error))
|
||||||
|
colnames(error) <- kronecker(paste0(ds.names, '-'), methods, paste0)
|
||||||
|
colnames(time) <- colnames(error)
|
||||||
|
|
||||||
|
# Create new log file and write CSV (actualy TSV) header.
|
||||||
|
# (do not overwrite existing logs)
|
||||||
|
log.nr <- length(list.files('tmp/', pattern = '.*\\.log'))
|
||||||
|
file <- file.path('tmp', paste0('test', log.nr, '.log'))
|
||||||
|
cat('dataset\tmethod\terror\ttime\n', sep = '', file = file)
|
||||||
|
# Open a new pdf device for plotting into (do not overwrite existing ones)
|
||||||
|
path <- paste0('test', log.nr, '.pdf')
|
||||||
|
pdf(file.path('tmp', path))
|
||||||
|
cat('Plotting to file:', path, '\n')
|
||||||
|
|
||||||
|
# only for telling user (to stdout)
|
||||||
|
count <- 0
|
||||||
|
start <- Sys.time()
|
||||||
|
# Start simulation loop.
|
||||||
|
for (sim in 1:SIM.NR) {
|
||||||
|
# Repeat for each dataset.
|
||||||
|
for (name in ds.names) {
|
||||||
|
tell.user(name, start, (count <- count + 1), SIM.NR * length(ds.names))
|
||||||
|
|
||||||
|
# Create a new dataset
|
||||||
|
ds <- dataset(name)
|
||||||
|
# Prepare X, Y and combine to data matrix
|
||||||
|
Y <- ds$Y
|
||||||
|
X <- ds$X
|
||||||
|
data <- cbind(Y, X)
|
||||||
|
# get dimensions
|
||||||
|
k <- ncol(ds$B)
|
||||||
|
|
||||||
|
for (method in methods) {
|
||||||
|
dr.time <- system.time(
|
||||||
|
dr <- cve.call(X, Y,
|
||||||
|
method = method,
|
||||||
|
k = k,
|
||||||
|
attempts = ATTEMPTS
|
||||||
|
)
|
||||||
|
)
|
||||||
|
dr$B <- coef(dr, k)
|
||||||
|
|
||||||
|
key <- paste0(name, '-', method)
|
||||||
|
error[sim, key] <- subspace.dist(dr$B, ds$B) / sqrt(2 * k)
|
||||||
|
time[sim, key] <- dr.time["elapsed"]
|
||||||
|
|
||||||
|
# Log results to file (mostly for long running simulations)
|
||||||
|
cat(paste0(
|
||||||
|
c(name, method, error[sim, key], time[sim, key]),
|
||||||
|
collapse = '\t'
|
||||||
|
), '\n',
|
||||||
|
sep = '', file = file, append = TRUE
|
||||||
|
)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
cat("\n\n## Time [sec] Summary:\n")
|
||||||
|
print(summary(time))
|
||||||
|
cat("\n## Error Summary:\n")
|
||||||
|
print(summary(error))
|
||||||
|
|
||||||
|
boxplot(error,
|
||||||
|
main = paste0("Error (Nr of simulations ", SIM.NR, ")"),
|
||||||
|
ylab = "Error",
|
||||||
|
las = 2
|
||||||
|
)
|
||||||
|
boxplot(time,
|
||||||
|
main = paste0("Time (Nr of simulations ", SIM.NR, ")"),
|
||||||
|
ylab = "Time [sec]",
|
||||||
|
las = 2
|
||||||
|
)
|
||||||
|
|
|
@ -0,0 +1,226 @@
|
||||||
|
library(MASS) # required by 'dr'
|
||||||
|
library(dr) # for `dr`
|
||||||
|
library(MAVE) # for `mave`, `coef.mave`.
|
||||||
|
library(CVE)
|
||||||
|
|
||||||
|
################################################################################
|
||||||
|
### simulation utility functions ###
|
||||||
|
################################################################################
|
||||||
|
|
||||||
|
#' Calc distance between two subspaces.
|
||||||
|
#'
|
||||||
|
#' || P(A) - P(B) ||_F / sqrt(2 * ncol(A))
|
||||||
|
#'
|
||||||
|
#' such that `P(A) = A (A^T A)^-1 A^T` the projection onto
|
||||||
|
#' the subspace spanned by `A`.
|
||||||
|
#'
|
||||||
|
#' @param A matrix which rows span a subspace.
|
||||||
|
#' @param B matrix of the same dim. as `A`-
|
||||||
|
#'
|
||||||
|
#' @returns numeric metric of spanning subspaces distance.
|
||||||
|
subspace_dist <- function(A, B) {
|
||||||
|
P <- A %*% solve(t(A) %*% A, t(A))
|
||||||
|
Q <- B %*% solve(t(B) %*% B, t(B))
|
||||||
|
norm(P - Q, 'F') / sqrt(2 * ncol(A))
|
||||||
|
}
|
||||||
|
|
||||||
|
#' Creats a progress logger.
|
||||||
|
#'
|
||||||
|
#' @param len number of calls to the creted rpogress logger till finished.
|
||||||
|
#'
|
||||||
|
#' @returns progress logger function:
|
||||||
|
#' \code{logger(info)}
|
||||||
|
#' which is to be called \code{len} times with a string \code{info}.
|
||||||
|
progress_logger <- function(len) {
|
||||||
|
start <- Sys.time()
|
||||||
|
count <- 0
|
||||||
|
return(function(info) {
|
||||||
|
cat("\rRunning (", info, "):",
|
||||||
|
(count <<- count + 1), "/", len,
|
||||||
|
" - elapsed:", format(Sys.time() - start), "\033[K")
|
||||||
|
if (count == len) { cat("\n") }
|
||||||
|
})
|
||||||
|
}
|
||||||
|
|
||||||
|
################################################################################
|
||||||
|
### cve wrapper ###
|
||||||
|
################################################################################
|
||||||
|
CVE <- function(...) {
|
||||||
|
.cve <- match.call(expand.dots = TRUE)
|
||||||
|
.cve[[1L]] <- quote(cve.call)
|
||||||
|
.cve$method <- "simple"
|
||||||
|
eval(.cve)
|
||||||
|
}
|
||||||
|
wCVE <- function(...) {
|
||||||
|
.cve <- match.call(expand.dots = TRUE)
|
||||||
|
.cve[[1L]] <- quote(cve.call)
|
||||||
|
.cve$method <- "weighted"
|
||||||
|
eval(.cve)
|
||||||
|
}
|
||||||
|
rCVE <- function(...) {
|
||||||
|
.cve <- match.call(expand.dots = TRUE)
|
||||||
|
.cve[[1L]] <- quote(cve.call)
|
||||||
|
.cve$method <- "simple"
|
||||||
|
dr <- eval(.cve)
|
||||||
|
|
||||||
|
for (name in names(dr$res)) {
|
||||||
|
.cve$method <- "weighted"
|
||||||
|
.cve$k <- dr$res[[name]]$k
|
||||||
|
.cve$V.init <- dr$res[[name]]$V
|
||||||
|
dr$res[[name]] <- eval(.cve)$res[[name]]
|
||||||
|
}
|
||||||
|
dr$call <- match.call()
|
||||||
|
dr$call[[1L]] <- quote(cve)
|
||||||
|
dr$call$method <- "refined"
|
||||||
|
return(dr)
|
||||||
|
}
|
||||||
|
|
||||||
|
wls <- function(x,y,w){
|
||||||
|
x <- as.matrix(x); y <- as.vector(y)
|
||||||
|
n=dim(x)[1];
|
||||||
|
p=dim(x)[2]-1
|
||||||
|
out=c(solve(t(x*w)%*%x/n)%*%apply(x*y*w,2,mean))
|
||||||
|
return(list(a=out[1],b=out[2:(p+1)]))
|
||||||
|
}
|
||||||
|
|
||||||
|
#' Compute kernel matrix.
|
||||||
|
#'
|
||||||
|
#' @param X matrix of dim. `n x p` (`n` samples of dimension `p`)
|
||||||
|
#' @param h numeric bandwidth.
|
||||||
|
#'
|
||||||
|
#' @returns `n x n` matrix with kernel of X_i to X_j interaction
|
||||||
|
#' in its `i, j` component.
|
||||||
|
kern <- function(X, h) {
|
||||||
|
X <- as.matrix(X)
|
||||||
|
n <- nrow(X)
|
||||||
|
|
||||||
|
k2 <- X %*% t(X) # (X_i^T X_j)_ij
|
||||||
|
k3 <- matrix(diag(k2), n, n) # diag(k2) in rows
|
||||||
|
k1 <- t(k3) # diag(k2) in cols
|
||||||
|
K <- k1 - 2 * k2 + k3 # (|X_i|^2 - 2 * X_i^T X_j - |X_j|^2)_ij
|
||||||
|
return(exp((-0.5 / h^2) * K))
|
||||||
|
}
|
||||||
|
|
||||||
|
coef.opg <- function(object, ...) {
|
||||||
|
return(object$B)
|
||||||
|
}
|
||||||
|
|
||||||
|
OPG <- function(X, Y, k, c0 = 2.34) {
|
||||||
|
# Cast parameters.
|
||||||
|
X <- as.matrix(X); Y <- as.vector(Y)
|
||||||
|
# Get Dimensions
|
||||||
|
n <- nrow(X); p <- ncol(X)
|
||||||
|
|
||||||
|
p0 = max(p, 3)
|
||||||
|
rn <- n^(-1 / (2 * (p0 + 6)))
|
||||||
|
h <- c0 * n^(-1 / (p0 + 6))
|
||||||
|
sig <- diag(var(X))
|
||||||
|
X <- scale(X)
|
||||||
|
kmat <- kern(X, h)
|
||||||
|
bmat <- numeric()
|
||||||
|
for (i in 1:n) {
|
||||||
|
wi <- kmat[, i]
|
||||||
|
xi <- cbind(1, t(t(X) - X[i, ]))
|
||||||
|
bmat <- cbind(bmat, wls(xi, Y, wi)$b)
|
||||||
|
}
|
||||||
|
beta <- eigen(bmat %*% t(bmat))$vectors[, 1:k]
|
||||||
|
B <- diag(sig^(-1 / 2)) %*% beta
|
||||||
|
|
||||||
|
return(structure(list(B = B), class = "opg"))
|
||||||
|
}
|
||||||
|
|
||||||
|
rOPG <- function(X, Y, k, nr.iter = 25L, c0 = 2.34) {
|
||||||
|
# Cast parameters.
|
||||||
|
X <- as.matrix(X); Y <- as.vector(Y)
|
||||||
|
# Get Dimensions
|
||||||
|
n <- nrow(X); p <- ncol(X)
|
||||||
|
|
||||||
|
sig <- diag(var(X))
|
||||||
|
X <- scale(X)
|
||||||
|
|
||||||
|
p0 = max(p, 3)
|
||||||
|
rn <- n^(-1 / (2 * (p0 + 6)))
|
||||||
|
h <- c0 * n^(-1 / (p0 + 6))
|
||||||
|
beta <- diag(p)
|
||||||
|
for(. in seq_len(nr.iter)) {
|
||||||
|
kmat <- kern(X %*% beta, h)
|
||||||
|
bmat <- numeric()
|
||||||
|
for (i in 1:n) {
|
||||||
|
wi <- kmat[, i]
|
||||||
|
xi <- cbind(1, t(t(X) - X[i, ]))
|
||||||
|
bmat <- cbind(bmat, wls(xi, Y, wi)$b)
|
||||||
|
}
|
||||||
|
beta <- eigen(bmat %*% t(bmat))$vectors[, 1:k, drop = F]
|
||||||
|
h <- max(rn * h, c0 * n^((-1/(k + 4))))
|
||||||
|
}
|
||||||
|
B <- diag(sig^(-1/2)) %*% beta
|
||||||
|
|
||||||
|
return(structure(list(B = B), class = "opg"))
|
||||||
|
}
|
||||||
|
|
||||||
|
rmave <- function(X, Y, k, nr.iter = 25L, c0 = 2.34) {
|
||||||
|
X <- as.matrix(X); Y <- as.vector(Y)
|
||||||
|
# Get dimensions.
|
||||||
|
n <- nrow(X); p <- ncol(X)
|
||||||
|
|
||||||
|
X <- scale(X)
|
||||||
|
sig <- attr(X, "scaled:scale")^2 # diag(var(X))
|
||||||
|
|
||||||
|
p0 <- max(p,3)
|
||||||
|
h <- c0 * n^(-1 / (p0 + 6))
|
||||||
|
rn <- n^(-1 / (2 * (p0 + 6)))
|
||||||
|
|
||||||
|
beta <- OPG(X, Y, k)$B
|
||||||
|
for (. in seq_len(nr.iter)) {
|
||||||
|
kermat <- kern(X %*% beta, h)
|
||||||
|
mkermat <- colMeans(kermat)
|
||||||
|
|
||||||
|
a <- numeric()
|
||||||
|
b <- numeric()
|
||||||
|
for (i in 1:n) {
|
||||||
|
wi <- kermat[ ,i] / mkermat[i]
|
||||||
|
ui <- cbind(1, t(t(X) - X[i, ]) %*% beta)
|
||||||
|
out <- wls(ui, Y, wi)
|
||||||
|
a <- c(a, out$a)
|
||||||
|
b <- cbind(b, out$b)
|
||||||
|
}
|
||||||
|
out <- 0
|
||||||
|
out1 <- 0
|
||||||
|
for (i in 1:n) {
|
||||||
|
xi <- kronecker(t(t(X) - X[i, ]), t(b[, i]))
|
||||||
|
yi <- Y - a[i]
|
||||||
|
wi <- kermat[, i] / mkermat[i]
|
||||||
|
out <- out + colMeans(xi * yi * wi)
|
||||||
|
out1 <- out1 + t(xi * wi) %*% xi / n
|
||||||
|
}
|
||||||
|
beta <- t(matrix(solve(out1, out), k, p))
|
||||||
|
h <- max(rn * h, c0 * n^(-1 / (k + 4)))
|
||||||
|
}
|
||||||
|
B <- diag(sig^-0.5) %*% beta
|
||||||
|
|
||||||
|
return(structure(list(B = B), class = "opg"))
|
||||||
|
}
|
||||||
|
|
||||||
|
meanOPG <- function(X, Y, max.dim = 10L, ...) {
|
||||||
|
mave(Y ~ X, method = "meanOPG", max.dim = max.dim)
|
||||||
|
}
|
||||||
|
|
||||||
|
meanMAVE <- function(X, Y, max.dim = 10L, ...) {
|
||||||
|
mave(Y ~ X, method = "meanMAVE", max.dim = max.dim)
|
||||||
|
}
|
||||||
|
|
||||||
|
coef.phdy <- coef.sir <- coef.save <- function(object, k, ...) {
|
||||||
|
matrix(object$evectors[, 1:k], ncol = k)
|
||||||
|
}
|
||||||
|
|
||||||
|
phdy <- function(X, Y, k, ...) {
|
||||||
|
dr(Y ~ X, method = 'phdy', numdir = k)
|
||||||
|
}
|
||||||
|
|
||||||
|
sir <- function(X, Y, k, ...) {
|
||||||
|
dr(Y ~ X, method = 'sir', numdir = k)
|
||||||
|
}
|
||||||
|
|
||||||
|
save <- function(X, Y, k, ...) {
|
||||||
|
dr(Y ~ X, method = 'save', numdir = k)
|
||||||
|
}
|
Loading…
Reference in New Issue