Browse Source

- add: simulations,

- wip: README
tags/0.2
Daniel Kapla 1 year ago
parent
commit
c437c60e84
9 changed files with 823 additions and 441 deletions
  1. +24
    -441
      README.md
  2. +80
    -0
      simulations/hitters.R
  3. +108
    -0
      simulations/hitters_logging.R
  4. +61
    -0
      simulations/method_compair.R
  5. +73
    -0
      simulations/method_compair_M2.R
  6. +69
    -0
      simulations/plots.R
  7. +54
    -0
      simulations/predict_dim_compair.R
  8. +128
    -0
      simulations/runtime_test.R
  9. +226
    -0
      simulations/simulation_interfaces.R

+ 24
- 441
README.md View File

@@ -1,451 +1,34 @@
# TODOs
Doc:
- [x] Stiefel (instead of Stiefl)
- [x] Return value description (`@returs`)
- [x] DESCRIPTION
- [x] Maintainer
- [x] Author
- [x] Volume
- [x] Description (from Paper) and Ref.
- [x] Ref paper in doc
- [ ] Data set descriptions and augmentations.
- [x] Demonstration of the `Logger` function usage (Demo file or so, ...)
- [ ] Update Paper (to new version / version consistent with current code!)
- [ ] Reference Paper in DESCRIPTION file (in Description or specific tag)
- [ ] Split `cve` and `cve.call` docs.
- [ ] "Copy" form `dr` package (specifically `dr.directions` -> description)
- [ ] Document `C` code.

Methods to be implemented:
- [x] simple
- [x] weighted
- [x] momentum
- [x] weighted with momentum

Performance:
- [x] Pure C implementation.
- [NOT Feasible] Stochastic Version
- [NOT Feasible] Gradient Approximations (using Algebraic Software for alternative Loss function formulations and gradient optimizations)
- [NOT Sufficient] Alternative Kernels for reducing samples
- [x] (To Be further investigated) "Kronecker" optimization
- [ ] Implement "Kronecker" optimization

Features (functions):
- [x] Initial `V.init` parameter (only ONE try, ignore number of `attempts` parameter)
- [x] `basis.cve` list of estimated `B`s (with `k` supplied, only `B`)
- [x] `directions.cve` Projected `X` given `k`
- [x] `predict.cve` using `mars` for predicting responses given new data.
- [x] `predict.dim.cve` Cross-validation or `aov` (in stats package) or "elbow" estimation
- [x] `plot.elbow`
- [x] `summary`
- [ ] Consider `cor.test` for dimension selection
- [x] Check for user interrupt (`R_CheckUserInterrupt`)

Changes:
- [x] New `estimate.bandwidth` implementation.
(h = 2 * (tr(\Sigma) / p) * (6/5 * n^(-1 / (4 + k)))^2,
\Sigma = 1/n * (X-mean)'(X-mean))

Errors:
- [x] `CVE_C` compare to `CVE_legacy`.
- [x] fix: `predict.dim` not found.
- [ ] fix/check: error computation for break condition (in `cve.c`)

# Development
## Build and install.
To build the package the `devtools` package is used. This also provides `roxygen2` which is used for documentation and automatic creation of the `NAMESPACE` file.
```R
setwd("./CVE_R") # Set path to the package root.
library(devtools) # Load required `devtools` package.
document() # Create `.Rd` files and write `NAMESPACE`.
```
Next the package needs to be build, therefore (if pure `R` package, aka. `C/C++`, `Fortran`, ... code) just do the following.
```bash
R CMD build CVE_C; R CMD INSTALL CVE_0.2.tar.gz
```
Then we are ready for using the package.
As well as building the `NAMESPACE` and `*.Rd` files using `devtools` (`roxygen2`) the following resembles an entire build pipeline including checks.
```bash
R -q -e 'library(devtools); setwd("CVE_C"); pkgbuild::compile_dll(); document(); pkgbuild::clean_dll()'
R CMD build CVE_C; R CMD check CVE_0.2.tar.gz;
R CMD INSTALL CVE_0.2.tar.gz
```
## Build and install from within `R`.
An alternative approach is the following.
```R
## Installing CVE (C implementation)
(setwd('~/Projects/CVE/CVE_C'))
# equiv to Rcpp::compileAttributes().
library(devtools)
pkgbuild::compile_dll() # required for packages with C/C++ code
document() # See bug: https://github.com/stan-dev/rstantools/issues/52
pkgbuild::clean_dll()
(path <- build(vignettes = FALSE))
install.packages(path, repos = NULL, type = "source")
library(CVE)
```
**Note: I only recommend this approach during development.**

# Package Structure

## Demos
A demo is an `.R` file that lives in `demo/`. Demos are like examples but tend to
be longer. Instead of focusing on a single function, they show how to weave
together multiple functions to solve a problem.

You list and access demos with `demo()`:
* Show all available demos: `demo()`.
* Show all demos in a package: `demo(package = "CVE")`.
* Run a specific demo: `demo("runtime_test", package = "CVE")`.
* Find a demo: `system.file("demo", "runtime_test.R", package = "CVE")`.

Each demo must be listed in `demo/00Index` in the following form:
`demo-name Demo description`.
The demo name is the name of the file without the extension,
e.g. `demo/runtime_test.R` becomes `runtime_test`.

By default the demo ask for human input for each plot: "Hit to see next plot".
This behavior can be overridden by adding `devAskNewPage(ask = FALSE)` to
the demo file. You can add pauses by adding:
`readline("press any key to continue")`.

**Note**: Demos are not automatically tested by `R CMD check`. This means that they
can easily break without your knowledge.


# General Notes for Source Code analysis
## Search in multiple files.
Using the Linux `grep` program with the parameters `-rnw` and specifying a include files filter like the following example.
```bash
grep --include=*\.{c,h,R} -rnw '.' -e "sweep"
```
searches in all `C` source and header files as well as `R` source files for the term _sweep_.

## Recursive directory compare with colored structure (more or less).
```bash
diff -r CVE_R/ CVE_C/ | grep -E "^([<>]|[^<>].*)"
```

## Parsing `bash` script parameters.
```bash
usage="$0 [-v|--verbose] [-n|--dry-run] [(-s|--stack-size) <size>] [-h|--help] [-- [p1, [p2, ...]]]"
verbose=false
help=false
dry_run=false
stack_size=0

while [ $# -gt 0 ]; do
case "$1" in
-v | --verbose ) verbose=true; shift ;;
-n | --dry-run ) dry_run=true; shift ;;
-s | --stack-size ) stack_size="$2"; shift; shift ;;
-h | --help ) echo $usage; exit ;; # On help print usage and exit.
-- ) shift; break ;; # Break param "parsing".
* ) echo $usage >&2; exit 1 ;; # Print usage and exit with failure.
esac
done

echo verbose=$verbose
echo dry_run=$dry_run
echo stack_size=$stack_size
```

# Analysis
## Logging (a `cve` run).
To log `loss`, `error` (estimated) the true error (error of current estimated `B` against the true `B`) or even the step size one can use the `logger` parameter. A `logger` is a function that gets the current `environment` of the CVE optimization methods (__do not alter this environment, only read from it__). This can be used to create logs like in the following example.
```R
library(CVE)

# Setup histories.
(epochs <- 50)
(attempts <- 10)
loss.history <- matrix(NA, epochs + 1, attempts)
error.history <- matrix(NA, epochs + 1, attempts)
tau.history <- matrix(NA, epochs + 1, attempts)
true.error.history <- matrix(NA, epochs + 1, attempts)

# Create a dataset
ds <- dataset("M1")
X <- ds$X
Y <- ds$Y
B <- ds$B # the true `B`
(k <- ncol(ds$B))

# True projection matrix.
P <- B %*% solve(t(B) %*% B) %*% t(B)
# Define the logger for the `cve()` method.
logger <- function(env) {
# Note the `<<-` assignement!
loss.history[env$epoch + 1, env$attempt] <<- env$loss
error.history[env$epoch + 1, env$attempt] <<- env$error
tau.history[env$epoch + 1, env$attempt] <<- env$tau
# Compute true error by comparing to the true `B`
B.est <- null(env$V) # Function provided by CVE
P.est <- B.est %*% solve(t(B.est) %*% B.est) %*% t(B.est)
true.error <- norm(P - P.est, 'F') / sqrt(2 * k)
true.error.history[env$epoch + 1, env$attempt] <<- true.error
}
# Performa SDR
dr <- cve(Y ~ X, k = k, logger = logger, epochs = epochs, attempts = attempts)
# Plot history's
par(mfrow = c(2, 2))
matplot(loss.history, type = 'l', log = 'y', xlab = 'iter',
main = 'loss', ylab = expression(L(V[iter])))
matplot(error.history, type = 'l', log = 'y', xlab = 'iter',
main = 'error', ylab = 'error')
matplot(tau.history, type = 'l', log = 'y', xlab = 'iter',
main = 'tau', ylab = 'tau')
matplot(true.error.history, type = 'l', log = 'y', xlab = 'iter',
main = 'true error', ylab = 'true error')
```

## Reading log files.
The run-time tests (upcoming further tests) are creating log files saved in `tmp/`. These log files are `CSV` files (actually `TSV`) with a header storing the test results. Depending on the test the files may contain different data. As an example we use the run-time test logs which store in each line the `dataset`, the used `method` as well as the `error` (actual error of estimated `B` against real `B`) and the `time`. For reading and analyzing the data see the following example.
```R
# Load log as `data.frame`
log <- read.csv('tmp/test0.log', sep = '\t')
# Create a error boxplot grouped by dataset.
boxplot(error ~ dataset, log)

# Overview
for (ds.name in paste0('M', seq(5))) {
ds <- subset(log, dataset == ds.name, select = c('method', 'dataset', 'time', 'error'))
print(summary(ds))
}
```

## Environments and variable lookup.
In the following a view simple examples of how `R` searches for variables.
In addition we manipulate function closures to alter the search path in variable lookup and outer scope variable manipulation.
# Installation Instructions
The package depends on the `mda` package (see: [mda @ cran.r-project](https://cran.r-project.org/web/packages/mda/index.html). Therefore the first step is to install the `mda` package
```R
droids <- "These aren't the droids you're looking for."

search <- function() {
print(droids)
}

trooper.seeks <- function() {
droids <- c("R2-D2", "C-3PO")
search()
}

jedi.seeks <- function() {
droids <- c("R2-D2", "C-3PO")
environment(search) <- environment()
search()
}

trooper.seeks()
# [1] "These aren't the droids you're looking for."
jedi.seeks()
# [1] "R2-D2", "C-3PO"
install.packages("mda")
```
A release will be available in a few days.

The next example illustrates how to write (without local copies) to variables outside the functions local environment.
```R
counting <- function() {
count <<- count + 1 # Note the `<<-` assignment.
}

(function() {
environment(counting) <- environment()
count <- 0

for (i in 1:10) {
counting()
}

return(count)
})()

(function () {
closure <- new.env()
environment(counting) <- closure
assign("count", 0, envir = closure)
- Linux: Not available jet (one or two days, state 17.12.2019)
- Windows: Not available jet (one or two days, state 17.12.2019)
- MacOS: Not available jet (unknown)

for (i in 1:10) {
counting()
}

return(closure$count)
})()
```

Another example for the usage of `do.call` where the evaluation of parameters is illustrated (example taken (and altered) from `?do.call`).
```R
## examples of where objects will be found.
A <- "A.Global"
f <- function(x) print(paste("f.new", x))
env <- new.env()
assign("A", "A.new", envir = env)
assign("f", f, envir = env)
f <- function(x) print(paste("f.Global", x))
f(A) # f.Global A.Global
do.call("f", list(A)) # f.Global A.Global
do.call("f", list(A), envir = env) # f.new A.Global
do.call(f, list(A), envir = env) # f.Global A.Global
do.call("f", list(quote(A)), envir = env) # f.new A.new
do.call(f, list(quote(A)), envir = env) # f.Global A.new
do.call("f", list(as.name("A")), envir = env) # f.new A.new
do.call("f", list(as.name("A")), envir = env) # f.new A.new
```

# Performance benchmarks
In this section alternative implementations of simple algorithms are compared for there performance.

### Computing the trace of a matrix multiplication.
Open `R` and then the following:
```R
library(microbenchmark)

A <- matrix(runif(120), 12, 10)

# Check correctnes and benckmark performance.
stopifnot(
all.equal(
sum(diag(t(A) %*% A)), sum(diag(crossprod(A, A)))
),
all.equal(
sum(diag(t(A) %*% A)), sum(A * A)
)
)
microbenchmark(
MM = sum(diag(t(A) %*% A)),
cross = sum(diag(crossprod(A, A))),
elem = sum(A * A)
)
# Unit: nanoseconds
# expr min lq mean median uq max neval
# MM 4232 4570.0 5138.81 4737 4956.0 40308 100
# cross 2523 2774.5 2974.93 2946 3114.5 5078 100
# elem 582 762.5 973.02 834 964.0 12945 100
# addapt to download file.
install.packages("path/to/cve_0.2.<end>", repos = NULL)
library(CVE) # Test installation.
```
Please consult the man-pages `?install.package` and `?library` for further information.

```R
n <- 200
M <- matrix(runif(n^2), n, n)

dnorm2 <- function(x) exp(-0.5 * x^2) / sqrt(2 * pi)

stopifnot(
all.equal(dnorm(M), dnorm2(M))
)
microbenchmark(
dnorm = dnorm(M),
dnorm2 = dnorm2(M),
exp = exp(-0.5 * M^2) # without scaling -> irrelevant for usage
)
# Unit: microseconds
# expr min lq mean median uq max neval
# dnorm 841.503 843.811 920.7828 855.7505 912.4720 2405.587 100
# dnorm2 543.510 580.319 629.5321 597.8540 607.3795 2603.763 100
# exp 502.083 535.943 577.2884 548.3745 561.3280 2113.220 100
```

### Using `crosspord()`
```R
p <- 12
q <- 10
V <- matrix(runif(p * q), p, q)

stopifnot(
all.equal(V %*% t(V), tcrossprod(V)),
all.equal(V %*% t(V), tcrossprod(V, V))
)
microbenchmark(
V %*% t(V),
tcrossprod(V),
tcrossprod(V, V)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# V %*% t(V) 2.293 2.6335 2.94673 2.7375 2.9060 19.592 100
# tcrossprod(V) 1.148 1.2475 1.86173 1.3440 1.4650 30.688 100
# tcrossprod(V, V) 1.003 1.1575 1.28451 1.2400 1.3685 2.742 100
```

### Recycling vs. Sweep
```R
(n <- 200)
(p <- 12)
(q <- 10)
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
V <- matrix(rnorm(p * q), p, q)
vecS <- runif(n * (n - 1) / 2)

stopifnot(
all.equal((X_diff %*% V) * rep(vecS, q),
sweep(X_diff %*% V, 1, vecS, `*`)),
all.equal((X_diff %*% V) * rep(vecS, q),
(X_diff %*% V) * vecS)
)
microbenchmark(
rep = (X_diff %*% V) * rep(vecS, q),
sweep = sweep(X_diff %*% V, 1, vecS, `*`, check.margin = FALSE),
recycle = (X_diff %*% V) * vecS
)
# Unit: microseconds
# expr min lq mean median uq max neval
# rep 851.723 988.3655 1575.639 1203.6385 1440.578 18999.23 100
# sweep 1313.177 1522.4010 2355.269 1879.2605 2065.399 18783.24 100
# recycle 719.001 786.1265 1157.285 881.8825 1163.202 19091.79 100
```
### Scaled `crossprod` with matrix multiplication order.
```R
(n <- 200)
(p <- 12)
(q <- 10)
X_diff <- matrix(runif(n * (n - 1) / 2 * p), n * (n - 1) / 2, p)
V <- matrix(rnorm(p * q), p, q)
vecS <- runif(n * (n - 1) / 2)

ref <- crossprod(X_diff, X_diff * vecS) %*% V
stopifnot(
all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS)),
all.equal(ref, crossprod(X_diff, (X_diff %*% V) * vecS))
)
microbenchmark(
inner = crossprod(X_diff, X_diff * vecS) %*% V,
outer = crossprod(X_diff, (X_diff %*% V) * vecS)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# inner 789.065 867.939 1683.812 987.9375 1290.055 16800.265 100
# outer 1141.479 1216.929 1404.702 1317.7315 1582.800 2531.766 100
## Installing Source
Cloning the `CVE` repository and using `R`'s build and install routines from a terminal.
```bash
git clone https://git.art-ist.cc/daniel/CVE.git # Clone repository
cd CVE # Go into the repository
R CMD build CVE # Build package tarbal
R CMD INSTALL CVE_0.2.tar.gz # Install package
```

### Fast dist matrix computation (aka. row sum of squares).
```R
library(microbenchmark)
library(CVE)

(n <- 200)
(N <- n * (n - 1) / 2)
(p <- 12)
M <- matrix(runif(N * p), N, p)

stopifnot(
all.equal(rowSums(M^2), rowSums.c(M^2)),
all.equal(rowSums(M^2), rowSquareSums.c(M))
)
microbenchmark(
sums = rowSums(M^2),
sums.c = rowSums.c(M^2),
sqSums.c = rowSquareSums.c(M)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# sums 666.311 1051.036 1612.3100 1139.0065 1547.657 13940.97 100
# sums.c 342.647 672.453 1009.9109 740.6255 1224.715 13765.90 100
# sqSums.c 115.325 142.128 175.6242 153.4645 169.678 759.87 100
```
### Windows / macOS
Installing from source (for any package which contains compiled code, in our case `C`) on Windows and MacOS requires additional tools.
See _R Installation and Administration_ from [r-project manuals](https://cran.r-project.org/manuals.html).

## Using `Rprof()` for performance.
The standard method for profiling where an algorithm is spending its time is with `Rprof()`.
```R
path <- '../tmp/R.prof' # path to profiling file
Rprof(path)
cve.res <- cve.call(X, Y, k = k)
Rprof(NULL)
(prof <- summaryRprof(path)) # Summarise results
```
**Note: consider to run `gc()` before measuring**, aka cleaning up by explicitly calling the garbage collector.
# Repository Structure
The repository is structured in two directories, the `CVE/` directory which is the `R` package root and `simulations/` where all simulation scripts can be found (and `README.md` which is this).

+ 80
- 0
simulations/hitters.R View File

@@ -0,0 +1,80 @@
library(ISLR) # for Hitters dataset
library(MAVE)
library(CVE)

# Set global parameters.
seed <- 21
max.dim <- 5L
attempts <- 25L
max.iter <- 100L
# momentum <- 0.0

# Prepair data for analysis.
cols <- c("Salary", "AtBat", "Hits", "HmRun", "Runs", "RBI",
"Walks", "Years", "CAtBat","CHits", "CHmRun","CRuns",
"CRBI", "CWalks", "PutOuts", "Assists", "Errors")
outliers <- c(92, # Gary Pettis
120, # John Moses
173, # Milt Thompson
189, # Rick Burleson
220, # Scott Fletcher
230, # Tom Foley
241) # Terry Puhl
# Subselect as a matrix without outliers as well as reordered
# and filtered columns.
ds <- na.omit(Hitters[, cols])[-outliers, ]
ds$Salary <- log(ds$Salary)
ds <- scale(ds, center = TRUE, scale = TRUE)
# Split into data and responce.
X <- as.matrix(ds[, colnames(ds) != "Salary"])
Y <- as.matrix(ds[, "Salary"])


pdf("results/hitters.pdf", width = 15, height = 5)
layout(matrix(c(1, 2), nrow = 1))
cat("Seed: ", seed, "\n\n")
################################################################################
### meanMAVE ###
################################################################################
set.seed(seed)

dr <- mave(Y ~ X, method = "meanMAVE", max.dim = max.dim)
dim <- which.min(mave.dim(dr)$cv)
B <- coef(dr, 2L)
# train linear model on reduced data
data <- as.data.frame(cbind(Y, X %*% B))
colnames(data) <- c("Y", "dir1", "dir2")
model <- lm(Y ~ dir1 + I(dir1^2) + dir2, data = data)
# Print and log.
cat("### meanMAVE\n\nest.dim: ", dim, '\n\n')
summary(model)
plot(x = data[, "dir1"], y = data[, "Y"],
main = "meanMave", xlab = "dir1", ylab = "Y")
plot(x = data[, "dir2"], y = data[, "Y"],
main = "meanMave", xlab = "dir2", ylab = "Y")

################################################################################
### CVE ###
################################################################################
set.seed(seed)

dr <- cve(Y ~ X, max.dim = max.dim, max.iter = max.iter, attempts = attempts)
dim <- predict_dim(dr, method = "cv")$k
B <- coef(dr, 2L)
# Determine 1st direction (cause CVE does not care for column order).
if (which.max(abs(t(coef(dr, 1L)) %*% B)) == 2) {
B <- B[, c(2, 1)] # switch directions
}
# train linear model on reduced data
data <- as.data.frame(cbind(Y, X %*% B))
colnames(data) <- c("Y", "dir1", "dir2")
model <- lm(Y ~ dir1 + I(dir1^2) + dir2, data = data)
# Print and log.
cat("### CVE\n\nest.dim: ", dim, '\n\n')
summary(model)
plot(x = data[, "dir1"], y = data[, "Y"],
main = "CVE", xlab = "dir1", ylab = "Y")
plot(x = data[, "dir2"], y = data[, "Y"],
main = "CVE", xlab = "dir2", ylab = "Y")

dev.off()

+ 108
- 0
simulations/hitters_logging.R View File

@@ -0,0 +1,108 @@
library(ISLR) # for Hitters dataset
library(MAVE)
library(CVE)

# Set global parameters.
seed <- 21
max.dim <- 5L
attempts <- 25L
max.iter <- 100L
momentum <- 0.0
name <- "Hitters"
set.seed(seed)

# Prepair data for analysis.
cols <- c("Salary", "AtBat", "Hits", "HmRun", "Runs", "RBI",
"Walks", "Years", "CAtBat","CHits", "CHmRun","CRuns",
"CRBI", "CWalks", "PutOuts", "Assists", "Errors")
outliers <- c(92, # Gary Pettis
120, # John Moses
173, # Milt Thompson
189, # Rick Burleson
220, # Scott Fletcher
230, # Tom Foley
241) # Terry Puhl
# Subselect as a matrix without outliers as well as reordered
# and filtered columns.
ds <- na.omit(Hitters[, cols])[-outliers, ]
ds$Salary <- log(ds$Salary)
ds <- scale(ds, center = TRUE, scale = TRUE)
# Split into data and responce.
X <- as.matrix(ds[, colnames(ds) != "Salary"])
Y <- as.matrix(ds[, "Salary"])

path <- file.path(getwd(), 'results', 'hitters_logger.pdf')
pdf(path, width = 8.27, height = 11.7) # width, height unit is inces -> A4
layout(matrix(c(1, 1,
2, 3,
4, 5), nrow = 3, byrow = TRUE))


# Setup histories.
loss.history <- matrix(NA, max.iter + 1, attempts)
error.history <- matrix(NA, max.iter + 1, attempts)
tau.history <- matrix(NA, max.iter + 1, attempts)
grad.norm.history <- matrix(NA, max.iter + 1, attempts)
# Define logger for `cve()` method.
logger <- function(attempt, iter, data) {
# Note the `<<-` assignement!
loss.history[iter + 1, attempt] <<- data$loss
error.history[iter + 1, attempt] <<- data$err
tau.history[iter + 1, attempt] <<- data$tau
grad.norm.history[iter + 1, attempt] <<- norm(data$G, 'F')
}

dr <- cve(Y ~ X, k = 2L, max.iter = max.iter, attempts = attempts,
logger = logger)
B <- coef(dr, 2L)
loss <- dr$res[['2']]$loss



textplot <- function(...) {
text <- unlist(list(...))
if (length(text) > 20) {
text <- c(text[1:17],
' ...... (skipped, text too long) ......',
text[c(-1, 0) + length(text)])
}

plot(NA, xlim = c(0, 1), ylim = c(0, 1),
bty = 'n', xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')
for (i in seq_along(text)) {
text(0, 1 - (i / 20),
text[[i]], pos = 4)
}
}
# Write metadata.
textplot(
paste0("Seed value: ", seed),
"",
paste0("Dataset Name: ", name),
paste0("dim(X) = (", nrow(X), ", ", ncol(X), ")"),
paste0("dim(B) = (", nrow(B), ", ", ncol(B), ")"),
"",
paste0("CVE method: ", dr$method),
paste0("Max Iterations: ", max.iter),
paste0("Attempts: ", attempts),
paste0("Momentum: ", momentum),
"CVE call:",
paste0(" > ", format(dr$call)),
"",
paste0("True Error: ", NA),
paste0("loss: ", round(loss, 3))
)
# Plot history's
matplot(loss.history, type = 'l', log = 'y', xlab = 'i (iteration)',
main = paste('loss', name),
ylab = expression(L(V[i])))
matplot(grad.norm.history, type = 'l', log = 'y', xlab = 'i (iteration)',
main = paste('gradient norm', name),
ylab = expression(group('|', paste(nabla, L(V)), '|')[F]))
matplot(error.history, type = 'l', log = 'y', xlab = 'i (iteration)',
main = paste('error', name),
ylab = expression(group('|', V[i-1]*V[i-1]^T - V[i]*V[i]^T, '|')[F]))
matplot(tau.history, type = 'l', log = 'y', xlab = 'i (iteration)',
main = paste('learning rate', name),
ylab = expression(tau[i]))

+ 61
- 0
simulations/method_compair.R View File

@@ -0,0 +1,61 @@
# Compairson of differend dimension reduction (dr) methods over all datasets
# with default parametrs.
#
# Results are written to a CSV file with a dataset column and its simulation nr.
#
# Note:
# All methods are called via interface functions defined in
# 'simulation_interfaces.R'. In addition this file provides a few convenience
# functions, namely: 'progress_logger' and 'subspace_dist'.
#
# Dependencies (list of packages loaded):
# - MASS
# - dr
# - MAVE
# - CVE
source('simulation_interfaces.R') # depends on MASS, dr, MAVE and CVE

# Number of simulations
NR.SIM <- 100L
DS <- 1L:7L # datasets run from 1 to 7.

# Methods for comparison, see 'simulation_interfaces.R' for their definitions.
methods <- c('CVE', 'wCVE', 'rCVE', # CVE variants
'meanOPG', 'rOPG', # OPG variants
'meanMAVE', 'rmave', # MAVE variants
'phdy', 'sir', 'save') # Methods from dr package

# Build result data matrix (repeat each dataset simulation times).
result <- matrix(NA, nrow = length(DS) * NR.SIM,
ncol = length(methods) + 1) # +1 for dataset column.
result[, 1L] <- rep(paste0("M", DS), NR.SIM)
colnames(result) <- c("dataset", methods)

# Create a progress logger for length many reports.
logger <- progress_logger(length(DS) * NR.SIM * length(methods))
for (sim in seq_len(nrow(result))) {
# Create new dataset.
with(dataset(result[sim, 1L]), { X <<- X; Y <<- Y; B <<- B })

# And each reference method.
for (method in methods) {
# Report simulation progress to user.
logger(method)

# Try/Catch to avoid simulation stop on error, this should not
# accure for CVE but may be the case for others like OPG.
tryCatch({
# Call dimension reduction method.
dr <- eval(call(method, X = X, Y = Y, k = ncol(B)))
# Compute distance of estimated `B` to true `B`.
result[sim, method] <- subspace_dist(B, coef(dr, ncol(B)))
}, error = function(e) {}) # No-Operation error handler!
}
# explicit call to the garbage collector.
gc()
}

path <- paste0(getwd(),
format(Sys.time(), '/results/method_compair_%Y-%m-%dT%H%M%S'))
# Write entire simulation results into a single file.
write.csv(result, file = paste0(path, ".csv"), row.names = FALSE)

+ 73
- 0
simulations/method_compair_M2.R View File

@@ -0,0 +1,73 @@
# Specialized comparison for dataset M2 with others than the default parameters.
#
# Results are written to a CSV file with a dataset column and its simulation nr.
#
# Note:
# All methods are called via interface functions defined in
# 'simulation_interfaces.R'. In addition this file provides a few convenience
# functions, namely: 'progress_logger' and 'subspace_dist'.
#
# Dependencies (list of packages loaded):
# - MASS
# - dr
# - MAVE
# - CVE
source('simulation_interfaces.R') # depends on MASS, dr, MAVE and CVE

# Number of simulations
NR.SIM <- 100L

# Dataset param grid
grid <- expand.grid(
pmix = c(0.3, 0.4, 0.5),
lambda = c(0, 0.5, 1, 1.5)
)

# Methods for comparison, see 'simulation_interfaces.R' for their definitions.
methods <- c('CVE', 'wCVE', 'rCVE', 'meanOPG', 'meanMAVE')

# Build result data matrix (repeat each dataset simulation times).
result <- matrix(NA, nrow = nrow(grid) * NR.SIM,
ncol = ncol(grid) + length(methods))
colnames(result) <- c(colnames(grid), methods)

# Create a progress logger for length many reports.
logger <- progress_logger(nrow(grid) * NR.SIM * length(methods))
count <- 0
for (sim in seq_len(NR.SIM)) {
for (i in seq_len(nrow(grid))) {
# Create new dataset.
pmix <- grid[i, "pmix"]
lambda <- grid[i, "lambda"]

with(dataset("M2", pmix = pmix, lambda = lambda), {
X <<- X; Y <<- Y; B <<- B
})

count <- count + 1
result[count, "pmix"] <- pmix
result[count, "lambda"] <- lambda

# And each reference method.
for (method in methods) {
# Report simulation progress to user.
logger(method)

# Try/Catch to avoid simulation stop on error, this should not
# accure for CVE but may be the case for others like OPG.
tryCatch({
# Call dimension reduction method.
dr <- eval(call(method, X = X, Y = Y, k = ncol(B)))
# Compute distance of estimated `B` to true `B`.
result[count, method] <- subspace_dist(B, coef(dr, ncol(B)))
}, error = function(e) {}) # No-Operation error handler!
}
}
# explicit call to the garbage collector.
gc()
}

path <- paste0(getwd(),
format(Sys.time(), '/results/method_compair_M2_%Y-%m-%dT%H%M%S'))
# Write entire simulation results into a single file.
write.csv(result, file = paste0(path, ".csv"), row.names = FALSE)

+ 69
- 0
simulations/plots.R View File

@@ -0,0 +1,69 @@
### Method Compair plots.
for (file in list.files('results', pattern = 'method_compair_[0-9T-]+\\.csv')) {
# Read simulation result data set.
ds <- read.csv(file.path('results', file), stringsAsFactors = FALSE)
# Get dataset groups
datasets <- unlist(unique(ds["dataset"]))
# All seperate dataset plots into one file.
pdf(file.path('results', sub('.csv', '.pdf', file, fixed = TRUE)))
for (ds.name in datasets) {
boxplot(ds[ds["dataset"] == ds.name, colnames(ds) != "dataset"],
main = ds.name, ylab = 'Error', las = 2)
}
dev.off()
# Plot each dataset seperately.
for (ds.name in datasets) {
pdf(file.path('results', sub('.csv',
paste0('_', ds.name, '.pdf'),
file, fixed = TRUE)))
boxplot(ds[ds["dataset"] == ds.name, colnames(ds) != "dataset"],
main = ds.name, ylab = 'Error', las = 2)
dev.off()
}
}

### Method Compair plots for parameterized M2.
for (file in list.files('results', 'method_compair_M2_[0-9T-]+\\.csv')) {
# Read simulation results.
ds <- read.csv(file.path('results', file), stringsAsFactors = FALSE)
# Extract M2 dataset parameters.
pmix <- unique(ds$pmix)
lambda <- unique(ds$lambda)
# setup pdf file and plot layout.
pdf(file.path('results', sub('.csv', '.pdf', file, fixed = TRUE)),
width = 15, height = 10)
layout(matrix(seq(length(pmix) * length(lambda)),
ncol = length(lambda), byrow = TRUE))
for (p in pmix) {
for (l in lambda) {
boxplot(ds[ds$pmix == p & ds$lambda == l,
!(colnames(ds) %in% c("pmix", "lambda"))],
main = substitute(expression(paste(
lambda, ' = ', l, ', ', p[mix], ' = ', q
)), list(l = l, q = p)))
}
}
dev.off()
}

### Analyse predict dimension simulation.
for (file in list.files('results', 'predict_dim_compair_[0-9T-]+\\.csv')) {
# Read simulation result data set.
ds <- read.csv(file.path('results', file), stringsAsFactors = FALSE)
# Over all datasets (including M2 with default parameters)
true.dim <- ds$true.dim
dataset <- ds$dataset
df <- ds[!(colnames(ds) %in% c("true.dim", "dataset"))]
accuracy <- data.frame(
all = colMeans(df == true.dim),
M1 = colMeans(df[dataset == "M1", ] == true.dim[dataset == "M1"]),
M2 = colMeans(df[dataset == "M2", ] == true.dim[dataset == "M2"]),
M3 = colMeans(df[dataset == "M3", ] == true.dim[dataset == "M3"]),
M4 = colMeans(df[dataset == "M4", ] == true.dim[dataset == "M4"]),
M5 = colMeans(df[dataset == "M5", ] == true.dim[dataset == "M5"]),
M6 = colMeans(df[dataset == "M6", ] == true.dim[dataset == "M6"]),
M7 = colMeans(df[dataset == "M7", ] == true.dim[dataset == "M7"])
)
cat("Accuracy of simulation:", file, '\n')
print(accuracy)
}

+ 54
- 0
simulations/predict_dim_compair.R View File

@@ -0,0 +1,54 @@
# Comparison of dimension prediction methods including maves cross-validation
# with meanMAVE as basis for reference.
#
# Results are written to a CSV file with a dataset column and its simulation nr.
#
# Note:
# All methods are called via interface functions defined in
# 'simulation_interfaces.R'. In addition this file provides a few convenience
# functions.
source('simulation_interfaces.R')

# Number of simulations
NR.SIM <- 100L
NR.DS <- 7L
max.dim <- 5L

# List of dimension prediction methods.
methods <- c("cv", "elbow", "wilcoxon")

# Build result data matrix (repeat each dataset simulation times).
result <- matrix(NA, nrow = NR.DS * NR.SIM,
ncol = length(methods) + 2) # +2 for dataset and meanMAVE.
result[, 1L] <- rep(paste0("M", seq(NR.DS)), NR.SIM)
colnames(result) <- c("dataset", "true.dim", "mave.cv", methods)

# Create a progress logger for length many reports.
logger <- progress_logger(NR.DS * NR.SIM)
for (sim in seq_len(nrow(result))) {
# Create new dataset.
with(dataset(result[sim, 1L]), { X <<- X; Y <<- Y; B <<- B })
# Write true dr dimension.
result[sim, "true.dim"] <- ncol(B)

# Report simulation progress to user.
logger('')

# Call CVE and let each dimension prediction method estimate.
dr <- cve.call(X = X, Y = Y, max.dim = max.dim)
for (method in methods) {
result[sim, method] <- predict_dim(dr, method = method)$k
}

# The same using meanMAVE and MAVE's 'mave.dim'.
dr <- meanMAVE(X = X, Y = Y, max.dim = max.dim)
result[sim, "mave.cv"] <- which.min(mave.dim(dr)$cv)

# explicit call to the garbage collector.
gc()
}

path <- paste0(getwd(),
format(Sys.time(), '/results/predict_dim_compair_%Y-%m-%dT%H%M%S'))
# Write entire simulation results into a single file.
write.csv(result, file = paste0(path, ".csv"), row.names = FALSE)

+ 128
- 0
simulations/runtime_test.R View File

@@ -0,0 +1,128 @@
# Usage:
# ~$ Rscript runtime_test.R

textplot <- function(...) {
text <- unlist(list(...))
if (length(text) > 20) {
text <- c(text[1:17],
' ...... (skipped, text too long) ......',
text[c(-1, 0) + length(text)])
}

plot(NA, xlim = c(0, 1), ylim = c(0, 1),
bty = 'n', xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')
for (i in seq_along(text)) {
text(0, 1 - (i / 20),
text[[i]], pos = 4)
}
}

# library(CVEpureR) # load CVE's pure R implementation
library(CVE) # load CVE

#' Writes log information to console. (to not get bored^^)
tell.user <- function(name, start, i, length) {
cat("\rRunning Test (", name, "):",
i, "/", length,
" - elapsed:", format(Sys.time() - start), "\033[K")
}
#' Computes "distance" of spanned subspaces.
#' @param B1 Semi-orthonormal basis matrix
#' @param B2 Semi-orthonormal basis matrix
#' @return Frobenius norm of subspace projection matrix diff.
subspace.dist <- function(B1, B2){
P1 <- tcrossprod(B1, B1)
P2 <- tcrossprod(B2, B2)
return(norm(P1 - P2, type = 'F'))
}

# Set random seed
set.seed(437)

# Number of simulations
SIM.NR <- 50L
# maximal number of iterations in curvilinear search algorithm
MAXIT <- 50L
# number of arbitrary starting values for curvilinear optimization
ATTEMPTS <- 10L
# set names of datasets
ds.names <- paste0("M", seq(7))
# Set used CVE method
methods <- c("simple", "weighted") # c("legacy", "simple", "linesearch", "sgd")

# Setup error and time tracking variables
error <- matrix(NA, SIM.NR, length(methods) * length(ds.names))
time <- matrix(NA, SIM.NR, ncol(error))
colnames(error) <- kronecker(paste0(ds.names, '-'), methods, paste0)
colnames(time) <- colnames(error)

# Create new log file and write CSV (actualy TSV) header.
# (do not overwrite existing logs)
log.nr <- length(list.files('tmp/', pattern = '.*\\.log'))
file <- file.path('tmp', paste0('test', log.nr, '.log'))
cat('dataset\tmethod\terror\ttime\n', sep = '', file = file)
# Open a new pdf device for plotting into (do not overwrite existing ones)
path <- paste0('test', log.nr, '.pdf')
pdf(file.path('tmp', path))
cat('Plotting to file:', path, '\n')

# only for telling user (to stdout)
count <- 0
start <- Sys.time()
# Start simulation loop.
for (sim in 1:SIM.NR) {
# Repeat for each dataset.
for (name in ds.names) {
tell.user(name, start, (count <- count + 1), SIM.NR * length(ds.names))

# Create a new dataset
ds <- dataset(name)
# Prepare X, Y and combine to data matrix
Y <- ds$Y
X <- ds$X
data <- cbind(Y, X)
# get dimensions
k <- ncol(ds$B)

for (method in methods) {
dr.time <- system.time(
dr <- cve.call(X, Y,
method = method,
k = k,
attempts = ATTEMPTS
)
)
dr$B <- coef(dr, k)

key <- paste0(name, '-', method)
error[sim, key] <- subspace.dist(dr$B, ds$B) / sqrt(2 * k)
time[sim, key] <- dr.time["elapsed"]

# Log results to file (mostly for long running simulations)
cat(paste0(
c(name, method, error[sim, key], time[sim, key]),
collapse = '\t'
), '\n',
sep = '', file = file, append = TRUE
)
}
}
}

cat("\n\n## Time [sec] Summary:\n")
print(summary(time))
cat("\n## Error Summary:\n")
print(summary(error))

boxplot(error,
main = paste0("Error (Nr of simulations ", SIM.NR, ")"),
ylab = "Error",
las = 2
)
boxplot(time,
main = paste0("Time (Nr of simulations ", SIM.NR, ")"),
ylab = "Time [sec]",
las = 2
)


+ 226
- 0
simulations/simulation_interfaces.R View File

@@ -0,0 +1,226 @@
library(MASS) # required by 'dr'
library(dr) # for `dr`
library(MAVE) # for `mave`, `coef.mave`.
library(CVE)

################################################################################
### simulation utility functions ###
################################################################################

#' Calc distance between two subspaces.
#'
#' || P(A) - P(B) ||_F / sqrt(2 * ncol(A))
#'
#' such that `P(A) = A (A^T A)^-1 A^T` the projection onto
#' the subspace spanned by `A`.
#'
#' @param A matrix which rows span a subspace.
#' @param B matrix of the same dim. as `A`-
#'
#' @returns numeric metric of spanning subspaces distance.
subspace_dist <- function(A, B) {
P <- A %*% solve(t(A) %*% A, t(A))
Q <- B %*% solve(t(B) %*% B, t(B))
norm(P - Q, 'F') / sqrt(2 * ncol(A))
}

#' Creats a progress logger.
#'
#' @param len number of calls to the creted rpogress logger till finished.
#'
#' @returns progress logger function:
#' \code{logger(info)}
#' which is to be called \code{len} times with a string \code{info}.
progress_logger <- function(len) {
start <- Sys.time()
count <- 0
return(function(info) {
cat("\rRunning (", info, "):",
(count <<- count + 1), "/", len,
" - elapsed:", format(Sys.time() - start), "\033[K")
if (count == len) { cat("\n") }
})
}

################################################################################
### cve wrapper ###
################################################################################
CVE <- function(...) {
.cve <- match.call(expand.dots = TRUE)
.cve[[1L]] <- quote(cve.call)
.cve$method <- "simple"
eval(.cve)
}
wCVE <- function(...) {
.cve <- match.call(expand.dots = TRUE)
.cve[[1L]] <- quote(cve.call)
.cve$method <- "weighted"
eval(.cve)
}
rCVE <- function(...) {
.cve <- match.call(expand.dots = TRUE)
.cve[[1L]] <- quote(cve.call)
.cve$method <- "simple"
dr <- eval(.cve)

for (name in names(dr$res)) {
.cve$method <- "weighted"
.cve$k <- dr$res[[name]]$k
.cve$V.init <- dr$res[[name]]$V
dr$res[[name]] <- eval(.cve)$res[[name]]
}
dr$call <- match.call()
dr$call[[1L]] <- quote(cve)
dr$call$method <- "refined"
return(dr)
}

wls <- function(x,y,w){
x <- as.matrix(x); y <- as.vector(y)
n=dim(x)[1];
p=dim(x)[2]-1
out=c(solve(t(x*w)%*%x/n)%*%apply(x*y*w,2,mean))
return(list(a=out[1],b=out[2:(p+1)]))
}

#' Compute kernel matrix.
#'
#' @param X matrix of dim. `n x p` (`n` samples of dimension `p`)
#' @param h numeric bandwidth.
#'
#' @returns `n x n` matrix with kernel of X_i to X_j interaction
#' in its `i, j` component.
kern <- function(X, h) {
X <- as.matrix(X)
n <- nrow(X)

k2 <- X %*% t(X) # (X_i^T X_j)_ij
k3 <- matrix(diag(k2), n, n) # diag(k2) in rows
k1 <- t(k3) # diag(k2) in cols
K <- k1 - 2 * k2 + k3 # (|X_i|^2 - 2 * X_i^T X_j - |X_j|^2)_ij
return(exp((-0.5 / h^2) * K))
}

coef.opg <- function(object, ...) {
return(object$B)
}

OPG <- function(X, Y, k, c0 = 2.34) {
# Cast parameters.
X <- as.matrix(X); Y <- as.vector(Y)
# Get Dimensions
n <- nrow(X); p <- ncol(X)

p0 = max(p, 3)
rn <- n^(-1 / (2 * (p0 + 6)))
h <- c0 * n^(-1 / (p0 + 6))
sig <- diag(var(X))
X <- scale(X)
kmat <- kern(X, h)
bmat <- numeric()
for (i in 1:n) {
wi <- kmat[, i]
xi <- cbind(1, t(t(X) - X[i, ]))
bmat <- cbind(bmat, wls(xi, Y, wi)$b)
}
beta <- eigen(bmat %*% t(bmat))$vectors[, 1:k]
B <- diag(sig^(-1 / 2)) %*% beta

return(structure(list(B = B), class = "opg"))
}

rOPG <- function(X, Y, k, nr.iter = 25L, c0 = 2.34) {
# Cast parameters.
X <- as.matrix(X); Y <- as.vector(Y)
# Get Dimensions
n <- nrow(X); p <- ncol(X)
sig <- diag(var(X))
X <- scale(X)

p0 = max(p, 3)
rn <- n^(-1 / (2 * (p0 + 6)))
h <- c0 * n^(-1 / (p0 + 6))
beta <- diag(p)
for(. in seq_len(nr.iter)) {
kmat <- kern(X %*% beta, h)
bmat <- numeric()
for (i in 1:n) {
wi <- kmat[, i]
xi <- cbind(1, t(t(X) - X[i, ]))
bmat <- cbind(bmat, wls(xi, Y, wi)$b)
}
beta <- eigen(bmat %*% t(bmat))$vectors[, 1:k, drop = F]
h <- max(rn * h, c0 * n^((-1/(k + 4))))
}
B <- diag(sig^(-1/2)) %*% beta

return(structure(list(B = B), class = "opg"))
}

rmave <- function(X, Y, k, nr.iter = 25L, c0 = 2.34) {
X <- as.matrix(X); Y <- as.vector(Y)
# Get dimensions.
n <- nrow(X); p <- ncol(X)

X <- scale(X)
sig <- attr(X, "scaled:scale")^2 # diag(var(X))

p0 <- max(p,3)
h <- c0 * n^(-1 / (p0 + 6))
rn <- n^(-1 / (2 * (p0 + 6)))

beta <- OPG(X, Y, k)$B
for (. in seq_len(nr.iter)) {
kermat <- kern(X %*% beta, h)
mkermat <- colMeans(kermat)

a <- numeric()
b <- numeric()
for (i in 1:n) {
wi <- kermat[ ,i] / mkermat[i]
ui <- cbind(1, t(t(X) - X[i, ]) %*% beta)
out <- wls(ui, Y, wi)
a <- c(a, out$a)
b <- cbind(b, out$b)
}
out <- 0
out1 <- 0
for (i in 1:n) {
xi <- kronecker(t(t(X) - X[i, ]), t(b[, i]))
yi <- Y - a[i]
wi <- kermat[, i] / mkermat[i]
out <- out + colMeans(xi * yi * wi)
out1 <- out1 + t(xi * wi) %*% xi / n
}
beta <- t(matrix(solve(out1, out), k, p))
h <- max(rn * h, c0 * n^(-1 / (k + 4)))
}
B <- diag(sig^-0.5) %*% beta
return(structure(list(B = B), class = "opg"))
}

meanOPG <- function(X, Y, max.dim = 10L, ...) {
mave(Y ~ X, method = "meanOPG", max.dim = max.dim)
}

meanMAVE <- function(X, Y, max.dim = 10L, ...) {
mave(Y ~ X, method = "meanMAVE", max.dim = max.dim)
}

coef.phdy <- coef.sir <- coef.save <- function(object, k, ...) {
matrix(object$evectors[, 1:k], ncol = k)
}

phdy <- function(X, Y, k, ...) {
dr(Y ~ X, method = 'phdy', numdir = k)
}

sir <- function(X, Y, k, ...) {
dr(Y ~ X, method = 'sir', numdir = k)
}

save <- function(X, Y, k, ...) {
dr(Y ~ X, method = 'save', numdir = k)
}

Loading…
Cancel
Save