wip: GMLM tex

This commit is contained in:
Daniel Kapla 2022-10-13 10:53:45 +02:00
parent 79794f01ac
commit cc9222cb62
1 changed files with 202 additions and 138 deletions

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Notation}
Vectors are written as boldface lowercase letters (e.g. $\mat a$, $\mat b$), matrices use boldface uppercase or Greek letters (e.g. $\mat A$, $\mat B$, $\mat\alpha$, $\mat\Delta$). The identity matrix of dimensions $p\times p$ is denoted by $\mat{I}_p$ and the commutation matrix by $\mat{K}_{p, q}$, or $\mat{K}_p$ in case of $p = q$. Tensors, meaning multi-dimensional arrays of order at least 3, use uppercase calligraphic letters (e.g. $\ten{A}$, $\ten{B}$, $\ten{X}$, $\ten{Y}$, $\ten{F}$). Boldface indices (e.g. $\mat{i}, \mat{j}, \mat{k}$) denote multi-indices $\mat{i} = (i_1, ..., i_r)\in[\mat{d}]$ where the bracket notation is a shorthand for $[r] = \{1, ..., r\}$ which in conjunction with a multi-index as argument means $[\mat{d}] = [d_1]\times ... \times[d_r] = \{ (i_1, ..., i_r)\in\mathbb{N}^r : 1\leq i_k\leq d_k, \forall k = 1, ..., r \}$.
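For example, for $\mat{d} = (2, 3)$ the multi-index set is $[\mat{d}] = [2]\times[3] = \{(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)\}$.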
Let $\ten{A} = (a_{i_1,...,i_r})\in\mathbb{R}^{d_1\times ...\times d_r}$ be an order\footnote{Also called rank, therefore the variable name $r$, but this term is \emph{not} used as it leads to confusion with the rank as in ``the rank of a matrix''.} $r$ tensor where $r\in\mathbb{N}$ is the number of modes or axes of $\ten{A}$. For matrices $\mat{B}_k\in\mathbb{R}^{p_k\times d_k}$ with $k\in[r] = \{1, 2, ..., r\}$ the \emph{multi-linear multiplication} is defined element-wise as
\begin{displaymath}
(\ten{A}\times\{\mat{B}_1, ..., \mat{B}_r\})_{j_1, ..., j_r} = \sum_{i_1, ..., i_r = 1}^{d_1, ..., d_r} a_{i_1, ..., i_r}(B_{1})_{j_1, i_1} \cdots (B_{r})_{j_r, i_r}
\end{displaymath}
which results in an order $r$ tensor of dimensions $p_1\times ...\times p_r$. With this, the \emph{$k$-mode product} between the tensor $\ten{A}$ and the matrix $\mat{B}_k$ is given by
\begin{displaymath}
\ten{A}\times_k\mat{B}_k = \ten{A}\times\{\mat{I}_{d_1}, ..., \mat{I}_{d_{k-1}}, \mat{B}_{k}, \mat{I}_{d_{k+1}}, ..., \mat{I}_{d_r}\}.
\end{displaymath}
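For example, if $\ten{A}\in\mathbb{R}^{2\times 3\times 4}$ and $\mat{B}_2\in\mathbb{R}^{5\times 3}$, then $\ten{A}\times_2\mat{B}_2\in\mathbb{R}^{2\times 5\times 4}$. In the matrix case, that is for an order $2$ tensor $\mat{A}$, the multi-linear multiplication reduces to $\mat{A}\times\{\mat{B}_1, \mat{B}_2\} = \mat{B}_1\mat{A}\t{\mat{B}_2}$.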
Furthermore, the notation $\ten{A}\times_{k\in S}$ is a shorthand for the iterative application of the mode product for all indices in $S\subseteq[r]$. For example $\ten{A}\times_{k\in\{2, 5\}}\mat{B}_k = \ten{A}\times_2\mat{B}_2\times_5\mat{B}_5$. By only allowing $S$ to be a set, this notation is unambiguous, because mode products commute for different modes, that is $j\neq k$ implies $\ten{A}\times_j\mat{B}_j\times_k\mat{B}_k = \ten{A}\times_k\mat{B}_k\times_j\mat{B}_j$.
The \emph{inner product} between two tensors of the same order and dimensions is
\begin{displaymath}
\langle\ten{A}, \ten{B}\rangle = \sum_{i_1, ..., i_r} a_{i_1, ..., i_r} b_{i_1, ..., i_r}
\end{displaymath}
with which the \emph{Frobenius Norm} is defined as $\|\ten{A}\|_F = \sqrt{\langle\ten{A}, \ten{A}\rangle}$. Of interest is also the \emph{maximum norm} $\|\ten{A}\|_{\infty} = \max_{i_1, ..., i_r} |a_{i_1, ..., i_r}|$. Furthermore, the Frobenius and maximum norms are also used for matrices, while for a vector $\mat{a}$ the \emph{2 norm} is $\|\mat{a}\|_2 = \sqrt{\langle\mat{a}, \mat{a}\rangle}$.
Matrices and tensors can be \emph{vectorized} by the \emph{vectorization} operator $\vec$. Sometimes only the order of the elements in an object is of interest; therefore, we use the notation $\ten{A}\equiv \ten{B}$ for objects $\ten{A}, \ten{B}$ of any shape if and only if $\vec{\ten{A}} = \vec{\ten{B}}$. For tensors of order at least $2$ the \emph{flattening} (or \emph{unfolding} or \emph{matricization}) is a reshaping of the tensor into a matrix along a particular mode. For a tensor $\ten{A}$ of order $r$ and dimensions $d_1, ..., d_r$ the $k$-mode unfolding $\ten{A}_{(k)}$ is a $d_k\times \prod_{l=1, l\neq k}d_l$ matrix. For the tensor $\ten{A} = (a_{i_1,...,i_r})\in\mathbb{R}^{d_1\times ...\times d_r}$ the elements of the $k$-mode unfolded tensor $\ten{A}_{(k)}$ are
\begin{displaymath}
(\ten{A}_{(k)})_{i_k, j} = a_{i_1, ..., i_r}\quad\text{ with }\quad j = 1 + \sum_{\substack{l = 1\\l \neq k}}^r (i_l - 1) \prod_{\substack{m = 1\\m\neq k}}^{l - 1}d_m.
\end{displaymath}
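For example, for an order $3$ tensor $\ten{A}\in\mathbb{R}^{d_1\times d_2\times d_3}$ the element $a_{i_1, i_2, i_3}$ of the $2$-mode unfolding $\ten{A}_{(2)}$ is placed in row $i_2$ and column $j = 1 + (i_1 - 1) + (i_3 - 1) d_1$.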
The rank of a tensor $\ten{A}$ of dimensions $d_1\times ...\times d_r$ is given by a vector $\rank{\ten{A}} = (a_1, ..., a_r)\in[d_1]\times...\times[d_r]$ where $a_k = \rank(\ten{A}_{(k)})$ is the usual matrix rank of the $k$-mode unfolded tensor.
\subsection{Sufficient Dimension Reduction}
\todo{TODO!!!}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Quadratic Exponential Family GLM}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We propose a model-based inverse regression method for estimation .... \todo{TODO!!!}
\paragraph{Distribution}
\begin{equation}\label{eq:exp-family}
f_{\mat{\theta}_y}(\ten{X}\mid Y = y) = h(\ten{X})\exp(\t{\mat{\eta}(\mat{\theta}_y)}\mat{t}(\ten{X}) - b(\mat{\theta}_y))
\end{equation}
\paragraph{(inverse) link}
\begin{displaymath}
\invlink(\mat{\eta}(\mat{\theta}_y)) = \E_{\mat{\theta}_y}[\mat{t}(\ten{X})\mid Y = y]
\end{displaymath}
\paragraph{(multi) linear predictor} For
\begin{displaymath}
\mat{\eta}_y = \mat{\eta}(\mat{\theta}_y) = \begin{pmatrix}
\mat{\eta}_1(\mat{\theta}_y) \\
\mat{\eta}_2(\mat{\theta}_y)
\end{pmatrix},\qquad
\mat{t}(\ten{X}) = \begin{pmatrix}
\mat{t}_1(\ten{X}) \\
\mat{t}_2(\ten{X})
\end{pmatrix} = \begin{pmatrix}
\vec{\ten{X}} \\
\vec{\ten{X}}\otimes\vec{\ten{X}}
\end{pmatrix}
\end{displaymath}
where
\begin{align*}
\mat{\eta}_1(\mat{\theta}_y) &= \mat{\eta}_{y,1} = c_1 \vec(\overline{\ten{\eta}}_1 + \ten{F}_y\times_{k\in[r]}\mat{\alpha}_k) \\
\mat{\eta}_2(\mat{\theta}_y) &= \mat{\eta}_{y,2} = c_2 \vec{\bigotimes_{k = r}^1 \mat{\Omega}_k}
\end{align*}
with model parameters $\overline{\ten{\eta}}_1, \mat{\alpha}_1, ..., \mat{\alpha}_r, \mat{\Omega}_1, ..., \mat{\Omega}_r$ where $\overline{\ten{\eta}}_1$ is a $p_1\times ... \times p_r$ tensor, the $\mat{\alpha}_j$ are unconstrained $p_j\times q_j$ matrices and the $\mat{\Omega}_j$ are symmetric $p_j\times p_j$ matrices for each of the $j = 1, ..., r$ modes. Finally, $c_1$ and $c_2$ are known constants that simplify modeling for specific distributions.
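For example, for matrix-valued predictors, that is $r = 2$, the multi-linear predictor specializes to
\begin{displaymath}
    \mat{\eta}_{y,1} = c_1 \vec(\overline{\ten{\eta}}_1 + \mat{\alpha}_1\ten{F}_y\t{\mat{\alpha}_2}), \qquad
    \mat{\eta}_{y,2} = c_2 \vec(\mat{\Omega}_2\otimes\mat{\Omega}_1).
\end{displaymath}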
% With that approach we get
% \begin{displaymath}
% \t{\mat{\eta}(\mat{\theta}_y)}\mat{t}(\ten{X}) = \t{\mat{\eta}_{y,1}}\mat{t}_1(\ten{X}) + \t{\mat{\eta}_{y,2}}\mat{t}_2(\ten{X}) = \langle\overline{\ten{\eta}}_1 + \ten{F}_y\times_{k\in[r]}\mat{\alpha}_k, \ten{X} \rangle + \langle\ten{X}\times_{k\in[r]}\mat{\Omega}_k, \ten{X} \rangle.
% \end{displaymath}
\begin{theorem}[Log-Likelihood and Score]
For $n$ i.i.d. observations $(\ten{X}_i, y_i), i = 1, ..., n$ the log-likelihood has the form
\begin{displaymath}
l(\mat{\eta}_y) = \sum_{i = 1}^n(\log h(\ten{X}_i) + c_1\langle\overline{\ten{\eta}}_1 + \ten{F}_{y_i}\times_{k\in[r]}\mat{\alpha}_k, \ten{X}_i \rangle + c_2\langle\ten{X}_i\times_{k\in[r]}\mat{\Omega}_k, \ten{X}_i \rangle - b(\mat{\eta}_{y_i})).
\end{displaymath}
% The MLE estimate for the intercept term $\overline{\ten{\eta}}_1$ is
% \begin{displaymath}
% \widehat{\ten{\eta}}_1 = \frac{1}{n}\sum_{i = 1}^n \ten{X}_i
% \end{displaymath}
The gradients with respect to the GLM parameters $\overline{\ten{\eta}}_1$, $\mat{\alpha}_j$ and $\mat{\Omega}_j$ for $j = 1, ..., r$ are given by
\begin{align*}
% \nabla_{\overline{\ten{\eta}}_1}l &= c_1\sum_{i = 1}^n \reshape{\mat{p}}(\mat{t}_1(\ten{X}_i) - \invlink_1(\mat{\eta}_{y_i})), \\
\nabla_{\overline{\ten{\eta}}_1}l &\equiv c_1\sum_{i = 1}^n (\mat{t}_1(\ten{X}_i) - \invlink_1(\mat{\eta}_{y_i})), \\
\nabla_{\mat{\alpha}_j}l &= c_1 \sum_{i = 1}^n \reshape{\mat{p}}(\mat{t}_1(\ten{X}_i) - \invlink_1(\mat{\eta}_{y_i}))_{(j)}\t{(\ten{F}_{y_i}\times_{k\in[r]\backslash j}\mat{\alpha}_k)_{(j)}}, \\
\nabla_{\mat{\Omega}_j}l &\equiv c_2 \mat{D}_{p_j}\t{\mat{D}_{p_j}} \reshape{(\mat{p}, \mat{p})}\!\!\Big(\sum_{i = 1}^n(\mat{t}_2(\ten{X}_i) - \invlink_2(\mat{\eta}_{y_i}))\Big)_{(j, r + j)}\vec\bigotimes_{\substack{k = r\\k\neq j}}^{1}\mat{\Omega}_k
\end{align*}
% The Fisher Information for the GLM parameters is given block wise by
% \begin{displaymath}
% % \mathcal{I}_{\ten{X}\mid Y = y}(\vec{\overline{\ten{\eta}}_1}, \vec\mat{\alpha}_1, ..., \vec\mat{\alpha}_r, \vec\mat{\Omega}_1, ..., \vec\mat{\Omega}_r) = \begin{pmatrix}
% \mathcal{I}_{\ten{X}\mid Y = y} = \begin{pmatrix}
% \mathcal{I}(\overline{\ten{\eta}}_1) & \mathcal{I}(\overline{\ten{\eta}}_1, \mat{\alpha}_1) & \cdots & \mathcal{I}(\overline{\ten{\eta}}_1, \mat{\alpha}_r) & \mathcal{I}(\overline{\ten{\eta}}_1, \mat{\Omega}_1) & \cdots & \mathcal{I}(\overline{\ten{\eta}}_1, \mat{\Omega}_r) \\
% \mathcal{I}(\mat{\alpha}_1, \overline{\ten{\eta}}_1) & \mathcal{I}(\mat{\alpha}_1) & \cdots & \mathcal{I}(\mat{\alpha}_1, \mat{\alpha}_r) & \mathcal{I}(\mat{\alpha}_1, \mat{\Omega}_1) & \cdots & \mathcal{I}(\mat{\alpha}_1, \mat{\Omega}_r) \\
% \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
% \mathcal{I}(\mat{\alpha}_r, \overline{\ten{\eta}}_1) & \mathcal{I}(\mat{\alpha}_r, \mat{\alpha}_1) & \cdots & \mathcal{I}(\mat{\alpha}_r) & \mathcal{I}(\mat{\alpha}_r, \mat{\Omega}_1) & \cdots & \mathcal{I}(\mat{\alpha}_r, \mat{\Omega}_r) \\
% \mathcal{I}(\mat{\Omega}_1, \overline{\ten{\eta}}_1) & \mathcal{I}(\mat{\Omega}_1, \mat{\alpha}_1) & \cdots & \mathcal{I}(\mat{\Omega}_1, \mat{\alpha}_r) & \mathcal{I}(\mat{\Omega}_1) & \cdots & \mathcal{I}(\mat{\Omega}_1, \mat{\Omega}_r) \\
% \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
% \mathcal{I}(\mat{\Omega}_r, \overline{\ten{\eta}}_1) & \mathcal{I}(\mat{\Omega}_r, \mat{\alpha}_1) & \cdots & \mathcal{I}(\mat{\Omega}_r) & \mathcal{I}(\mat{\Omega}_r, \mat{\Omega}_1) & \cdots & \mathcal{I}(\mat{\Omega}_r)
% \end{pmatrix}
% \end{displaymath}
% where
% \begin{align*}
% \mathcal{I}(\overline{\ten{\eta}}_1) &= -\sum_{i = 1}^n \cov_{\mat{\theta}_{y_i}}(\vec\ten{X}\mid Y = y_i), \\
% \mathcal{I}(\mat{\alpha}_j) &= -\sum_{i = 1}^n ((\ten{F}_{y_i}\times_{k\in[r]\backslash j}\mat{\alpha}_k)_{(j)}\otimes\mat{I}_{p_j})\mat{K}_{\mat{p},(j)}\cov_{\mat{\theta}_{y_i}}(\vec\ten{X}\mid Y = y_i)\t{\mat{K}_{\mat{p},(j)}}(\t{(\ten{F}_{y_i}\times_{k\in[r]\backslash j}\mat{\alpha}_k)_{(j)}}\otimes\mat{I}_{p_j}), \\
% \mathcal{I}(\mat{\alpha}_j) &= -\sum_{i = 1}^n \todo{continue}
% \end{align*}
% \todo{Fisher Information}
\end{theorem}
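As a sanity check, for vector-valued predictors ($r = 1$, so that $\ten{F}_y = \mat{f}_y\in\mathbb{R}^{q_1}$ and $\mat{t}_1(\ten{X}_i) = \vec{\ten{X}_i}$) the score for $\mat{\alpha}_1$ reduces to the familiar form $\nabla_{\mat{\alpha}_1}l = c_1\sum_{i = 1}^{n}(\vec{\ten{X}_i} - \invlink_1(\mat{\eta}_{y_i}))\t{\mat{f}_{y_i}}$.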
Illustration of the dimensions appearing in the $\mat{\Omega}_j$ gradient above:
\begin{displaymath}
\underbrace{ \mat{D}_{p_j}\t{\mat{D}_{p_j}} }_{\makebox[0pt]{\scriptsize $p_j^2\times p_j^2$}}
%
\underbrace{%
\overbrace{\reshape{(\mat{p}, \mat{p})}\!\!\Big(\sum_{i = 1}^n
\underbrace{ (\mat{t}_2(\ten{X}_i) - \invlink_2(\mat{\eta}_{y_i})) }_{p^2\times 1}
\Big)}^{\substack{\text{(tensor of order $2 r$)}\\p_1\times p_2\times ... \times p_r\times p_1\times p_2\times ... \times p_r}} \!\!\makebox[0pt]{\phantom{\Big)}}_{(j, r + j)}
}_{\substack{p_j^2\times (p / p_j)^2\\\text{(matricized / put $j$ mode axis to the front)}}}
%
\underbrace{%
\vec \overbrace{ \bigotimes_{\substack{k = r\\k\neq j}}^{1}\mat{\Omega}_k }^{\makebox[0pt]{\scriptsize $(p/p_j)\times (p/p_j)$}}
}_{\makebox[0pt]{\scriptsize $(p/p_j)^2\times 1$}}
\end{displaymath}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Sufficient Dimension Reduction}
Now note $\Span(\mat{A}) = \Span(c \mat{A})$ for any matrix $\mat{A}$ and non-zero scalar $c$ as well as the definition $\mat{t}_1(\ten{X}) = \vec{\ten{X}}$ which proves the following.
\begin{theorem}[SDR]\label{thm:sdr}
A sufficient reduction for the regression $Y\mid \ten{X}$ under the quadratic exponential family inverse regression model \eqref{eq:exp-family} is given by
\begin{align*}
\mat{R}(\ten{X})
&= \t{\mat{\beta}}(\vec{\ten{X}} - \E\vec{\ten{X}}) \\
&\equiv (\ten{X} - \E\ten{X})\times_{k\in[r]}\t{\mat{\alpha}_k}
\end{align*}
for a $p\times q$ dimensional matrix $\mat{\beta}=\bigotimes_{k = r}^{1}\mat{\alpha}_k$ which satisfies $\Span(\mat{\beta}) = \Span(\{\mat{\eta}_{Y,1} - \E_{Y}\mat{\eta}_{Y,1} : Y\in\mathcal{S}_Y\})$.
\end{theorem}
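For example, for matrix-valued predictors ($r = 2$) the reduction can be computed mode-wise as the $q_1\times q_2$ matrix $\mat{R}(\ten{X}) \equiv \t{\mat{\alpha}_1}(\ten{X} - \E\ten{X})\mat{\alpha}_2$, without ever forming the $p\times q$ matrix $\mat{\beta}$.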
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Examples}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We illustrate the SDR method on two special cases, first the Tensor Normal distribution and second the Multi-Variate Bernoulli distribution, with vector-, matrix- and tensor-valued predictors.
then with $\overline{\ten{R}} = \frac{1}{n}\sum_{i = 1}^n \ten{R}_i$ we get
\begin{align*}
\D l(\mat{\Omega}_j) &= \frac{1}{2}\t{\vec\Big(\frac{p}{p_j}\mat{\Omega}_j^{-1} - (\ten{X} + \mu_y)_{(j)}\t{(\ten{R}\times_{k\in[r]\backslash j}\mat{\Omega}_k)_{(j)}}\Big)}\mat{D}_{p_j}\t{\mat{D}_{p_j}}
\end{align*}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Initial Values}
First we set the gradient with respect to $\overline{\ten{\eta}}_1$ to zero
\begin{gather*}
0 \overset{!}{=} \nabla_{\overline{\ten{\eta}}_1}l = c_1\sum_{i = 1}^n (\ten{X}_i - \ten{\mu}_i) \\
\overline{\ten{X}} = (\overline{\ten{\eta}}_1 + \overline{\ten{F}_y}\times_{k\in[r]}\mat{\alpha}_{k})\times_{l\in[r]}\mat{\Omega}_{l}^{-1} \\
\overline{\ten{X}}\times_{l\in[r]}\mat{\Omega}_{l} = \overline{\ten{\eta}}_1 + \overline{\ten{F}_y}\times_{k\in[r]}\mat{\alpha}_{k} \approx \overline{\ten{\eta}}_1 \\
\overline{\ten{\eta}}_1^{(0)} = \overline{\ten{X}}\times_{k\in[r]}\mat{\Omega}_{k}^{(0)}
\end{gather*}
where the approximation is due to the assumption that $\E \ten{F}_y = 0$. For the initial values of the scatter matrices $\mat{\Omega}_{l}$ we ignore the relation to the response and simply estimate them as the marginal scatter matrices. These initial estimates overemphasize the variability in the reduction subspace. Therefore, we first compute the unscaled mode covariance estimates
\begin{displaymath}
\widetilde{\mat{\Delta}}_j^{(0)} = \frac{p_j}{n p} (\ten{X} - \overline{\ten{X}})_{(j)}\t{(\ten{X} - \overline{\ten{X}})_{(j)}}.
\end{displaymath}
The next step is to scale them such that their Kronecker product has an appropriate trace
\begin{displaymath}
\mat{\Delta}_j^{(0)} = \left(\frac{\|\ten{X} - \overline{\ten{X}}\|_F^2}{n \prod_{k = 1}^r \tr(\widetilde{\mat{\Delta}}_j^{(0)})}\right)^{1 / r} \widetilde{\mat{\Delta}}_j^{(0)}.
\end{displaymath}
Finally, the covariances need to be inverted to give initial estimates of the scatter matrices
\begin{displaymath}
\mat{\Omega}_j^{(0)} = (\mat{\Delta}_j^{(0)})^{-1}.
\end{displaymath}
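A minimal numerical sketch of these three initialization steps, under the assumption that the sample is stored as a \texttt{numpy} array \texttt{X} of shape $n\times p_1\times ...\times p_r$ with the observation index in the first mode (the helper name \texttt{init\_scatter\_matrices} is purely illustrative):
\begin{verbatim}
import numpy as np

def init_scatter_matrices(X):
    # X: sample array of shape (n, p_1, ..., p_r), observations in the first mode
    n, *p = X.shape
    r, prod_p = len(p), int(np.prod(p))
    Xc = X - X.mean(axis=0)                 # center the sample
    # step 1: unscaled mode covariance estimates \tilde{Delta}_j^(0)
    Delta_tilde = []
    for j in range(r):
        # j-mode unfolding of every centered observation: shape (n, p_j, p / p_j)
        Xj = np.moveaxis(Xc, j + 1, 1).reshape(n, p[j], -1)
        Delta_tilde.append(p[j] / (n * prod_p) * np.einsum('nik,njk->ij', Xj, Xj))
    # step 2: rescale so that the Kronecker product has an appropriate trace
    scale = (np.sum(Xc**2) / (n * np.prod([np.trace(D) for D in Delta_tilde]))) ** (1 / r)
    Delta = [scale * D for D in Delta_tilde]
    # step 3: invert the covariances to obtain the initial scatter matrices Omega_j^(0)
    return [np.linalg.inv(D) for D in Delta]
\end{verbatim}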
The really interesting part is to get initial estimates for the $\mat{\alpha}_j$ matrices. Setting the $\mat{\alpha}_j$ gradient to zero and substituting the initial estimates for $\overline{\ten{\eta}}_1$ and the $\mat{\Omega}_k$'s gives
\begin{gather*}
0 \overset{!}{=} \nabla_{\mat{\alpha}_j}l = c_1 \sum_{i = 1}^n \reshape{\mat{p}}(\mat{t}_1(\ten{X}_i) - \mat{g}_1(\mat{\eta}_{y_i}))_{(j)}\t{(\ten{F}_{y_i}\times_{k\in[r]\backslash j}\mat{\alpha}_k)_{(j)}} \\
(\ten{X} - \overline{\ten{X}})_{(j)}\t{(\ten{F}_y \times_{k\in[r]\backslash j}\mat{\alpha}_k)_{(j)}}
= \mat{\Omega}_j^{(0)}\mat{\alpha}_j(\ten{F}_y\times_{k\in[r]\backslash j}\mat{\Omega}_k^{(0)}\mat{\alpha}_k)_{(j)}\t{(\ten{F}_y \times_{k\in[r]\backslash j}\mat{\alpha}_k)_{(j)}}
\end{gather*}
Now, letting $\mat{\Sigma}_k$ be the mode covariances of $\ten{F}_y$ and defining the whitened $\ten{W}_y = \ten{F}_y\times_{k\in[r]}\mat{\Sigma}_k^{-1/2}$, we get
\begin{gather*}
(\ten{X} - \overline{\ten{X}})_{(j)}\t{(\ten{F}_y \times_{k\in[r]\backslash j}\mat{\alpha}_k)_{(j)}}
= \mat{\Omega}_j^{(0)}\mat{\alpha}_j\mat{\Sigma}_j^{1/2}(\ten{W}_y\times_{k\in[r]\backslash j}\mat{\Omega}_k^{(0)}\mat{\alpha}_k\mat{\Sigma}_k^{1/2})_{(j)}\t{(\ten{W}_y \times_{k\in[r]\backslash j}\mat{\alpha}_k \mat{\Sigma}_{k}^{1/2})_{(j)}}\mat{\Sigma}_{j}^{1/2} \\
= \mat{\Omega}_j^{(0)}\mat{\alpha}_j\mat{\Sigma}_j^{1/2}(\ten{W}_y)_{(j)}\Big(\mat{I}_n\otimes\bigotimes_{\substack{k = r\\k\neq j}}^{1}\mat{\Sigma}_k^{1/2}\t{\mat{\alpha}_k}\mat{\Omega}_k^{(0)}\mat{\alpha}_k\mat{\Sigma}_{k}^{1/2}\Big)\t{(\ten{W}_y)_{(j)}}\mat{\Sigma}_{j}^{1/2}.
\end{gather*}
Now we let $\mat{\alpha}_j^{(0)}$ be such that $\mat{\Sigma}_k^{1/2}\t{(\mat{\alpha}^{(0)}_k)}\mat{\Omega}_k^{(0)}\mat{\alpha}^{(0)}_k\mat{\Sigma}_{k}^{1/2} = \mat{I}_{q_k}$ for every $k$, which leads by substitution to
\begin{displaymath}
(\ten{X} - \overline{\ten{X}})_{(j)}\t{(\ten{F}_y\times_{k\in[r]\backslash j}\mat{\alpha}^{(0)}_k)_{(j)}}
= \mat{\Omega}_j^{(0)}\mat{\alpha}^{(0)}_j\mat{\Sigma}_j^{1/2}(\ten{W}_y)_{(j)}\t{(\ten{W}_y)_{(j)}}\mat{\Sigma}_{j}^{1/2}
= \frac{p_j}{n p}\mat{\Omega}_j^{(0)}\mat{\alpha}^{(0)}_j\mat{\Sigma}_j
\end{displaymath}
\todo{Does this make sense?!?!?!}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Ising Model}
The conditional Ising model for the inverse regression $\ten{X}\mid Y = y$ with $p (p + 1) / 2$ parameters $\mat{\theta}_y$ is given by
\begin{align*}
P_{\mat{\theta}_y}(\ten{X}\mid Y = y)
&= p_0(\mat{\theta}_y)\exp(\t{\vech(\vec(\ten{X})\t{\vec(\ten{X})})}\mat{\theta}_y) \\
&= h(\ten{X})\exp(\t{\mat{{\eta}}(\mat{\theta}_y)}\mat{t}(\ten{X}) - b(\mat{\theta}_y))
\end{align*}
where $h(\ten{X}) = 1$ and $b(\mat{\theta}_y) = -\log p_0(\mat{\theta}(\mat{\eta}_y))$.
Following the GMLM model, we model the natural parameters as
\begin{align*}
\mat{\eta}_{y,1} &\equiv c_1 (\overline{\ten{\eta}}_1 + \ten{F}_y\times_{k\in[r]}\mat{\alpha}_k), &
\mat{\eta}_{y,2} &\equiv c_2 \bigotimes_{k = r}^{1}\mat{\Omega}_k.
\end{align*}
where we set the constants $c_1 = c_2 = 1$. This yields the following relation to the conditional Ising model parameters
\begin{displaymath}
\mat{\theta}_y = \mat{\theta}(\mat{\eta}_y) = \vech(\diag(\mat{\eta}_{y,1}) + (1_{p\times p} - \mat{I}_p) \odot \reshape{(p, p)}(\mat{\eta}_{y,2}))
\end{displaymath}
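For illustration, take $p = 2$ with a single mode ($r = 1$), write $\mat{\eta}_{y,1} = \t{(\eta_{y,1,1}, \eta_{y,1,2})}$ and $\mat{\Omega}_1 = (\omega_{kl})$, and recall $c_1 = c_2 = 1$. The relation above then reads
\begin{displaymath}
    \mat{\theta}_y = \vech\begin{pmatrix} \eta_{y,1,1} & \omega_{12} \\ \omega_{12} & \eta_{y,1,2} \end{pmatrix}
    = \t{(\eta_{y,1,1}, \omega_{12}, \eta_{y,1,2})},
\end{displaymath}
that is, the linear natural parameters occupy the diagonal and the single interaction parameter the off-diagonal.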
The ``inverse'' link is then computed via the Ising model as the conditional expectation of all interactions
\begin{align*}
\invlink_2(\mat{\eta}_y) \equiv \E_{\mat{\theta}_y}[\vec(\ten{X})\t{\vec(\ten{X})}\mid Y = y]
\end{align*}
which incorporates the first moment. In other words $\invlink_1(\mat{\eta}_y) = \diag(\E_{\mat{\theta}_y}[\vec(\ten{X})\t{\vec(\ten{X})}\mid Y = y])$.
Note that the diagonal elements of the $\mat{\Omega}_k$ are multiplied by $0$, which means they are ignored. This reflects the fact that under the Ising model (and in general for the multivariate Bernoulli) $\E[\ten{X}_{\mat{j}}\mid Y] = \E[\ten{X}_{\mat{j}}\ten{X}_{\mat{j}}\mid Y]$ holds, as the entries are $0$ or $1$ only. Therefore, our model is over-parameterized, as the diagonal elements of the $\mat{\Omega}_k$ and $\overline{\ten{\eta}}_1$ serve the same purpose.
\section{Proofs}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proof}[Proof of Theorem~\ref{thm:sdr}]
abc
\end{proof}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Distributions}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Tensor Normal Distribution}
Let $\ten{X}$ be a multi-dimensional array random variable of order $r$ with dimensions $p_1\times ... \times p_r$ written as
\begin{displaymath}
\ten{X}\sim\mathcal{TN}(\mu, \mat{\Delta}_1, ..., \mat{\Delta}_r).
\end{displaymath}
Its density is given by
\begin{displaymath}
f(\ten{X}) = \Big( \prod_{i = 1}^r \sqrt{(2\pi)^{p_i}|\mat{\Delta}_i|^{p / p_i}} \Big)^{-1}
\exp\!\left( -\frac{1}{2}\langle \ten{X} - \mu, (\ten{X} - \mu)\times\{\mat{\Delta}_1^{-1}, ..., \mat{\Delta}_r^{-1}\} \rangle \right)
\end{displaymath}
where $p = \prod_{i = 1}^r p_i$. This is equivalent to the vectorized $\vec\ten{X}$ following a Multi-Variate Normal distribution
\begin{displaymath}
\vec{\ten{X}}\sim\mathcal{N}_{p}(\vec{\mu}, \mat{\Delta}_r\otimes...\otimes\mat{\Delta}_1).
\end{displaymath}
\begin{theorem}[Tensor Normal via Multi-Variate Normal]
Let $\ten{X}$ be a multi-dimensional random variable of order $r$ with dimensions $p_1\times ...\times p_r$, let $\ten{\mu}$ be a mean of the same order and dimensions as $\ten{X}$, and let $\mat{\Delta}_i$ be mode covariance matrices of dimensions $p_i\times p_i$ for $i = 1, ..., r$. Then the tensor normal distribution is equivalent to the multi-variate normal distribution by the relation
\begin{displaymath}
\ten{X}\sim\mathcal{TN}(\mu, \mat{\Delta}_1, ..., \mat{\Delta}_r)
\qquad\Leftrightarrow\qquad
\vec{\ten{X}}\sim\mathcal{N}_{p}(\vec{\mu}, \mat{\Delta}_r\otimes ...\otimes \mat{\Delta}_1)
\end{displaymath}
where $p = \prod_{i = 1}^r p_i$.
\end{theorem}
\begin{proof}
A straightforward way is to rewrite the Tensor Normal density as the density of a Multi-Variate Normal distribution depending on the vectorization of $\ten{X}$. First, consider
\begin{multline*}
\langle \ten{X} - \mu, (\ten{X} - \mu)\times\{\mat{\Delta}_1^{-1}, ..., \mat{\Delta}_r^{-1}\} \rangle
= \t{\vec(\ten{X} - \mu)}\vec((\ten{X} - \mu)\times\{\mat{\Delta}_1^{-1}, ..., \mat{\Delta}_r^{-1}\}) \\
= \t{\vec(\ten{X} - \mu)}(\mat{\Delta}_r^{-1}\otimes ...\otimes\mat{\Delta}_1^{-1})\vec(\ten{X} - \mu) \\
= \t{(\vec\ten{X} - \vec\mu)}(\mat{\Delta}_r\otimes ...\otimes\mat{\Delta}_1)^{-1}(\vec\ten{X} - \vec\mu).
\end{multline*}
Next, using a property of the determinant of a Kronecker product $|\mat{\Delta}_1\otimes\mat{\Delta}_2| = |\mat{\Delta}_1|^{p_2}|\mat{\Delta}_2|^{p_1}$ yields
\begin{displaymath}
|\mat{\Delta}_r\otimes...\otimes\mat{\Delta}_1|
= |\mat{\Delta}_r\otimes...\otimes\mat{\Delta}_2|^{p_1}|\mat{\Delta}_1|^{p / p_1}
\end{displaymath}
where $p = \prod_{j = 1}^r p_j$. By induction over $r$ the relation
\begin{displaymath}
|\mat{\Delta}_r\otimes...\otimes\mat{\Delta}_1|
= \prod_{i = 1}^r |\mat{\Delta}_i|^{p / p_i}
\end{displaymath}
holds for arbitrary order $r$. Substituting into the Tensor Normal density leads to
\begin{align*}
f(\ten{X}) = \Big( (2\pi)^p |\mat{\Delta}| \Big)^{-1/2}
\exp\!\left( -\frac{1}{2}\t{(\vec\ten{X} - \vec\mu)}\mat{\Delta}^{-1}(\vec\ten{X} - \vec\mu) \right)
\end{align*}
with $\mat{\Delta} = \mat{\Delta}_r\otimes...\otimes\mat{\Delta}_1$ which is the Multi-Variate Normal density of the $p$ dimensional vector $\vec\ten{X}$ with mean $\vec\mu$ and covariance $\mat{\Delta}$.
\end{proof}
When sampling from the Multi-Array Normal, one option is to sample from the Multi-Variate Normal and then reshape the result, but this is usually very inefficient because it requires storing the full multi-variate covariance matrix, which is very big. Instead, it is more efficient to sample $\ten{Z}$ as a tensor of the same shape as $\ten{X}$ with standard normal entries and then transform $\ten{Z}$ to follow the Multi-Array Normal as follows
\begin{displaymath}
\ten{Z}\sim\mathcal{TN}(0, \mat{I}_{p_1}, ..., \mat{I}_{p_r})
\quad\Rightarrow\quad
\ten{X} = \ten{Z}\times\{\mat{\Delta}_1^{1/2}, ..., \mat{\Delta}_r^{1/2}\} + \mu\sim\mathcal{TN}(\mu, \mat{\Delta}_1, ..., \mat{\Delta}_r).
\end{displaymath}
where the sampling from the standard Multi-Array Normal is done by sampling all of the elements of $\ten{Z}$ from a standard Normal.
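A minimal sketch of this sampling scheme, assuming \texttt{numpy} and \texttt{scipy} are available, \texttt{mu} is an array of shape $p_1\times ...\times p_r$ and \texttt{Deltas} is the list of mode covariance matrices (the function name \texttt{sample\_tensor\_normal} is purely illustrative):
\begin{verbatim}
import numpy as np
from scipy.linalg import sqrtm

def sample_tensor_normal(n, mu, Deltas, seed=None):
    # Draw n samples from TN(mu, Delta_1, ..., Delta_r) without ever forming
    # the p x p Kronecker covariance matrix.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n,) + mu.shape)   # standard Multi-Array Normal samples
    for k, Delta in enumerate(Deltas):
        root = np.real(sqrtm(Delta))           # symmetric square root Delta_k^{1/2}
        # k-mode product with Delta_k^{1/2}; axis 0 of X indexes the samples
        X = np.moveaxis(np.tensordot(X, root, axes=([k + 1], [1])), -1, k + 1)
    return X + mu
\end{verbatim}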
\end{document}