We consider regression and classification for a \textit{general} response and tensor-valued predictors (multi-dimensional arrays) and propose a \textit{novel formulation} for sufficient dimension reduction. Assuming the distribution of the tensor-valued predictors given the response is in the quadratic exponential family, we model the natural parameter as a multi-linear function of the response.
This allows per-axis reductions that drastically reduce the total number of parameters for higher-order tensor-valued predictors. We derive maximum likelihood estimates for the sufficient dimension reduction and a computationally efficient estimation algorithm that leverages the tensor structure. The performance of the method is illustrated via simulations, and real-world examples are provided.
Vectors are written as boldface lowercase letters (e.g. $\mat a$, $\mat b$), matrices use boldface uppercase or Greek letters (e.g. $\mat A$, $\mat B$, $\mat\alpha$, $\mat\Delta$). The identity matrix of dimensions $p\times p$ is denoted by $\mat{I}_p$ and the commutation matrix as $\mat{K}_{p, q}$, or $\mat{K}_p$ in case of $p = q$. Tensors, meaning multi-dimensional arrays of order at least 3, use uppercase calligraphic letters (e.g. $\ten{A}$, $\ten{B}$, $\ten{X}$, $\ten{Y}$, $\ten{F}$). Boldface indices (e.g. $\mat{i}, \mat{j}, \mat{k}$) denote multi-indices $\mat{i}=(i_1, ..., i_r)\in[\mat{q}]$, where the bracket notation is a shorthand for $[r]=\{1, ..., r\}$ and, in conjunction with a multi-index as argument, means $[\mat{q}]=[q_1]\times ... \times[q_r]=\{(i_1, ..., i_r)\in\mathbb{N}^r : 1\leq i_k\leq q_k, \forall k =1, ..., r \}$.
Let $\ten{A}=(a_{i_1,...,i_r})\in\mathbb{R}^{q_1\times ...\times q_r}$ be an order\footnote{Also called rank, hence the variable name $r$, but this term is \emph{not} used as it leads to confusion with the rank as in ``the rank of a matrix''.} $r$ tensor, where $r\in\mathbb{N}$ is the number of modes or axes of $\ten{A}$. For matrices $\mat{B}_k\in\mathbb{R}^{p_k\times q_k}$ with $k\in[r]=\{1, 2, ..., r\}$ the \emph{multi-linear multiplication} is defined element-wise as
which results in an order $r$ tensor of dimensions $p_1\times ...\times p_r$. With this, the \emph{$k$-mode product} between the tensor $\ten{A}$ and the matrix $\mat{B}_k$ is given by
Furthermore, the notation $\ten{A}\times_{k\in S}$ is a shorthand for the iterative application of the mode product for all indices in $S\subseteq[r]$. For example, $\ten{A}\times_{k\in\{2, 5\}}\mat{B}_k =\ten{A}\times_2\mat{B}_2\times_5\mat{B}_5$. By only allowing $S$ to be a set, this notation is unambiguous because mode products commute for different modes; that is, $j\neq k$ implies $\ten{A}\times_j\mat{B}_j\times_k\mat{B}_k =\ten{A}\times_k\mat{B}_k\times_j\mat{B}_j$.
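As an illustration, the following minimal NumPy sketch implements the $k$-mode product and the multi-linear multiplication (the function names \texttt{mode\_prod} and \texttt{mlm} are ours; modes are zero-based in code):
\begin{verbatim}
import numpy as np

def mode_prod(A, B, k):
    # k-mode product A x_k B: contract mode k of A (length q_k) with the
    # columns of B (shape p_k x q_k); the result carries p_k on mode k.
    C = np.tensordot(B, A, axes=([1], [k]))  # p_k becomes the leading mode
    return np.moveaxis(C, 0, k)

def mlm(A, Bs, modes=None):
    # multi-linear multiplication: apply B_k along mode k for all given modes
    modes = range(len(Bs)) if modes is None else modes
    for B, k in zip(Bs, modes):
        A = mode_prod(A, B, k)
    return A
\end{verbatim}
For example, \texttt{mlm(A, [B1, B2, B3])} computes $\ten{A}\times_1\mat{B}_1\times_2\mat{B}_2\times_3\mat{B}_3$ for an order $3$ tensor \texttt{A}.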
with which the \emph{Frobenius norm} is $\|\ten{A}\|_F =\sqrt{\langle\ten{A}, \ten{A}\rangle}$. Of interest is also the \emph{maximum norm} $\|\ten{A}\|_{\infty}=\max_{i_1, ..., i_r} |a_{i_1, ..., i_r}|$. The Frobenius and maximum norms are also used for matrices, while for a vector $\mat{a}$ the \emph{2-norm} is $\|\mat{a}\|_2=\sqrt{\langle\mat{a}, \mat{a}\rangle}$.
Matrices and tensors can be \emph{vectorized} by the \emph{vectorization} operator $\vec$. Sometimes only the order of the elements in an object is of interest; therefore, we use the notation $\ten{A}\equiv\ten{B}$ for objects $\ten{A}, \ten{B}$ of any shape if and only if $\vec{\ten{A}}=\vec{\ten{B}}$. For tensors of order at least $2$ the \emph{flattening} (or \emph{unfolding} or \emph{matricization}) is a reshaping of the tensor into a matrix along a particular mode. For a tensor $\ten{A}$ of order $r$ and dimensions $q_1, ..., q_r$ the $k$-mode unfolding $\ten{A}_{(k)}$ is a $q_k\times\prod_{l=1, l\neq k}q_l$ matrix. For the tensor $\ten{A}=(a_{i_1,...,i_r})\in\mathbb{R}^{q_1\times ...\times q_r}$ the elements of the $k$-mode unfolding $\ten{A}_{(k)}$ are
The rank of a tensor $\ten{A}$ of dimensions $q_1\times ...\times q_r$ is given by a vector $\rank{\ten{A}}=(a_1, ..., a_r)\in[q_1]\times...\times[q_r]$, where $a_k =\rank(\ten{A}_{(k)})$ is the usual matrix rank of the $k$-mode unfolding.
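Continuing the NumPy sketch above, one common convention for the unfolding (matching the column-major vectorization used throughout; the exact column ordering of $\ten{A}_{(k)}$ may differ from the display above) and the resulting vector of mode-wise ranks reads:
\begin{verbatim}
def unfold(A, k):
    # k-mode unfolding A_(k): a (q_k x prod_{l != k} q_l) matrix whose columns
    # are the mode-k fibers of A, remaining modes ordered column-major.
    return np.reshape(np.moveaxis(A, k, 0), (A.shape[k], -1), order='F')

def mlrank(A):
    # vector of mode-wise ranks, rank(A_(k)) for every mode k
    return tuple(np.linalg.matrix_rank(unfold(A, k)) for k in range(A.ndim))
\end{verbatim}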
The main relation for sufficient dimension reduction under an inverse regression setting is given by the following. Let $(\ten{X}, Y)$ be jointly distributed and let $\mat{R}$ be a measurable function of $\ten{X}$, that is, a statistic. Then
This means that a sufficient statistic $\mat{R}(\ten{X})$ for $Y$ in the inverse regression $\ten{X}\mid Y$ yields a sufficient reduction for $\ten{X}$ in the forward regression $Y\mid\ten{X}$.
Let $(\ten{X}, Y)$ be jointly distributed with $\ten{X}$ a random order $r$ tensor of dimensions $p_1\times ...\times p_r$ and $Y$ arbitrary, as we only require a known tensor-valued function of $Y$, denoted $\ten{F}_Y$, of order $r$ and dimensions $q_1\times ... \times q_r$. Furthermore, we assume $\E\ten{F}_Y =\mat{0}$ from now on and denote $p =\prod_{k =1}^r p_k$ and $q =\prod_{k =1}^r q_k$ as well as $\mat{p}=(p_1, ..., p_r)$ and $\mat{q}=(q_1, ..., q_r)$.
We assume that $\ten{X}\mid Y$ follows a quadratic exponential family distribution. The restriction to a \emph{quadratic} exponential family serves to properly model the tensor structure of $\ten{X}$. The pdf (pmf) with parameters $\mat{\theta}_y$ and natural parameters $\mat{\eta}_y =\mat{\eta}(\mat{\theta}_y)$ is given by
Under \eqref{eq:exp-family} the first natural parameter vector $\mat{\eta}_{y,1}$ is a $p$ dimensional vector while the second natural parameter vector $\mat{\eta}_{y,2}$ consists of $p^2$ elements. This is not the most compact form, as it is possible to use only $p(p +1)/2$ values by taking $\vech(\vec(\ten{X})\t{\vec(\ten{X})})$ as the statistic for the second natural parameter. The reason for choosing $\mat{t}_2(\ten{X})=\vec(\ten{X})\otimes\vec(\ten{X})$ instead is technical: it simplifies modeling the tensor structure of $\ten{X}$.
In analogy to a GLM we model the natural parameters as a multi-linear function of $\ten{F}_y$, which is a tensor of dimensions $q_1\times ... \times q_r$ of known functions of $y$ such that $\E_Y\ten{F}_Y =0$.
where $\overline{\ten{\eta}}_1$ is an intercept parameter of the same shape as $\ten{X}$, while the linear predictors $\mat{\alpha}_k$, for $k =1, ..., r$, are unconstrained $p_k\times q_k$ dimensional matrices. The scalar $c_1$ is known and eases modeling for specific distributions. The second natural parameter vector $\mat{\eta}_{y,2}$ is assumed to have the form
In a classical GLM we also have an invertible link function $\mat{g}$ connecting the natural parameters to the expectation of the exponential family statistics as $\mat{\eta}_y =\mat{g}(\E[\mat{t}(\ten{X})\mid Y = y])$. Such a link may not exist, but for our purposes it suffices to have the ``inverse'' link. The ``inverse'' link $\invlink$ does exist as the natural parameters fully describe the distribution,
which we also split into two parts as $\invlink_1(\mat{\eta}_y)=\E[\mat{t}_1(\ten{X})\mid Y = y]\equiv\E[\ten{X}\mid Y = y]$ and $\invlink_2(\mat{\eta}_y)=\E[\mat{t}_2(\ten{X})\mid Y = y]=\E[\vec(\ten{X})\otimes\vec(\ten{X})\mid Y = y]$.
We denote this GLM-like model as the \emph{generalized multi-linear model} (GMLM).
\begin{theorem}[SDR]\label{thm:sdr}
A sufficient reduction for the regression $Y\mid\ten{X}$ under the quadratic exponential family inverse regression model \eqref{eq:exp-family} is given by
for a $p\times q$ dimensional matrix $\mat{\beta}=\bigotimes_{k = r}^{1}\mat{\alpha}_k$ which satisfies $\Span(\mat{\beta})=\Span(\{\mat{\eta}_{Y,1}-\E_{Y}\mat{\eta}_{Y,1} : Y\in\mathcal{S}_Y\})$.
A straightforward idea for parameter estimation is gradient descent. For a purely algorithmic speedup, changing only the update rule but \emph{not} the gradient $\nabla_{\mat{\Theta}}l$ of the objective function, we use Nesterov accelerated gradient descent with an internal line search to determine the step size in each iteration, see Algorithm~\ref{alg:NAGD}.
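A generic sketch of this optimization scheme, not Algorithm~\ref{alg:NAGD} itself, may look as follows (plain Nesterov extrapolation with a backtracking line search on a flat parameter vector; the constants and the stopping rule are placeholders):
\begin{verbatim}
import numpy as np

def nag_minimize(loss, grad, theta0, step0=1.0, shrink=0.5, max_iter=200):
    # Nesterov accelerated gradient descent with backtracking line search.
    # `loss`/`grad` evaluate the objective (negative log-likelihood) and its
    # gradient for a flat parameter vector `theta`.
    theta, theta_prev = theta0.copy(), theta0.copy()
    for it in range(1, max_iter + 1):
        y = theta + (it - 1) / (it + 2) * (theta - theta_prev)  # extrapolation
        f_y, g = loss(y), grad(y)
        step = step0
        while loss(y - step * g) > f_y - 0.5 * step * np.sum(g * g):
            step *= shrink                                      # line search
        theta_prev, theta = theta, y - step * g
    return theta
\end{verbatim}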
Algorithm~\ref{alg:NAGD} requires initial parameter estimates. We choose a simple heuristic approach by starting with $\mat{\Omega}_k^{(0)}=\mat{I}_{p_k}$ and $\overline{\ten{\eta}}_1^{(0)}$ depending on the particular distribution. For example, for a Tensor Normal distribution the sample mean is a good choice, while for the Ising model $0$ has nice properties. The interesting part is the initialization of the multi-linear predictors $\mat{\alpha}_k$. Let $\mat{\Sigma}_k =\frac{q_k}{n q}\sum_{i =1}^n {\ten{F}_{y_i}}_{(k)}\t{{\ten{F}_{y_i}}_{(k)}}$ and $\mat{\Delta}_k =\frac{p_k}{n p}\sum_{i =1}^n {\ten{X}_{i}}_{(k)}\t{{\ten{X}_{i}}_{(k)}}$ be mode-wise sample covariance estimates. For both we take the rank $s_k =\min(p_k, q_k)$ SVD approximations $\mat{\Delta}_k \approx\mat{U}_k\mat{D}_k\t{\mat{U}_k}$ and $\mat{\Sigma}_k \approx\mat{V}_k\mat{S}_k\t{\mat{V}_k}$. The approximate SVD gives us the $s_k$ leading singular vectors and values with dimensions $\mat{U}_k$ ($p_k\times s_k$), $\mat{D}_k$ ($s_k\times s_k$) and $\mat{V}_k$ ($q_k\times s_k$), $\mat{S}_k$ ($s_k\times s_k$). Then we initialize
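In code, the quantities entering this initialization can be computed as in the following NumPy sketch (reusing \texttt{unfold} from above; the combination of the returned factors into $\mat{\alpha}_k^{(0)}$ follows the display above and is not repeated here):
\begin{verbatim}
def init_factors(Xs, Fys, k):
    # Mode-wise second-moment estimates Delta_k, Sigma_k and their rank-s_k
    # factors; Xs has shape (n, p_1, ..., p_r), Fys has shape (n, q_1, ..., q_r).
    n = Xs.shape[0]
    p_k, q_k = Xs.shape[k + 1], Fys.shape[k + 1]
    p, q = np.prod(Xs.shape[1:]), np.prod(Fys.shape[1:])
    Delta_k = (p_k / (n * p)) * sum(unfold(X, k) @ unfold(X, k).T for X in Xs)
    Sigma_k = (q_k / (n * q)) * sum(unfold(F, k) @ unfold(F, k).T for F in Fys)
    s_k = min(p_k, q_k)
    U, D, _ = np.linalg.svd(Delta_k)   # Delta_k ~ U diag(D) U^T
    V, S, _ = np.linalg.svd(Sigma_k)   # Sigma_k ~ V diag(S) V^T
    return (U[:, :s_k], D[:s_k]), (V[:, :s_k], S[:s_k])
\end{verbatim}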
We illustrate the SDR method on two special cases, first the Tensor Normal distribution and second the Multi-Variate Bernoulli distribution, with vector-, matrix- and tensor-valued predictors.
Let $\ten{X}, \ten{F}_y$ be order $r$ tensors of dimensions $p_1\times ... \times p_r$ and $q_1\times ... \times q_r$, respectively. We assume the inverse regression model for $\ten{X}\mid Y = y$ to be tensor normal distributed with density
with location parameter tensor $\ten{\mu}_y$ depending on $y$ and symmetric covariance matrices $\mat{\Delta}_{k}$, one for each of the $k\in[r]$ modes (independent of $y$), collected in the parameter vector $\mat{\theta}_y =(\ten{\mu}_y, \mat{\Delta}_1, ..., \mat{\Delta}_r)$. Rewriting into the form of a quadratic exponential family leads to
The intercept parameter $\overline{\ten{\eta}}_1$ is of the same dimensions as $\ten{X}$ and the reduction matrices $\mat{\alpha}_j$ are of dimensions $p_j\times q_j$ while the symmetric $\mat{\Omega}_j$ are of dimensions $p_j\times p_j$. The inverse relation from the GLM parameters to the tensor normal parameters is
Note that the diagonal elements of the $\mat{\Omega}_k$ are multiplied by $0$, which means they are ignored. This reflects the fact that under the Ising model (in general, for the multivariate Bernoulli) $\E[\ten{X}_{\mat{j}}\mid Y] =\E[\ten{X}_{\mat{j}}\ten{X}_{\mat{j}}\mid Y]$ holds, due to the entries being only $0$ or $1$. Therefore, our model is over-parameterized, as the diagonal elements of the $\mat{\Omega}_k$ and $\overline{\ten{\eta}}_1$ serve the same purpose.
$\ten{X}$ is a $2\times3\times5$ tensor, $y\in\{1, 2, ..., 6\}$ is uniformly distributed and $\ten{F}_y$ is a $1\times2\times3$ tensor with $0$ or $1$ entries, $(\vec{\ten{F}_y})_j =\delta_{y,j}$. The GMLM parameters are simply $\overline{\ten{\eta}}_1=0$, the matrices $\mat{\alpha}_k$ are filled with independent standard normal entries and the $\mat{\Omega}_k$ are $AR_{p_k}(0.5)$ matrices. Then $\ten{X}\mid Y = y$ is drawn from a Tensor Normal with mean $\ten{\mu}_y =\ten{F}_y\times_{k\in[3]}\mat{\Omega}_k^{-1}\mat{\alpha}_k$ and covariances $\mat{\Omega}_k^{-1}$.
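A sketch of this simulation setup (assuming the $AR_{p_k}(0.5)$ structure means $(\mat{\Omega}_k)_{ij}=0.5^{|i-j|}$, and reusing \texttt{mlm} from above):
\begin{verbatim}
rng = np.random.default_rng(0)
p, q = (2, 3, 5), (1, 2, 3)

Omegas = [0.5 ** np.abs(np.subtract.outer(np.arange(pk), np.arange(pk)))
          for pk in p]                                    # AR(0.5) matrices
alphas = [rng.standard_normal((pk, qk)) for pk, qk in zip(p, q)]

def sample_X(y):
    # F_y is one-hot over its 6 entries (column-major vec index y)
    F_y = np.reshape(np.eye(1, np.prod(q), y - 1), q, order='F')
    mu_y = mlm(F_y, [np.linalg.solve(Om, al) for Om, al in zip(Omegas, alphas)])
    # transform a standard normal tensor to mode covariances Omega_k^{-1}
    roots = [np.linalg.cholesky(np.linalg.inv(Om)) for Om in Omegas]
    return mu_y + mlm(rng.standard_normal(p), roots)
\end{verbatim}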
We start by computing the \emph{observed information matrix} $\mathcal{J}(\mat{\Theta})$ at parameters $\mat{\Theta}=(\overline{\ten{\eta}}_1$, $\mat{\alpha}_1$, $...$, $\mat{\alpha}_r$, $\mat{\Omega}_1$, $...$, $\mat{\Omega}_r)$, given as a block matrix consisting of $(2 r +1)^2$ blocks (although only $(2 r +1)(r +1)$ unique blocks by the block symmetry $\mathcal{J}_{i, j}=\t{\mathcal{J}_{j, i}}$)
For example, $\mathcal{J}_{1,2}=-\frac{\partial^2 l(\mat{\Theta})}{\partial(\vec{\overline{\ten{\eta}}_1})\partial\t{(\vec{\mat{\alpha}_1})}}$ and $\mathcal{J}_{2r +1, 2r +1}=-\H l(\mat{\Omega}_r)$.
We start by restating the log-likelihood for a given single observation $(\ten{X}, y)$, where $\ten{F}_y$ is given by
Note that $\ten{D}_{1}$ is of order $r$, $\ten{D}_{2}$ and $\ten{H}_{1,1}$ are tensors of order $2 r$, $\ten{H}_{1,2}$ and $\ten{H}_{2,1}$ have order $3 r$ and $\ten{H}_{2,2}$ is of order $4 r$.
Now we rewrite all the above differentials to extract the derivatives one at a time with extensive usage of the identities from Theorem~\ref{thm:mlm_mat} and Theorem~\ref{thm:mtvk_rearrange}. We start with the first order differentials
This concludes the computation of the observed information matrix $\mathcal{J}(\mat{\Theta})$, as all blocks (or their transposed counterparts) are computed. To get the Fisher information we continue by computing the expectation of the individual blocks $\mathcal{I}_{i,j}=\E[\mathcal{J}_{i,j}\mid Y = y]$. Taking a closer look at the individual blocks shows that only a few of them are random, as only $\frac{\partial^2 l}{\partial(\vec{\mat{\alpha}_j})\partial\t{(\vec{\mat{\alpha}_l})}}$ and $\frac{\partial^2 l}{\partial(\vec{\mat{\Omega}_j})\partial\t{(\vec{\mat{\Omega}_l})}}$ for $j \neq l$ depend on $\ten{X}$. This means the expectation needs to be calculated only for those blocks, because all other blocks are constant with regard to the conditional expectation. Therefore, we start by computing the expectation of some subexpressions encountered in these blocks.
The \emph{matricization} is a generalization of the \emph{vectorization} operation, and the \emph{mode product} generalizes the matrix-matrix product, which is used in the \emph{multi-linear multiplication}. In this section we provide some relations between these operations in conjunction with the Kronecker product.
\begin{theorem}\label{thm:mlm_mat}
Let $\ten{A}$ be a tensor of order $r$ with dimensions $q_1\times ... \times q_r$. Furthermore, for $k =1, ..., r$, let $\mat{B}_k$ be matrices of dimensions $p_k\times q_k$. Then, for any $(\mat{i}, \mat{j})\in\perm{r}$, it holds that
A special case is a matrix, i.e. an order 2 tensor, where the vectorization arises as a special case of the matricization, $\vec{\mat{A}}=\mat{A}_{((1, 2))}$ with $(\mat{i}, \mat{j})=((1, 2), ())\in\perm{2}$. Note that the empty Kronecker product is $1$ by convention.
Before we state another identity of interest which arises in matrix calculus we define a new operation similar in nature to the rearrangement operation in \cite{ApproxKron-VanLoanPitsianis1993}.
\begin{defn}
Let $\ten{A}$ be an order $2 r +2 s + t$ tensor, where $r > 0$ and $s, t \geq0$, of dimensions $q_1\times ... \times q_{2 r +2 s + t}$. Furthermore, let $(\mat{i}, \mat{j}, \mat{k})\in\perm{2 r +2 s + t}$ such that $\len{\mat{i}}=2 r$ and $\len{\mat{j}}=2 s$, with which the operation $\ten{R}_{\mat{i}, \mat{j}}$ is defined as
where both $\pi, \rho$ are \emph{inverse perfect outer shuffles} and the new dimension $\mat{d}$ is given by
\begin{displaymath}
\mat{d} = (q_{\mat{i}_1} q_{\mat{i}_{r + 1}}, ..., q_{\mat{i}_{r}} q_{\mat{i}_{2 r}}, q_{\mat{j}_1} q_{\mat{j}_{s + 1}}, ..., q_{\mat{j}_s} q_{\mat{j}_{2 s}}, q_{2 r + 2 s + 1}, ..., q_{2 r + 2 s + t}).
\end{displaymath}
In the case of $s =0$ we drop the second index and write $\ten{R}_{\mat{i}}$ because $\mat{j}$ is empty.
\end{defn}
The operation $\ten{R}_{\mat{i}, \mat{j}}(\ten{A})$ results in a tensor of order $r + s + t$, as the reshape operation collapses pairs of consecutive axes after shuffling while leaving the last $t$ axes untouched. Similarly, the single-index version $\ten{R}_{\mat{i}}(\ten{A})$ results in a tensor of order $r + t$.
\begin{center}
\begin{tikzpicture}[>=latex]
\foreach\t in {1,...,13}{
\node at (\t, 2) {$q_{\t}$};
}
\foreach\t [count=\f] in {1, 3, 5, 2, 4, 6}{
\node[label=above:$\mat{i}_\f$] (f) at (\f, 1) {$\f$};
\node[label=below:$\pi(\mat{i})_\t$] (t) at (\t, -1) {$\f$};
\draw[->] (f) to[out=270, in=90] (t);
}
\begin{scope}[xshift=6cm]
\foreach\t [count=\f, evaluate=\f as \ff using {int(\f + 6)}] in {1, 3, 2, 4}{
\node[label=above:$\mat{j}_\f$] (f) at (\f, 1) {$\ff$};
\node[label=below:$\sigma(\mat{j})_\t$] (t) at (\t, -1) {$\ff$};
\draw[->] (f) to[out=270, in=90] (t);
}
\end{scope}
\begin{scope}[xshift=10cm]
\foreach\t [count=\f, evaluate=\f as \ff using {int(\f + 10)}] in {1, 2, 3}{
\node[label=above:$\mat{k}_\f$] (f) at (\f, 1) {$\ff$};
\node[label=below:$\mat{k}_\t$] (t) at (\t, -1) {$\ff$};
\draw[->] (f) to[out=270, in=90] (t);
}
\end{scope}
\foreach\t [count=\f] in {1,4,2,5,3,6,7,9,8,10,11,12,13}{
\node (d-\f) at (\f, -2) {$q_{\t}$};
}
\foreach\l\r [count=\f] in {1/4,2/5,3/6,7/9,8/10}{
\node at (2*\f - 0.5, -2.75) {$q_{\l} q_{\r}$};
}
\foreach\t in {11,...,13}{
\node at (\t, -2.75) {$q_{\t}$};
}
\node[anchor=east] at (0.5, -2.75) {Dims. $\ten{R}_{\mat{i}, \mat{j}}(\ten{A})$};
\end{tikzpicture}
\end{center}
\begin{theorem}\label{thm:mtvk_rearrange}
Let $\ten{A}$ be an order $2 r + s$ tensor, where $r > 0$ and $s \geq0$, of dimensions $q_1\times ... \times q_{2 r + s}$. Furthermore, let $(\mat{i}, \mat{j})\in\perm{2 r + s}$ such that $\len{\mat{i}}=2 r$ and, for $k =1, ..., r$, let $\mat{B}_k$ be matrices of dimensions $q_{\mat{i}_{k}}\times q_{\mat{i}_{r + k}}$. Then
A special case of the above theorem is given for tensors represented as a Kronecker product. To this end, let $\mat{A}_k, \mat{B}_k$ be arbitrary matrices of size $p_k\times q_k$ for $k =1, ..., r$ and $\ten{A}=\reshape{(\mat{p}, \mat{q})}\bigotimes_{k = r}^{1}\mat{A}_k$. Then Theorem~\ref{thm:mtvk_rearrange} specializes to
The \emph{duplication matrix} $\mat{D}_p$ of dimensions $p^2\times p(p +1)/2$ is defined implicitly such that for any symmetric $p\times p$ matrix $\mat{A}$ it holds that
\begin{displaymath}
\mat{D}_p\vech\mat{A} = \vec{\mat{A}}.
\end{displaymath}
Let $\mat{A}$ be a $p\times q$ matrix; then we denote by the \emph{commutation matrix} $\mat{K}_{p,q}$ the $p q\times p q$ matrix satisfying
\begin{displaymath}
\mat{K}_{p,q}\vec\mat{A} = \vec{\t{\mat{A}}}.
\end{displaymath}
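To make these definitions concrete, the following NumPy sketch constructs small instances of the duplication and commutation matrices and numerically checks their defining identities (the constructions and names are ours):
\begin{verbatim}
import numpy as np

def vec(A):                       # column-major vectorization
    return A.reshape(-1, order='F')

def dup(p):                       # duplication matrix D_p: D_p vech(A) = vec(A)
    D = np.zeros((p * p, p * (p + 1) // 2))
    col = 0
    for j in range(p):            # vech stacks the lower-triangle columns
        for i in range(j, p):
            D[i + j * p, col] = D[j + i * p, col] = 1.0
            col += 1
    return D

def comm(p, q):                   # commutation matrix K_{p,q}: K vec(A) = vec(A^T)
    K = np.zeros((p * q, p * q))
    for i in range(p):
        for j in range(q):
            K[j + i * q, i + j * p] = 1.0
    return K

p, q = 3, 4
A = np.random.default_rng(0).standard_normal((p, q))
S = A[:, :p] + A[:, :p].T                  # a symmetric p x p matrix
vech_S = S.T[np.triu_indices(p)]           # column-wise lower triangle of S
assert np.allclose(dup(p) @ vech_S, vec(S))
assert np.allclose(comm(p, q) @ vec(A), vec(A.T))
\end{verbatim}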
The identity giving the commutation matrix its name is the commutation of the factors of a Kronecker product: for matrices $\mat{A}$ and $\mat{B}$ of dimensions $p\times q$ and $r\times s$, respectively,
\begin{displaymath}
\mat{K}_{r,p}(\mat{A}\otimes\mat{B})\mat{K}_{q,s} = \mat{B}\otimes\mat{A}.
\end{displaymath}
For a generalization of the commutation matrix let $\ten{A}$ be a $p_1\times ...\times p_r$ tensor of order $r$. Then the \emph{generalized commutation matrix} $\mat{K}_{(p_1, ..., p_r),(j)}$ is implicitly defined such that
for every mode $j \in[r]$. This is a direct generalization of the commutation matrix with the special case $\mat{K}_{(p,q),(2)}=\mat{K}_{p,q}$ and the trivial case $\mat{K}_{(p_1, ..., p_r),(1)}=\mat{I}_{p}$ for $p =\prod_{j=1}^r p_j$. Furthermore, with a dimension vector $\mat{p}=(p_1, ..., p_r)$ it is convenient to write $\mat{K}_{(p_1, ..., p_r),(j)}$ as $\mat{K}_{\mat{p},(j)}$. Its relation to the classical commutation matrix is given by
for arbitrary matrices $\mat{A}_k$ of dimensions $p_k\times q_k$, $k \in[r]$, which are collected in the dimension vectors $\mat{p}=(p_1, ..., p_r)$ and $\mat{q}=(q_1, ..., q_r)$. Next, the \emph{symmetrizer} $\mat{N}_p$ is a $p^2\times p^2$ matrix such that for any $p\times p$ matrix $\mat{A}$
\begin{displaymath}
\mat{N}_p\vec{\mat{A}} = \tfrac{1}{2}\vec(\mat{A} + \t{\mat{A}}).
\end{displaymath}
Another useful matrix is the \emph{selection matrix} $\mat{S}_p$ of dimensions $p\times p^2$, which selects the diagonal elements of a $p\times p$ matrix $\mat{A}$ from its vectorization,
\begin{displaymath}
\mat{S}_p\vec{\mat{A}} = \diag{\mat{A}}
\end{displaymath}
where $\diag{\mat{A}}$ denotes the vector of diagonal elements of $\mat{A}$.
For two matrices $\mat A$ of dimensions $a_1\times a_2$ and $\mat B$ of dimensions $b_1\times b_2$ it holds that
\begin{equation}\label{eq:vecKron}
\vec(\mat A\otimes\mat B) = (\mat{I}_{a_2}\otimes\mat{K}_{b_2,a_1}\otimes\mat{I}_{b_1})(\vec\mat A\otimes\vec\mat B).
\end{equation}
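Continuing the sketch above (reusing \texttt{vec}, \texttt{comm} and the test objects \texttt{A}, \texttt{p}, \texttt{q}), the symmetrizer, the selection matrix and the identity \eqref{eq:vecKron} can be checked numerically as well:
\begin{verbatim}
N = 0.5 * (np.eye(p * p) + comm(p, p))       # symmetrizer N_p
B = A[:, :p]                                 # an arbitrary p x p matrix
assert np.allclose(N @ vec(B), vec(0.5 * (B + B.T)))

Sel = np.zeros((p, p * p))                   # selection matrix S_p (p x p^2)
Sel[np.arange(p), np.arange(p) * (p + 1)] = 1.0
assert np.allclose(Sel @ vec(B), np.diag(B))

C = np.random.default_rng(1).standard_normal((2, 5))   # b_1 x b_2 matrix
a1, a2, b1, b2 = A.shape + C.shape
lhs = vec(np.kron(A, C))
rhs = np.kron(np.eye(a2), np.kron(comm(b2, a1), np.eye(b1)))
rhs = rhs @ np.kron(vec(A), vec(C))
assert np.allclose(lhs, rhs)                 # checks eq. (vecKron)
\end{verbatim}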
\begin{example}[Derivative with respect to a symmetric matrix]
Suppose we have a function $\mat{F}(\mat{X})$ where $\mat{X}$ is a symmetric $p\times p$ matrix, and we want to find the derivative of $\mat{F}$ with respect to $\mat{X}$ under the symmetry constraint. Since for symmetric $\mat{X}$ we have $\vec{\mat{X}}=\mat{D}_p\vech{\mat{X}}$, and $\mat{F}$ is only a function of $\vech{\mat{X}}$, the differential is
Via an explicit indexing of the half vectorization and the vectorization operation in terms of the matrix indices of $\mat{X}$, with $i(k, l)\in[p(p +1)/2]$ for column index $k$ and row index $l$ such that $1\leq k\leq l \leq p$, and $j(m, n)\in[p^2]$ for $m, n\in[p]$, we get the mappings
\begin{align*}
i(k, l) &= (k - 1)(p + 1) + 1 - \frac{k(k - 1)}{2} + l - k, \\
j(m, n) &= (n - 1) p + m,
\end{align*}
with which the entries of the Jacobian of the half vectorization with respect to $\mat{X}$ are
\begin{displaymath}
\left(\D(\vech{\mat{X}})(\mat{X})\right)_{i(k, l), j(m, n)} = \begin{cases}
1 &\text{iff }(k,l) = (m,n)\text{ or }(k,l) = (n,m), \\
0 &\text{else}.
\end{cases}
\end{displaymath}
This already reveals its explicit value, as the transposed duplication matrix $\t{\mat{D}_p}$ matches the same pattern, meaning
\begin{displaymath}
\D(\vech{\mat{X}})(\mat{X}) = \t{\mat{D}_p}.
\end{displaymath}
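For instance, for $p = 2$ we have $\vech{\mat{X}} = \t{(x_{11}, x_{21}, x_{22})}$ and
\begin{displaymath}
\mat{D}_2 = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix},
\qquad
\t{\mat{D}_2}\vec{\mat{X}} = \t{(x_{11},\; x_{21} + x_{12},\; x_{22})},
\end{displaymath}
so each off-diagonal element of $\mat{X}$ contributes twice, in accordance with the symmetry $x_{21} = x_{12}$.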
This means that, from a differential of the form $\d\mat{F}(\mat{Y})=\mat{G}(\mat{Y})\vec{\d\mat{Y}}$ where $\mat{Y}$ is an arbitrary square matrix, we can identify the derivative of $\mat{F}$ with respect to a symmetric $\mat{X}$ as
We want to find the derivative, with respect to any of the $r$ symmetric $p_j\times p_j$ matrices $\mat{\Omega}_j$ with $j =1, ..., r$, of the Kronecker product
which slightly simplifies the following. With this notation we have $p =\overline{p}_jp_j\underline{p}_j$ for any of the $j =1, ..., r$. Furthermore, the matrices $\overline{\mat{\Omega}}_j$ and $\underline{\mat{\Omega}}_j$ are of dimensions $\overline{p}_j\times\overline{p}_j$ and $\underline{p}_j\times\underline{p}_j$, respectively. We start with the differential
Let $X, Y$ be $p$ and $q$ dimensional random variables, respectively. Furthermore, let $\E X =\mu_X$, $\E Y =\mu_Y$ as well as $\cov(X)=\mat{\Sigma}_X$ and $\cov(Y)=\mat{\Sigma}_Y$. Then define the standardized random variables $Z_X =\mat{\Sigma}_X^{-1/2}(X -\mu_X)$ and $Z_Y =\mat{\Sigma}_Y^{-1/2}(Y -\mu_Y)$. For the standardized variables it holds that $\E Z_X =0_p$, $\E Z_Y =0_q$, and for the covariances we get $\cov(Z_X)=\mat{I}_p$ as well as $\cov(Z_Y)=\mat{I}_q$. Now we take a look at the cross-covariance between $X$ and $Y$,
is a sufficient reduction under the exponential family \eqref{eq:exp-family}, where $\mat{B}\in\mathbb{R}^{p(p +1)\times q}$ with $\Span(\mat{B})=\Span(\{\mat{\eta}_Y -\E_{Y}\mat{\eta}_Y : Y\in\mathcal{S}_Y\})$. With $\E_Y\mat{\eta}_{Y,1}\equiv c_1\E[\overline{\ten{\eta}}_1-\ten{F}_Y\times_{k\in[r]}\mat{\alpha}_k]= c_1\overline{\ten{\eta}}_1$, because $\E_Y\ten{F}_Y =0$, and with $\mat{\eta}_{y,2}$ not depending on $y$ (despite the notation), we get
Now note that $\Span(\mat{A})=\Span(c \mat{A})$ for any matrix $\mat{A}$ and non-zero scalar $c$, as well as the definition $\mat{t}_1(\ten{X})=\vec{\ten{X}}$, which concludes the proof.
\begin{theorem}[Tensor Normal via Multi-Variate Normal]
Let $\ten{X}$ be a multi-dimensional random variable of order $r$ with dimensions $p_1\times ...\times p_r$. Let $\ten{\mu}$ be the mean of the same order and dimensions as $\ten{X}$ and let $\mat{\Delta}_i$ be the mode covariance matrices of dimensions $p_i\times p_i$ for $i =1, ..., r$. Then the tensor normal distribution is equivalent to the multi-variate normal distribution by the relation
A straightforward way is to rewrite the Tensor Normal density as the density of a Multi-Variate Normal distribution depending on the vectorization of $\ten{X}$. First consider
Next, using a property of the determinant of a Kronecker product $|\mat{\Delta}_1\otimes\mat{\Delta}_2| = |\mat{\Delta}_1|^{p_2}|\mat{\Delta}_2|^{p_1}$ yields
with $\mat{\Delta}=\mat{\Delta}_r\otimes...\otimes\mat{\Delta}_1$ which is the Multi-Variate Normal density of the $p$ dimensional vector $\vec\ten{X}$ with mean $\vec\mu$ and covariance $\mat{\Delta}$.
One way to sample from the Multi-Array Normal is to sample from the Multi-Variate Normal and then reshape the result, but this is usually very inefficient because it requires storing the full $p\times p$ multi-variate covariance matrix. Instead, it is more efficient to sample $\ten{Z}$ as a tensor of the same shape as $\ten{X}$ with standard normal entries and then transform $\ten{Z}$ to follow the Multi-Array Normal as follows
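In code, this transformation might look as follows (a NumPy sketch reusing \texttt{mlm} from above; a Cholesky factor serves as the mode-wise square root $\mat{\Delta}_k^{1/2}$, which is one valid choice given the Kronecker structured covariance $\mat{\Delta}=\mat{\Delta}_r\otimes ...\otimes\mat{\Delta}_1$ derived above):
\begin{verbatim}
def sample_tensor_normal(mu, Deltas, rng):
    # One draw: Z i.i.d. standard normal with the shape of mu, then apply the
    # mode-wise roots, X = mu + Z x_1 L_1 ... x_r L_r with L_k L_k^T = Delta_k.
    Z = rng.standard_normal(mu.shape)
    return mu + mlm(Z, [np.linalg.cholesky(D) for D in Deltas])

# usage: sample_tensor_normal(mu, Deltas, np.random.default_rng(0))
\end{verbatim}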