Vectors are written as boldface lowercase letters (e.g. $\mat a$, $\mat b$), matrices use boldface uppercase or Greek letters (e.g. $\mat A$, $\mat B$, $\mat\alpha$, $\mat\Delta$). The identity matrix of dimensions $p\times p$ is denoted by $\mat{I}_p$ and the commutation matrix by $\mat{K}_{p, q}$, or $\mat{K}_p$ in case of $p = q$. Tensors, meaning multi-dimensional arrays of order at least 3, use uppercase calligraphic letters (e.g. $\ten{A}$, $\ten{B}$, $\ten{X}$, $\ten{Y}$, $\ten{F}$). Boldface indices (e.g. $\mat{i}, \mat{j}, \mat{k}$) denote multi-indices $\mat{i} = (i_1, ..., i_r)\in[\mat{d}]$, where the bracket notation is a shorthand for $[r] = \{1, ..., r\}$, which in conjunction with a multi-index $\mat{d} = (d_1, ..., d_r)$ as argument means $[\mat{d}] = [d_1]\times ... \times[d_r]$.
Let $\ten{A} = (a_{i_1,...,i_r})\in\mathbb{R}^{d_1\times ...\times d_r}$ be an order\footnote{Also called rank, therefore the variable name $r$, but this term is \emph{not} used as it leads to confusion with the rank as in ``the rank of a matrix''.} $r$ tensor where $r\in\mathbb{N}$ is the number of modes or axes of $\ten{A}$. For matrices $\mat{B}_k\in\mathbb{R}^{p_k\times d_k}$ with $k\in[r] = \{1, 2, ..., r\}$ the \emph{multi-linear multiplication} is defined element-wise as
\begin{displaymath}
(\ten{A}\times\{\mat{B}_1, ..., \mat{B}_r\})_{j_1, ..., j_r} = \sum_{i_1, ..., i_r = 1}^{d_1, ..., d_r} a_{i_1, ..., i_r}(B_{1})_{j_1, i_1} \cdots (B_{r})_{j_r, i_r}
\end{displaymath}
which results in an order $r$ tensor of dimensions $p_1\times ...\times p_r$. With this the \emph{$k$-mode product} between the tensor $\ten{A}$ and the matrix $\mat{B}_k$ is given by
\begin{displaymath}
\ten{A}\times_k\mat{B}_k = \ten{A}\times\{\mat{I}_{d_1}, ..., \mat{I}_{d_{k-1}}, \mat{B}_{k}, \mat{I}_{d_{k+1}}, ..., \mat{I}_{d_r}\}.
\end{displaymath}
Furthermore, the notation $\ten{A}\times_{k\in S}\mat{B}_k$ is a shorthand for the iterative application of the mode product for all indices in $S\subseteq[r]$. For example $\ten{A}\times_{k\in\{2, 5\}}\mat{B}_k = \ten{A}\times_2\mat{B}_2\times_5\mat{B}_5$. By only allowing $S$ to be a set this notation is unambiguous, because the mode products commute for different modes, $j\neq k\Rightarrow\ten{A}\times_j\mat{B}_j\times_k\mat{B}_k = \ten{A}\times_k\mat{B}_k\times_j\mat{B}_j$.
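For contrast, a short check of why $S$ has to consist of \emph{distinct} modes (a sketch; the second matrix $\mat{C}_k\in\mathbb{R}^{q_k\times p_k}$ is notation introduced here for illustration only). Two products along the same mode do not commute but compose,
\begin{displaymath}
((\ten{A}\times_k\mat{B}_k)\times_k\mat{C}_k)_{i_1, ..., l, ..., i_r}
= \sum_{j = 1}^{p_k}\sum_{i_k = 1}^{d_k} a_{i_1, ..., i_k, ..., i_r}(B_k)_{j, i_k}(C_k)_{l, j}
= (\ten{A}\times_k(\mat{C}_k\mat{B}_k))_{i_1, ..., l, ..., i_r},
\end{displaymath}
so the order of application matters on a shared mode, and the reversed order is in general not even defined dimension-wise.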
The \emph{inner product} between two tensors of the same order and dimensions is
\begin{displaymath}
\langle\ten{A}, \ten{B}\rangle = \sum_{i_1, ..., i_r} a_{i_1, ..., i_r}b_{i_1, ..., i_r}
\end{displaymath}
which induces the \emph{Frobenius norm} $\|\ten{A}\|_F = \sqrt{\langle\ten{A}, \ten{A}\rangle}$. Of interest is also the \emph{maximum norm} $\|\ten{A}\|_{\infty} = \max_{i_1, ..., i_r} |a_{i_1, ..., i_r}|$. Furthermore, the Frobenius and maximum norm are also used for matrices, while for a vector $\mat{a}$ the \emph{$2$-norm} is $\|\mat{a}\|_2 = \sqrt{\langle\mat{a}, \mat{a}\rangle}$.
Matrices and tensors can be \emph{vectorized} by the \emph{vectorization} operator $\vec$. For tensors of order at least $2$ the \emph{flattening} (or \emph{unfolding} or \emph{matricization}) is a reshaping of the tensor into a matrix along a particular mode. For a tensor $\ten{A}$ of order $r$ and dimensions $d_1, ..., d_r$ the $k$-mode unfolding $\ten{A}_{(k)}$ is a $d_k\times \prod_{l=1, l\neq k}^r d_l$ matrix. For the tensor $\ten{A} = (a_{i_1,...,i_r})\in\mathbb{R}^{d_1\times ...\times d_r}$ the elements of the $k$-mode unfolding $\ten{A}_{(k)}$ are
\begin{displaymath}
(\ten{A}_{(k)})_{i_k, j} = a_{i_1, ..., i_r}\quad\text{ with }\quad j = 1 + \sum_{\substack{l = 1\\l \neq k}}^r (i_l - 1) \prod_{\substack{m = 1\\m\neq k}}^{l - 1}d_m.
\end{displaymath}
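To illustrate the index map (a small worked instance with dimensions chosen here for illustration), let $\ten{A}$ be of order $r = 3$ with $(d_1, d_2, d_3) = (2, 3, 2)$ and take $k = 2$. The index formula then reduces to $j = 1 + (i_1 - 1) + 2(i_3 - 1)$, so that $\ten{A}_{(2)}$ is the $3\times 4$ matrix
\begin{displaymath}
\ten{A}_{(2)} = \begin{pmatrix}
a_{1,1,1} & a_{2,1,1} & a_{1,1,2} & a_{2,1,2} \\
a_{1,2,1} & a_{2,2,1} & a_{1,2,2} & a_{2,2,2} \\
a_{1,3,1} & a_{2,3,1} & a_{1,3,2} & a_{2,3,2}
\end{pmatrix},
\end{displaymath}
that is, within each row the remaining indices $(i_1, i_3)$ vary with $i_1$ running fastest.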
The rank of a tensor $\ten{A}$ of dimensions $d_1\times ...\times d_r$ is given by a vector $\rank(\ten{A}) = (a_1, ..., a_r)\in[d_1]\times...\times[d_r]$ where $a_k = \rank(\ten{A}_{(k)})$ is the usual matrix rank of the $k$-mode unfolding.
\section{Tensor Normal Distribution}
\todo{The group of orthogonal matrices $O(p)$ of dimensions $p\times p$, where $O(p, q)$ are the $p\times q$ matrices (a.k.a. the Stiefel manifold)}
Let $\ten{X}$ be a multi-dimensional array random variable of order $r$ with dimensions $p_1\times ... \times p_r$ following a tensor normal distribution, written as
\begin{displaymath}
\ten{X}\sim\mathcal{TN}(\mu, \mat{\Delta}_1, ..., \mat{\Delta}_r).
\end{displaymath}
Its density is given by
\begin{displaymath}
f(\ten{X}) = \Big( (2\pi)^{p}\prod_{i = 1}^r |\mat{\Delta}_i|^{p / p_i} \Big)^{-1/2}
\exp\!\left( -\frac{1}{2}\langle \ten{X} - \mu, (\ten{X} - \mu)\times\{\mat{\Delta}_1^{-1}, ..., \mat{\Delta}_r^{-1}\} \rangle \right)
\end{displaymath}
where $p = \prod_{i = 1}^r p_i$. This is equivalent to the vectorized $\vec\ten{X}$ following a Multi-Variate Normal distribution
\begin{displaymath}
\vec{\ten{X}}\sim\mathcal{N}_{p}(\vec{\mu}, \mat{\Delta}_r\otimes...\otimes\mat{\Delta}_1).
\end{displaymath}
\begin{theorem}[Tensor Normal to Multi-Variate Normal equivalence]
Let $\ten{X}$ be a multi-dimensional random variable of order $r$ with dimensions $p_1\times ...\times p_r$. Let $\ten{\mu}$ be the mean of the same order and dimensions as $\ten{X}$ and let the mode covariance matrices $\mat{\Delta}_i$ be of dimensions $p_i\times p_i$ for $i = 1, ..., r$. Then the tensor normal distribution is equivalent to the multi-variate normal distribution by the relation
\begin{displaymath}
\ten{X}\sim\mathcal{TN}(\mu, \mat{\Delta}_1, ..., \mat{\Delta}_r)
\qquad\Leftrightarrow\qquad
\vec{\ten{X}}\sim\mathcal{N}_{p}(\vec{\mu}, \mat{\Delta}_r\otimes ...\otimes \mat{\Delta}_1)
\end{displaymath}
where $p = \prod_{i = 1}^r p_i$.
\end{theorem}
\begin{proof}
A straightforward way is to rewrite the Tensor Normal density as the density of a Multi-Variate Normal distribution depending on the vectorization of $\ten{X}$. First consider
\begin{align*}
\langle \ten{X} - \mu, (\ten{X} - \mu)\times\{\mat{\Delta}_1^{-1}, ..., \mat{\Delta}_r^{-1}\} \rangle
&= \t{\vec(\ten{X} - \mu)}\vec((\ten{X} - \mu)\times\{\mat{\Delta}_1^{-1}, ..., \mat{\Delta}_r^{-1}\}) \\
&= \t{\vec(\ten{X} - \mu)}(\mat{\Delta}_r^{-1}\otimes ...\otimes\mat{\Delta}_1^{-1})\vec(\ten{X} - \mu) \\
&= \t{(\vec\ten{X} - \vec\mu)}(\mat{\Delta}_r\otimes ...\otimes\mat{\Delta}_1)^{-1}(\vec\ten{X} - \vec\mu).
\end{align*}
Next, using a property of the determinant of a Kronecker product $|\mat{\Delta}_1\otimes\mat{\Delta}_2| = |\mat{\Delta}_1|^{p_2}|\mat{\Delta}_2|^{p_1}$ yields
\begin{displaymath}
|\mat{\Delta}_r\otimes...\otimes\mat{\Delta}_1|
= |\mat{\Delta}_r\otimes...\otimes\mat{\Delta}_2|^{p_1}|\mat{\Delta}_1|^{p / p_1}
\end{displaymath}
where $p = \prod_{j = 1}^r p_j$. By induction over $r$ the relation
\begin{displaymath}
|\mat{\Delta}_r\otimes...\otimes\mat{\Delta}_1|
= \prod_{i = 1}^r |\mat{\Delta}_i|^{p / p_i}
\end{displaymath}
holds for arbitrary order $r$. Substituting into the Tensor Normal density leads to
\begin{displaymath}
f(\ten{X}) = \Big( (2\pi)^p |\mat{\Delta}_r\otimes...\otimes\mat{\Delta}_1| \Big)^{-1/2}
\exp\!\left( -\frac{1}{2}\t{(\vec\ten{X} - \vec\mu)}(\mat{\Delta}_r\otimes ...\otimes\mat{\Delta}_1)^{-1}(\vec\ten{X} - \vec\mu) \right)
\end{displaymath}
which is the Multi-Variate Normal density of the $p$ dimensional vector $\vec\ten{X}$.
\end{proof}
One way to sample from the Tensor Normal (also called Multi-Array Normal) is to sample from the Multi-Variate Normal and then reshape the result, but this is usually very inefficient because it requires storing the $p\times p$ multi-variate covariance matrix, which is very big. Instead, it is more efficient to sample $\ten{Z}$ as a tensor of the same shape as $\ten{X}$ with standard normal entries and then transform $\ten{Z}$ to follow the Tensor Normal as follows
\begin{align*}
\ten{Z} &\sim\mathcal{TN}(0, \mat{I}_{p_1}, ..., \mat{I}_{p_r}) \\
\ten{X} &= \ten{Z}\times\{\mat{\Delta}_1^{1/2}, ..., \mat{\Delta}_r^{1/2}\} + \mu\sim\mathcal{TN}(\mu, \mat{\Delta}_1, ..., \mat{\Delta}_r),
\end{align*}
where the sampling from the standard Tensor Normal is done by sampling all of the elements of $\ten{Z}$ independently from a standard Normal.
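A brief sketch of why this transformation yields the desired distribution, using the same vectorization identity as for the quadratic form in the density and assuming symmetric square roots $\mat{\Delta}_i^{1/2}$ (an assumption made here; any factors $\mat{L}_i$ with $\mat{L}_i\t{\mat{L}_i} = \mat{\Delta}_i$, e.g. Cholesky factors, would serve equally):
\begin{align*}
\vec\ten{X} &= (\mat{\Delta}_r^{1/2}\otimes ...\otimes\mat{\Delta}_1^{1/2})\vec\ten{Z} + \vec\mu, \\
\operatorname{Cov}(\vec\ten{X}) &= (\mat{\Delta}_r^{1/2}\otimes ...\otimes\mat{\Delta}_1^{1/2})\,\mat{I}_p\,\t{(\mat{\Delta}_r^{1/2}\otimes ...\otimes\mat{\Delta}_1^{1/2})}
= \mat{\Delta}_r\otimes ...\otimes\mat{\Delta}_1.
\end{align*}
Since $\vec\ten{X}$ is an affine transformation of the standard normal vector $\vec\ten{Z}$, it is multi-variate normal with mean $\vec\mu$ and the covariance above. Only the $r$ factors of sizes $p_i\times p_i$ are needed, never the full $p\times p$ matrix.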
\section{Kronecker Covariance Structure Model}\label{sec:kron_cov}
A straightforward idea for parameter estimation is to use Gradient Descent. For a purely algorithmic speedup, changing only the update rule but \emph{not} the gradient computation of the objective function, we use Nesterov Accelerated Gradient Descent as described in Section~\ref{sec:alg_gradient_descent}. An alternative approach applicable to all the methods is to resolve the cross dependence in the estimating equation systems by assuming all the other estimators to be fixed. This leads to an artificially created closed form solution for the current estimate, which is then used as its update. By cyclically iterating through all the parameters and repeating this process until convergence we get an alternative method as described in Section~\ref{sec:alg_iterative_updating}. In both cases initial estimates are needed for starting the iterative process, which is the subject of Section~\ref{sec:alg_init}.
In Section~\ref{sec:kron_cov} we derived the gradients for different objective functions, meaning parameterized functions serving as minimization targets. In Section~\ref{sec:ls} the objective function is the Frobenius norm of the estimated residuals for solving the Least Squares problem, in Section~\ref{sec:mle} it is the log-likelihood for finding the ML estimates, and in Section~\ref{sec:approx} it is a pseudo log-likelihood. Regardless of which estimates we want to find, denote with $l$ the minimization objective corresponding to the desired minimization problem and with $\mat{\Theta}$ the collection of all parameters of the objective. The classic gradient descent algorithm starts with initial values $\mat{\Theta}^{(0)}$, see Section~\ref{sec:alg_init}, and applies gradient updates with a given learning rate $\delta > 0$ until convergence. The algorithm used is an extension of the classic Gradient Descent algorithm, namely Nesterov Accelerated Gradient Descent. This algorithm performs gradient updates similar to Gradient Descent, but before the gradient is evaluated the current position is extrapolated in the direction of the previous step. Furthermore, an internal line search loop is used to determine an appropriate step size. See Algorithm~\ref{alg:gd} for the case of joint estimation of the parameter matrices $\widehat{\mat{\alpha}}_1, ..., \widehat{\mat{\alpha}}_r$ and covariances $\widehat{\mat{\Delta}}_1, ..., \widehat{\mat{\Delta}}_r$. In case the parameter matrices and the covariances are \emph{not} estimated together, as in the LS estimation, the parameter vector $\mat{\Theta}$ consists only of the parameter matrices $\mat{\alpha}_j$ (which is the only difference), and at the end of the algorithm the estimated parameter matrices can be used to estimate the covariances.
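The extrapolate-then-update step described above can be sketched as follows. The momentum weights $m_t\in[0, 1)$, the half-step notation $\mat{\Theta}^{(t + 1/2)}$ and the iteration dependent step size $\delta_t$ are notation assumed here for illustration and are not taken from Algorithm~\ref{alg:gd}:
\begin{align*}
\mat{\Theta}^{(t + 1/2)} &= \mat{\Theta}^{(t)} + m_t(\mat{\Theta}^{(t)} - \mat{\Theta}^{(t - 1)}), \\
\mat{\Theta}^{(t + 1)} &= \mat{\Theta}^{(t + 1/2)} - \delta_t\nabla_{\mat{\Theta}}\, l(\mat{\Theta}^{(t + 1/2)}).
\end{align*}
Setting $m_t = 0$ recovers the classic gradient descent update. One common form of the internal line search (again an assumption, not necessarily the rule used here) shrinks $\delta_t$ until $l(\mat{\Theta}^{(t + 1)}) \leq l(\mat{\Theta}^{(t + 1/2)}) - \frac{\delta_t}{2}\|\nabla_{\mat{\Theta}}\, l(\mat{\Theta}^{(t + 1/2)})\|_F^2$ holds.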
A refined version would be to always take the newest estimates. In the case of Algorithm~\ref{alg:iterative_updating} this means that when computing $\widehat{\mat{\alpha}}_j^{(t + 1)}$ we use $\widehat{\mat{\alpha}}_k^{(t + 1)}$ for $k = 1, ..., j - 1$ and $\widehat{\mat{\alpha}}_k^{(t)}$ for $k = j + 1, ..., r$ in line \ref{alg:iterative_updating:update} instead.

Furthermore, there is also the idea of randomizing the updating order, which seems to improve convergence and to stabilize the algorithm.
%%% Bib and Index %%%
% \printindex
%%% Appendix %%%
\section{Matrix Differential Rules}
Let $\mat A$ be a square matrix (and invertible if needed) and let $|\cdot|$ denote the determinant (with $|\mat A| > 0$ wherever its logarithm is taken). Then
\begin{align*}
\d\log|\mat A| &= \frac{1}{|\mat A|}\d|\mat{A}| \\
\d|\mat A| &= |\mat A|\tr \mat{A}^{-1}\d\mat A \\
\d\log|\mat A| &= \tr\mat{A}^{-1}\d\mat A \\
\d\mat{A}^{-1} &= -\mat{A}^{-1}(\d\mat{A})\mat{A}^{-1}
\end{align*}
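As a quick plausibility check of the determinant rule (a worked instance chosen here for illustration, not part of the original list), take the diagonal matrix $\mat{A} = \operatorname{diag}(a, b)$ with $a, b\neq 0$ and the perturbation $\d\mat{A} = \operatorname{diag}(\d a, \d b)$. Then $|\mat A| = a b$ and
\begin{displaymath}
\d|\mat A| = b\,\d a + a\,\d b = a b\Big(\frac{\d a}{a} + \frac{\d b}{b}\Big) = |\mat A|\tr \mat{A}^{-1}\d\mat A.
\end{displaymath}
Dividing by $|\mat A|$ (for $a b > 0$) gives $\d\log|\mat A| = \tr\mat{A}^{-1}\d\mat A$ in this special case.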
\section{Useful Matrix Identities}
In this section we summarize a few useful matrix identities, for more details see for example \cite{MatrixAlgebra-AbadirMagnus2005}.
For two matrices $\mat A$ of dimensions $a_1\times a_2$ and $\mat B$ of dimensions $b_1\times b_2$ the identity
\begin{displaymath}
\mat{K}_{b_1, a_1}(\mat{A}\otimes\mat{B})\mat{K}_{a_2, b_2} = \mat{B}\otimes\mat{A}
\end{displaymath}
holds, as well as
\begin{displaymath}
\vec(\mat A\kron\mat B) = (\mat{I}_{a_2}\kron\mat{K}_{b_2,a_1}\kron\mat{I}_{b_1})(\vec\mat A\kron\vec\mat B).
\end{displaymath}
Let $\mat A$ be a $p\times p$ non-singular matrix. Furthermore, let $\mat a, \mat b$ be $p$-dimensional vectors such that $\t{\mat b}\mat{A}^{-1}\mat a\neq -1$, then
\begin{displaymath}
(\mat A + \mat a\t{\mat b})^{-1} = \mat{A}^{-1} - \frac{1}{1 + \t{\mat b}\mat{A}^{-1}\mat a}\mat{A}^{-1}\mat{a}\t{\mat{b}}\mat{A}^{-1}
\end{displaymath}
as well as
\begin{displaymath}
\det(\mat A + \mat a\t{\mat b}) = \det(\mat A)(1 + \t{\mat b}{\mat A}^{-1}\mat a)
\end{displaymath}
which even holds in the case $\t{\mat b}\mat{A}^{-1}\mat a = -1$. The first identity is the Sherman--Morrison formula; the second is the matrix determinant lemma, a special case of Sylvester's determinant theorem.
\section{Commutation Matrix and Permutation Identities}
Note: In this section we use 0-indexing for the sake of simplicity!
In this section we summarize relations between the commutation matrix and corresponding permutation. We also list some extensions to ``simplify'' or represent some term. This is mostly intended for implementation purposes and understanding of terms occurring in the computations.
Let $\mat A$ be an arbitrary $p\times q$ matrix. The permutation matrix $\mat K_{p, q}$ satisfies
\begin{displaymath}
\mat{K}_{p, q}\vec{\mat{A}} = \vec{\t{\mat{A}}} \quad\Leftrightarrow\quad (\vec{\t{\mat{A}}})_{\pi_{p, q}(m)} = (\vec{\mat{A}})_{m}, \quad\text{for } m = 0, ..., p q - 1
\end{displaymath}
where $\pi_{p, q}$ is a permutation of the indices $0, ..., p q - 1$ such that
\begin{displaymath}
\pi_{p, q}(i + j p) = j + i q, \quad\text{for }i = 0, ..., p - 1; j = 0, ..., q - 1.
\end{displaymath}
\begin{table}
\centering
\begin{tabular}{l c l}
$\mat{K}_{p, q}$ & $\hat{=}$ & $\pi_{p, q}(i + j p) = j + i q$ \\
$\mat{I}_r\kron\mat{K}_{p, q}$ & $\hat{=}$ & $\tilde{\pi}_{p, q, r}(i + j p + k p q) = j + i q + k p q$ \\
$\mat{K}_{p, q}\kron\mat{I}_r$ & $\hat{=}$ & $\hat{\pi}_{p, q, r}(r(i + j p) + k) = r(j + i q) + k$
\end{tabular}
\caption{\label{tab:commutation-permutation}Commutation matrix terms and corresponding permutations. Indices are all 0-indexed with the ranges $i = 0, ..., p - 1$, $j = 0, ..., q - 1$ and $k = 0, ..., r - 1$.}
\end{table}
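For a concrete check of the basic permutation (a worked instance added for illustration), take $p = 2$, $q = 3$. The rule $\pi_{2, 3}(i + 2 j) = j + 3 i$ gives
\begin{displaymath}
(\pi_{2, 3}(0), ..., \pi_{2, 3}(5)) = (0, 3, 1, 4, 2, 5).
\end{displaymath}
With the $0$-indexed $2\times 3$ matrix $\mat{A} = (a_{i j})$ the two vectorizations are
\begin{displaymath}
\vec\mat{A} = \t{(a_{00}, a_{10}, a_{01}, a_{11}, a_{02}, a_{12})}, \qquad
\vec\t{\mat{A}} = \t{(a_{00}, a_{01}, a_{02}, a_{10}, a_{11}, a_{12})},
\end{displaymath}
so the entry at position $m$ of $\vec\mat{A}$ is found at position $\pi_{2, 3}(m)$ of $\vec\t{\mat{A}}$. For example $a_{10}$ sits at position $1$ of $\vec\mat{A}$ and at position $\pi_{2, 3}(1) = 3$ of $\vec\t{\mat{A}}$. In an implementation this means the commutation matrix never needs to be formed explicitly; an index permutation suffices.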
\section{Matrix and Tensor Operations}
The \emph{Kronecker product}\index{Operations!Kronecker@$\kron$ Kronecker product} is denoted as $\kron$ and the \emph{Hadamard product} uses the symbol $\circ$. We also need the \emph{Khatri-Rao product}\index{Operations!KhatriRao@$\hada$ Khatri-Rao product}
$\hada$ as well as the \emph{Transposed Khatri-Rao product} $\odot_t$ (or \emph{Face-Splitting product}). There is also the \emph{$n$-mode Tensor Matrix Product}\index{Operations!ttm@$\ttm[n]$ $n$-mode tensor product} denoted by $\ttm[n]$ in conjunction with the \emph{$n$-mode Matricization} of a tensor $\ten{T}$ written as $\ten{T}_{(n)}$, which is a matrix. See below for definitions and examples of these operations.\todo{ Definitions and Examples}
\todo{ resolve confusion between Khatri-Rao, Column-wise Kronecker / Khatri-Rao, Row-wise Kronecker / Khatri-Rao, Face-Splitting Product, .... Yes, its a mess.}
\paragraph{Kronecker Product $\kron$:}
\paragraph{Khatri-Rao Product $\hada$:}
\paragraph{Transposed Khatri-Rao Product $\odot_t$:} This is also known as the Face-Splitting Product and is the row-wise Kronecker product of two matrices. It relates to the Column-wise Kronecker Product through
\begin{displaymath}
\t{(\mat{A}\odot_{t}\mat{B})} = \t{\mat{A}}\hada\t{\mat{B}}.
\end{displaymath}
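A small numeric illustration of this relation, with matrices chosen here purely as an example and reading $\hada$ as the column-wise Kronecker product as stated above: for
\begin{displaymath}
\mat{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad
\mat{B} = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
\end{displaymath}
the row-wise Kronecker product takes the Kronecker product of corresponding rows,
\begin{displaymath}
\mat{A}\odot_{t}\mat{B} = \begin{pmatrix} 5 & 6 & 10 & 12 \\ 21 & 24 & 28 & 32 \end{pmatrix},
\end{displaymath}
while the columns of $\t{\mat{A}}\hada\t{\mat{B}}$ are $\t{(1, 2)}\kron\t{(5, 6)} = \t{(5, 6, 10, 12)}$ and $\t{(3, 4)}\kron\t{(7, 8)} = \t{(21, 24, 28, 32)}$, which is exactly the transpose of the matrix above.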
\paragraph{$n$-mode unfolding:} \emph{Unfolding}, also known as \emph{flattening} or \emph{matricization}, is a reshaping of a tensor into a matrix with rearrangement of the elements such that mode $n$ indexes the rows of the resulting matrix and all other modes are vectorized along the columns. Let $\ten{T}$ be a tensor of order $m$ with dimensions $t_1\times ... \times t_n\times ... \times t_m$ and elements indexed by $(i_1, ..., i_n, ..., i_m)$. The $n$-mode flattening, denoted $\ten{T}_{(n)}$, is defined as a $t_n\times\prod_{k\neq n}t_k$ matrix with element indices $(i_n, j)$ such that $j = 1 + \sum_{k = 1, k\neq n}^m (i_k - 1)\prod_{l = 1, l\neq n}^{k - 1}t_l$ for $1$-indexed $i_k$ (with $0$-indexing this reads $j = \sum_{k = 1, k\neq n}^m i_k\prod_{l = 1, l\neq n}^{k - 1}t_l$).
\todo{ give an example!}
\paragraph{$n$-mode Tensor Product $\ttm[n]$:}
The \emph{$n$-mode tensor product} $\ttm[n]$ between a tensor $\ten{T}$ of order $m$ with dimensions $t_1\times t_2\times ... \times t_n\times ... \times t_m$ and a $p\times t_n$ matrix $\mat{M}$ is defined element-wise as
\begin{displaymath}
(\ten{T}\ttm[n] \mat{M})_{i_1, ..., i_{n-1}, j, i_{n+1}, ..., i_m} = \sum_{k = 1}^{t_n} \ten{T}_{i_1, ..., i_{n-1}, k, i_{n+1}, ..., i_m} \mat{M}_{j, k}
\end{displaymath}
where $i_1, ..., i_{n-1}, i_{n+1}, ..., i_m$ run from $1$ to $t_1, ..., t_{n-1}, t_{n+1}, ..., t_m$, respectively. Furthermore, the $n$-th fiber index $j$ of the product ranges from $1$ to $p$. This gives a new tensor $\ten{T}\ttm[n]\mat{M}$ of order $m$ with dimensions $t_1\times t_2\times ... \times p\times ... \times t_m$.
\begin{example}[Matrix Multiplication Analogs]
Let $\mat{A}$, $\mat{B}$ be two matrices with dimensions $t_1\times t_2$ and $p\times q$, respectively. Then $\mat{A}$ is also a tensor of order $2$ and the $1$-mode and $2$-mode products are element-wise given by
\begin{align*}
(\mat{A}\ttm[1] \mat{B})_{i,j} &= \sum_{l = 1}^{t_1} \mat{A}_{l,j}\mat{B}_{i,l}
= (\mat{B}\mat{A})_{i,j}
& \text{for }t_1 = q, \\
(\mat{A}\ttm[2] \mat{B})_{i,j} &= \sum_{l = 1}^{t_2} \mat{A}_{i,l}\mat{B}_{j,l}
= (\mat{A}\t{\mat{B}})_{i,j} = \t{(\mat{B}\t{\mat{A}})}_{i,j}
& \text{for }t_2 = q.
\end{align*}
In other words, the $1$-mode product equals $\mat{A}\ttm[1] \mat{B} = \mat{B}\mat{A}$ and the $2$-mode product is $\mat{A}\ttm[2] \mat{B} = \mat{A}\t{\mat{B}} = \t{(\mat{B}\t{\mat{A}})}$ in the case of the tensor $\mat{A}$ being a matrix.
\end{example}
\begin{example}[Order Three Analogs]
Let $\ten{A}$ be a tensor of dimensions $t_1\times t_2\times t_3$ and $\mat{B}$ a matrix of dimensions $p\times q$, then the $n$-mode products take the following form
\begin{align*}
(\ten{A}\ttm[1]\mat{B})_{i,j,k} &= \sum_{l = 1}^{t_1} \ten{A}_{l,j,k}\mat{B}_{i,l} & \text{for }t_1 = q, \\
(\ten{A}\ttm[2]\mat{B})_{i,j,k} &= \sum_{l = 1}^{t_2} \ten{A}_{i,l,k}\mat{B}_{j,l} \equiv (\mat{B}\ten{A}_{i,:,:})_{j,k} & \text{for }t_2 = q, \\
(\ten{A}\ttm[3]\mat{B})_{i,j,k} &= \sum_{l = 1}^{t_3} \ten{A}_{i,j,l}\mat{B}_{k,l} \equiv \t{(\mat{B}\t{\ten{A}_{i,:,:}})}_{j,k} & \text{for }t_3 = q.
\end{align*}
\end{example}
Letting $\ten{F}$ be the $3$-tensor of dimensions $n\times k\times r$ such that $\ten{F}_{i,:,:} = \mat{f}_{y_i}$, then
\begin{displaymath}
\mat{\beta}\mat{f}_{y_i}\t{\mat{\alpha}} = (\ten{F}\ttm[2]\mat{\beta}\ttm[3]\mat{\alpha})_{i,:,:}
\end{displaymath}
or in other words, the $i$-th slice of the tensor product $\ten{F}\ttm[2]\mat{\beta}\ttm[3]\mat{\alpha}$ contains $\mat{\beta}\mat{f}_{y_i}\t{\mat{\alpha}}$ for $i = 1, ..., n$.
Another analogous way of writing this is
\begin{displaymath}
(\ten{F}\ttm[2]\mat{\beta}\ttm[3]\mat{\alpha})_{(1)} = \mathbb{F}_{y}(\t{\mat{\alpha}}\kron\t{\mat{\beta}})
\end{displaymath}
where $\mathbb{F}_{y}$ is understood as the mode-$1$ unfolding $\ten{F}_{(1)}$, that is the $n\times k r$ matrix whose $i$-th row is $\t{\vec(\mat{f}_{y_i})}$.
In this section we give a short summary of alternative but equivalent operations.
We use the notation $\widehat{=}$ to indicate that two expressions are identical in the sense that they contain the same elements in the same order but may have different dimensions. That is, when vectorizing either side of $\widehat{=}$, they are equal ($\mat{A}\widehat{=}\mat{B}\ :\Leftrightarrow\ \vec{\mat{A}} = \vec{\mat{B}}$).
In the following, we use $\mat{A}, \mat{B}, \mat{X}, \mat{F}, \mat{R}, ...$ for matrices. 3-Tensors are written as $\ten{A}, \ten{B}, \ten{T}, \ten{X}, \ten{F}, \ten{R}, ...$.
\begin{align*}
\ten{T}\ttm[3]\mat{A}\ &{\widehat{=}}\ \mat{T}\t{\mat A} & \ten{T}(n, p, q)\ \widehat{=}\ \mat{T}(n p, q), \mat{A}(p, q) \\
\ten{T}\ttm[2]\mat{B}\ &{\widehat{=}}\ \mat{B}\ten{T}_{(2)} & \ten{T}(n, p, q), \ten{T}_{(2)}(p, n q), \mat{B}(q, p)
\end{align*}
% \section{Matrix Valued Normal Distribution}
% A random variable $\mat{X}$ of dimensions $p\times q$ is \emph{Matrix-Valued Normal Distribution}, denoted
% \begin{displaymath}
% \mat{X}\sim\mathcal{MN}_{p\times q}(\mat{\mu}, \mat{\Delta}_2, \mat{\Delta}_1),
% \end{displaymath}
% if and only if $\vec\mat{X}\sim\mathcal{N}_{p q}(\vec\mat\mu, \mat\Delta_1\otimes\mat\Delta_2)$. Note the order of the covariance matrices $\mat\Delta_1, \mat\Delta_2$. Its density is given by
% \begin{displaymath}
% f(\mat{X}) = \frac{1}{(2\pi)^{p q / 2}|\mat\Delta_1|^{p / 2}|\mat\Delta_2|^{q / 2}}\exp\left(-\frac{1}{2}\tr(\mat\Delta_1^{-1}\t{(\mat X - \mat \mu)}\mat\Delta_2^{-1}(\mat X - \mat \mu))\right).
% \end{displaymath}
% \section{Sampling form a Multi-Array Normal Distribution}
% Let $\ten{X}$ be an order (rank) $r$ Multi-Array random variable of dimensions $p_1\times...\times p_r$ following a Multi-Array (or Tensor) Normal distributed
% \begin{displaymath}
% \ten{X}\sim\mathcal{TN}(\mu, \mat{\Delta}_1, ..., \mat{\Delta}_r).
% \end{displaymath}
% Its density is given by
% \begin{displaymath}
% f(\ten{X}) = \Big( \prod_{i = 1}^r \sqrt{(2\pi)^{p_i}|\mat{\Delta}_i|^{q_i}} \Big)^{-1}
% \exp\!\left( -\frac{1}{2}\langle \ten{X} - \mu, (\ten{X} - \mu)\times\{\mat{\Delta}_1^{-1}, ..., \mat{\Delta}_r^{-1}\} \rangle \right)
% \end{displaymath}
% with $q_i = \prod_{j \neq i}p_j$. This is equivalent to the vectorized $\vec\ten{X}$ following a Multi-Variate Normal distribution
% \begin{displaymath}
% \vec{\ten{X}}\sim\mathcal{N}_{p}(\vec{\mu}, \mat{\Delta}_r\otimes...\otimes\mat{\Delta}_1)
% \end{displaymath}
% with $p = \prod_{i = 1}^r p_i$.
% \todo{Check this!!!}
\subsection{Alternative covariance estimates}\label{sec:approx}
An alternative approach is \emph{not} to use the MLE estimates for $\mat\Delta_1$, $\mat\Delta_2$ but estimates which are unbiased up to scaling,
\begin{displaymath}
\widetilde{\mat\Delta}_1 = \frac{1}{n}\sum_{i = 1}^n \t{\mat{R}_i}\mat{R}_i {\color{gray}\quad(q\times q)},\qquad
\widetilde{\mat\Delta}_2 = \frac{1}{n}\sum_{i = 1}^n \mat{R}_i\t{\mat{R}_i} {\color{gray}\quad(p\times p)}.
\end{displaymath}
The unbiasedness (up to scaling) follows directly from the following short computation:
\begin{displaymath}
(\E\widetilde{\mat\Delta}_1)_{j,k} = \frac{1}{n}\sum_{i = 1}^n \sum_{l = 1}^p \E\, \ten{R}_{i,l,j}\ten{R}_{i,l,k}
= \frac{1}{n}\sum_{i = 1}^n \sum_{l = 1}^p (\mat{\Delta}_{2})_{l,l}(\mat{\Delta}_{1})_{j,k}
= (\mat\Delta_1\tr(\mat\Delta_2))_{j,k},
\end{displaymath}
which means that $\E\widetilde{\mat\Delta}_1 = \mat\Delta_1\tr(\mat\Delta_2)$ and in analogy $\E\widetilde{\mat\Delta}_2 = \mat\Delta_2\tr(\mat\Delta_1)$. Now, we need to handle the scaling, which can be estimated unbiasedly by
\begin{displaymath}
\tilde{s} = \frac{1}{n}\sum_{i = 1}^n \|\mat{R}_i\|_F^2
\end{displaymath}
because with $\|\mat{R}_i\|_F^2 = \tr \mat{R}_i\t{\mat{R}_i} = \tr \t{\mat{R}_i}\mat{R}_i$ the scale estimate satisfies $\tilde{s} = \tr(\widetilde{\mat\Delta}_1) = \tr(\widetilde{\mat\Delta}_2)$. Then $\E\tilde{s} = \tr(\E\widetilde{\mat\Delta}_1) = \tr{\mat\Delta}_1 \tr{\mat\Delta}_2 = \tr({\mat\Delta}_1\otimes{\mat\Delta}_2)$, leading to the estimate of the covariance as
\begin{displaymath}
\widetilde{\mat\Delta} = \tilde{s}^{-1}(\widetilde{\mat{\Delta}}_1\otimes\widetilde{\mat{\Delta}}_2).
\end{displaymath}
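A heuristic motivation for this scaling, at the level of expectations only, is the following. This is \emph{not} a proof of unbiasedness or consistency of $\widetilde{\mat\Delta}$, since the expectation of a product or ratio is in general not the product or ratio of the expectations:
\begin{displaymath}
\frac{(\E\widetilde{\mat\Delta}_1)\otimes(\E\widetilde{\mat\Delta}_2)}{\E\tilde{s}}
= \frac{\tr(\mat\Delta_2)\tr(\mat\Delta_1)\,(\mat\Delta_1\otimes\mat\Delta_2)}{\tr(\mat\Delta_1)\tr(\mat\Delta_2)}
= \mat\Delta_1\otimes\mat\Delta_2.
\end{displaymath}
The factor $\tilde{s}^{-1}$ therefore removes exactly the trace factors picked up by the two mode-wise estimates.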
\todo{ prove they are consistent, especially $\widetilde{\mat\Delta} = \tilde{s}^{-1}(\widetilde{\mat\Delta}_1\otimes\widetilde{\mat\Delta}_2)$!}
The hoped-for benefit is that these covariance estimates are available in closed form, which means there is no need for an additional iterative estimation step. Before we start with the derivation of the gradients, define the following two quantities
\begin{align*}
\mat{S}_1 &= \frac{1}{n}\sum_{i = 1}^n \t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i = \frac{1}{n}\ten{R}_{(3)}\t{(\ten{R}\ttm[2]\widetilde{\mat{\Delta}}_2^{-1})_{(3)}}\quad{\color{gray}(q\times q)}, \\
\mat{S}_2 &= \frac{1}{n}\sum_{i = 1}^n \mat{R}_i\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i} = \frac{1}{n}\ten{R}_{(2)}\t{(\ten{R}\ttm[3]\widetilde{\mat{\Delta}}_1^{-1})_{(2)}}\quad{\color{gray}(p\times p)}.
\end{align*}
\todo{Check tensor form!}
Now, the matrix normal density with the covariance matrix of the vectorized quantities of the form $\mat{\Delta} = s^{-1}(\mat{\Delta}_1\otimes\mat{\Delta}_2)$ is
\begin{align*}
f(\mat R)
&= \frac{1}{\sqrt{(2\pi)^{p q}|\mat\Delta|}}\exp\left(-\frac{1}{2}\t{\vec(\mat{R})} \mat\Delta^{-1}\vec(\mat{R})\right) \\
&= \frac{s^{p q / 2}}{(2\pi)^{p q / 2}|\mat\Delta_1|^{p / 2}|\mat\Delta_2|^{q / 2}}\exp\left(-\frac{s}{2}\tr(\mat\Delta_1^{-1}\t{\mat{R}}\mat\Delta_2^{-1}\mat{R})\right).
\end{align*}
The approximated log-likelihood is then
\begin{align*}
\tilde{l}(\mat\alpha, \mat\beta)
&=
-\frac{n p q}{2}\log{2\pi}
-\frac{n}{2}\log|\widetilde{\mat{\Delta}}|
-\frac{1}{2}\sum_{i = 1}^n \t{\mat{r}_i}\widetilde{\mat{\Delta}}^{-1}\mat{r}_i \\
&=
-\frac{n p q}{2}\log{2\pi}
+\frac{n p q}{2}\log\tilde{s}
-\frac{n p}{2}\log|\widetilde{\mat{\Delta}}_1|
-\frac{n q}{2}\log|\widetilde{\mat{\Delta}}_2|
-\frac{\tilde{s}}{2}\sum_{i = 1}^n \tr(\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i)
\end{align*}
where $\mat{r}_i = \vec\mat{R}_i$. The second form is due to the property of the determinant for scaling and the Kronecker product, giving $|\widetilde{\mat\Delta}| = (\tilde{s}^{-1})^{p q}|\widetilde{\mat{\Delta}}_1|^p |\widetilde{\mat{\Delta}}_2|^q$, as well as an analogous Kronecker decomposition as in the MLE case.
Note that the following holds
\begin{displaymath}
\sum_{i = 1}^n \tr(\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i)
= n \tr(\widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1)
= n \tr(\widetilde{\mat{\Delta}}_2^{-1}\mat{S}_2)
= n \tr(\mat{S}_1\widetilde{\mat{\Delta}}_1^{-1})
= n \tr(\mat{S}_2\widetilde{\mat{\Delta}}_2^{-1}).
\end{displaymath}
The derivation of the gradient of the approximated log-likelihood $\tilde{l}$ is tedious but straightforward. We tackle the summands separately:
\begin{align*}
\d\log\tilde{s} &= \tilde{s}^{-1}\d\tilde{s} = \frac{2}{n\tilde{s}}\sum_{i = 1}^n \tr(\t{\mat{R}_i}\d\mat{R}_i)
= -\frac{2}{n\tilde{s}}\sum_{i = 1}^n \tr(\t{\mat{f}_{y_i}}\t{\mat{\beta}}\mat{R}_i\d\mat{\alpha} + \mat{f}_{y_i}\t{\mat{\alpha}}\t{\mat{R}_i}\d\mat{\beta}), \\
\d\log|\widetilde{\mat{\Delta}}_1| &=\tr(\widetilde{\mat{\Delta}}_1^{-1}\d\widetilde{\mat{\Delta}}_1) = \frac{2}{n}\sum_{i = 1}^n \tr(\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\d\mat{R}_i)
= -\frac{2}{n}\sum_{i = 1}^n \tr(\t{\mat{f}_{y_i}}\t{\mat{\beta}}\mat{R}_i\widetilde{\mat{\Delta}}_1^{-1}\d\mat{\alpha} + \mat{f}_{y_i}\t{\mat{\alpha}}\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\d\mat{\beta}), \\
\d\log|\widetilde{\mat{\Delta}}_2| &=\tr(\widetilde{\mat{\Delta}}_2^{-1}\d\widetilde{\mat{\Delta}}_2) = \frac{2}{n}\sum_{i = 1}^n \tr(\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\d\mat{R}_i)
= -\frac{2}{n}\sum_{i = 1}^n \tr(\t{\mat{f}_{y_i}}\t{\mat{\beta}}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i\d\mat{\alpha} + \mat{f}_{y_i}\t{\mat{\alpha}}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\d\mat{\beta})
\end{align*}
as well as
\begin{displaymath}
\d\,\tilde{s}\sum_{i = 1}^n \tr(\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i)
= (\d\tilde{s})\sum_{i = 1}^n \tr(\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i)
+ \tilde{s}\, \d \sum_{i = 1}^n \tr(\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i).
\end{displaymath}
We have
\begin{displaymath}
\d\tilde{s} = -\frac{2}{n}\sum_{i = 1}^n \tr(\t{\mat{f}_{y_i}}\t{\mat{\beta}}\mat{R}_i\d\mat{\alpha} + \mat{f}_{y_i}\t{\mat{\alpha}}\t{\mat{R}_i}\d\mat{\beta})
\end{displaymath}
and the remaining term
\begin{align*}
\d\sum_{i = 1}^n\tr(\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i)
= 2\sum_{i = 1}^n \tr(&\t{\mat{f}_{y_i}}\t{\mat{\beta }}(\mat{R}_i \widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1\widetilde{\mat{\Delta}}_1^{-1} + \widetilde{\mat{\Delta}}_2^{-1}\mat{S}_2\widetilde{\mat{\Delta}}_2^{-1} \mat{R}_i - \widetilde{\mat{\Delta}}_2^{-1} \mat{R}_i \widetilde{\mat{\Delta}}_1^{-1})\d\mat{\alpha} \\
+\,&\mat{f}_{y_i} \t{\mat{\alpha}}(\widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i} + \t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{S}_2\widetilde{\mat{\Delta}}_2^{-1} - \widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1})\d\mat{\beta }).
\end{align*}
The last one is tedious but straightforward. Its computation extensively uses the symmetry of $\widetilde{\mat{\Delta}}_1$, $\widetilde{\mat{\Delta}}_2$, the cyclic property of the trace and the relation $\d\mat{A}^{-1} = -\mat{A}^{-1}(\d\mat{A})\mat{A}^{-1}$.
Putting it all together
\begin{align*}
\d\tilde{l}(\mat{\alpha}, \mat{\beta})
&= \frac{n p q}{2}\Big(-\frac{2}{n\tilde{s}}\Big)\sum_{i = 1}^n \tr(\t{\mat{f}_{y_i}}\t{\mat{\beta}}\mat{R}_i\d\mat{\alpha} + \mat{f}_{y_i}\t{\mat{\alpha}}\t{\mat{R}_i}\d\mat{\beta}) \\
&\hspace{3em} - \frac{n p}{2}\Big(-\frac{2}{n}\Big)\sum_{i = 1}^n \tr(\t{\mat{f}_{y_i}}\t{\mat{\beta}}\mat{R}_i\widetilde{\mat{\Delta}}_1^{-1}\d\mat{\alpha} + \mat{f}_{y_i}\t{\mat{\alpha}}\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\d\mat{\beta}) \\
&\hspace{3em} - \frac{n q}{2}\Big(-\frac{2}{n}\Big)\sum_{i = 1}^n \tr(\t{\mat{f}_{y_i}}\t{\mat{\beta}}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i\d\mat{\alpha} + \mat{f}_{y_i}\t{\mat{\alpha}}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\d\mat{\beta}) \\
&\hspace{3em} -\frac{1}{2}\Big(-\frac{2}{n}\Big)\Big(\sum_{i = 1}^n \tr(\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i)\Big)\sum_{i = 1}^n \tr(\t{\mat{f}_{y_i}}\t{\mat{\beta}}\mat{R}_i\d\mat{\alpha} + \mat{f}_{y_i}\t{\mat{\alpha}}\t{\mat{R}_i}\d\mat{\beta}) \\
&\hspace{3em} -\frac{\tilde{s}}{2}2\sum_{i = 1}^n \tr\!\Big(\t{\mat{f}_{y_i}}\t{\mat{\beta }}(\mat{R}_i \widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1\widetilde{\mat{\Delta}}_1^{-1} + \widetilde{\mat{\Delta}}_2^{-1}\mat{S}_2\widetilde{\mat{\Delta}}_2^{-1} \mat{R}_i - \widetilde{\mat{\Delta}}_2^{-1} \mat{R}_i \widetilde{\mat{\Delta}}_1^{-1})\d\mat{\alpha} \\
&\hspace{3em} \hspace{4.7em} + \mat{f}_{y_i} \t{\mat{\alpha}}(\widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i} + \t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{S}_2\widetilde{\mat{\Delta}}_2^{-1} - \widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1})\d\mat{\beta }\Big) \\
&= \sum_{i = 1}^n \tr\bigg(\t{\mat{f}_{y_i}}\t{\mat{\beta}}\Big(
-p q \tilde{s}^{-1} \mat{R}_i + p \mat{R}_i\widetilde{\mat{\Delta}}_1^{-1} + q \widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i + \tr(\widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1)\mat{R}_i \\
&\hspace{3em} \hspace{4.7em} - \tilde{s}(\mat{R}_i \widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1\widetilde{\mat{\Delta}}_1^{-1} + \widetilde{\mat{\Delta}}_2^{-1}\mat{S}_2\widetilde{\mat{\Delta}}_2^{-1} \mat{R}_i - \widetilde{\mat{\Delta}}_2^{-1} \mat{R}_i \widetilde{\mat{\Delta}}_1^{-1})
\Big)\d\mat{\alpha}\bigg) \\
&\hspace{3em}+ \sum_{i = 1}^n \tr\bigg(\mat{f}_{y_i}\t{\mat{\alpha}}\Big(
-p q \tilde{s}^{-1} \t{\mat{R}_i} + p \widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i} + q \t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1} + \tr(\widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1)\t{\mat{R}_i} \\
&\hspace{3em}\hspace{3em} \hspace{4.7em} - \tilde{s}(\widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1\widetilde{\mat{\Delta}}_1^{-1}\t{\mat{R}_i} + \t{\mat{R}_i}\widetilde{\mat{\Delta}}_2^{-1}\mat{S}_2\widetilde{\mat{\Delta}}_2^{-1} - \widetilde{\mat{\Delta}}_1^{-1} \t{\mat{R}_i} \widetilde{\mat{\Delta}}_2^{-1})
\Big)\d\mat{\beta}\bigg).
\end{align*}
Observe that the bracketed expressions before $\d\mat{\alpha}$ and $\d\mat{\beta}$ are transposes of each other. Let us denote the expression for $\d\mat{\alpha}$ as $\mat{G}_i$, which has the form
\begin{align*}
\mat{G}_i
&= (\tr(\widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1) - p q \tilde{s}^{-1})\mat{R}_i
+ (q\mat{I}_p - \tilde{s}\widetilde{\mat{\Delta}}_2^{-1}\mat{S}_2)\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i \\
&\qquad + \mat{R}_i\widetilde{\mat{\Delta}}_1^{-1}(p\mat{I}_q - \tilde{s}\mat{S}_1\widetilde{\mat{\Delta}}_1^{-1})
+ \tilde{s}\widetilde{\mat{\Delta}}_2^{-1}\mat{R}_i\widetilde{\mat{\Delta}}_1^{-1}
\end{align*}
and with $\ten{G}$ the order 3 tensor stacking the $\mat{G}_i$'s such that the first mode indexes the observation
\begin{align*}
\ten{G}
&= (\tr(\widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1) - p q \tilde{s}^{-1})\ten{R}
+ \ten{R}\ttm[2](q\mat{I}_p - \tilde{s}\widetilde{\mat{\Delta}}_2^{-1}\mat{S}_2)\widetilde{\mat{\Delta}}_2^{-1} \\
&\qquad + \ten{R}\ttm[3](p\mat{I}_q - \tilde{s}\widetilde{\mat{\Delta}}_1^{-1}\mat{S}_1)\widetilde{\mat{\Delta}}_1^{-1}
+ \tilde{s}\ten{R}\ttm[2]\widetilde{\mat{\Delta}}_2^{-1}\ttm[3]\widetilde{\mat{\Delta}}_1^{-1}.
\end{align*}
This leads to the following form of the differential of $\tilde{l}$ given by
\begin{displaymath}
\d\tilde{l}(\mat{\alpha}, \mat{\beta})
= \sum_{i = 1}^n \tr(\t{\mat{f}_{y_i}}\t{\mat{\beta}}\mat{G}_i\d\mat{\alpha})
+ \sum_{i = 1}^n \tr(\mat{f}_{y_i}\t{\mat{\alpha}}\t{\mat{G}_i}\d\mat{\beta})
\end{displaymath}
and therefore the gradients
\begin{align*}
\nabla_{\mat{\alpha}}\tilde{l}(\mat{\alpha}, \mat{\beta}) &= \sum_{i = 1}^n \t{\mat{G}_i}\mat{\beta}\mat{f}_{y_i}
= \ten{G}_{(3)}\t{(\ten{F}\ttm[2]\mat{\beta})_{(3)}}, \\
\nabla_{\mat{\beta}} \tilde{l}(\mat{\alpha}, \mat{\beta}) &= \sum_{i = 1}^n \mat{G}_i\mat{\alpha}\t{\mat{f}_{y_i}}
= \ten{G}_{(2)}\t{(\ten{F}\ttm[3]\mat{\alpha})_{(2)}}.
\end{align*}
\todo{check the tensor version of the gradient!!!}
%%% Reference Summaries %%%
\section{Reference Summaries}
This section contains short summaries of the main references with each sub-section concerning one paper.
\subsection{Generalized Tensor Decomposition With Features on Multiple Modes}
The \cite{TensorDecomp-HuLeeWang2022} paper proposes a multi-linear conditional mean model for a rank-constrained tensor decomposition. Let the responses $\ten{Y}\in\mathbb{R}^{d_1\times ... \times d_K}$ be an order $K$ tensor. Associated with each mode $k\in[K]$ they assume feature matrices $\mat{X}_k\in\mathbb{R}^{d_k\times p_k}$. Now, they assume that conditional on the feature matrices $\mat{X}_k$ the entries of the tensor $\ten{Y}$ are independent realizations. The rank constraint is specified through $\mat{r} = (r_1, ..., r_K)$, then the model is given by
\begin{displaymath}
\E(\ten{Y} | \mat{X}_1, ..., \mat{X}_K) = f(\ten{C}\times\{ \mat{X}_1\mat{M}_1, ..., \mat{X}_K\mat{M}_K \}),\qquad \t{\mat{M}_k}\mat{M}_k = \mat{I}_{r_k}\ \forall k\in[K].
\end{displaymath}
The order $K$ tensor $\ten{C}\in\mathbb{R}^{r_1\times...\times r_K}$ is an unknown full-rank core tensor and the matrices $\mat{M}_k\in\mathbb{R}^{p_k\times r_k}$ are unknown factor matrices. The function $f$ is applied element-wise and serves as the link function based on the assumed distribution family of the tensor entries. Finally, the operation $\times$ denotes the tensor-by-matrix product using the shorthand
\begin{displaymath}
\ten{C}\times\{ \mat{X}_1\mat{M}_1, ..., \mat{X}_K\mat{M}_K \}
= \ten{C}\ttm[1]\mat{X}_1\mat{M}_1\ ...\ttm[K]\mat{X}_K\mat{M}_K
\end{displaymath}
with $\ttm[k]$ denoting the $k$-mode tensor matrix product.
The algorithm for estimation of $\ten{C}$ and $\mat{M}_1, ..., \mat{M}_K$ assumes the individual conditional entries of $\ten{Y}$ to be independent and to follow a generalized linear model with link function $f$. The proposed algorithm is an iterative algorithm for maximizing the log-likelihood (equivalently, minimizing its negative)
\begin{displaymath}
l(\ten{C}, \mat{M}_1, ..., \mat{M}_K) = \langle \ten{Y}, \Theta \rangle - \sum_{i_1, ..., i_K} b(\Theta_{i_1, ..., i_K}), \qquad \Theta = \ten{C}\times\{ \mat{X}_1\mat{M}_1, ..., \mat{X}_K\mat{M}_K \}
\end{displaymath}
where $b$ satisfies $b' = f$, that is the canonical link function $f$ is the derivative of the function $b$ of the generalized linear model which the conditional entries of $\ten{Y}$ follow. The algorithm utilizes the higher-order SVD (HOSVD) to enforce the rank constraint.
The main benefit is that this approach generalizes well to a multitude of different structured data sets.
\todo{ how does this relate to the $\mat{X} = \mat{\mu} + \mat{\beta}\mat{f}_y\t{\mat{\alpha}} + \mat{\epsilon}$ model.}