effie

2024-04-11 16:08:00 +02:00 · 2024-04-11 16:08:00 +02:00 · 7a68948d26
commit 7a68948d26
parent 8fd80522f0
2 changed files with 1196 additions and 1154 deletions
--- a/LaTeX/main.bib
+++ b/LaTeX/main.bib
--- a/LaTeX/paper.tex
+++ b/LaTeX/paper.tex
@ -281,23 +281,45 @@
 \section{Introduction}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-In Statistics, tensors are a mathematical tool to represent data of complex structure. In this paper, \textit{tensors} are considered as a generalization of matrices to higher dimensions: A tensor is a multi-dimensional array of numbers. For example, a second-order tensor can be represented as a matrix, while a third-order tensor can be represented as a cube of matrices. 
+Tensors are a mathematical tool to represent data of complex structure in statistics. \textit{Tensors} are considered as a generalization of matrices to higher dimensions: A tensor is a multi-dimensional array of numbers. For example, a second-order tensor can be represented as a matrix, while a third-order tensor can be represented as a cube of matrices. 

-Complex data are collected at different times and/or under several conditions often involving a large number of multi-indexed variables represented as tensor-valued data \parencite{KoldaBader2009}. They occur in large-scale longitudinal studies \parencite[e.g.][]{Hoff2015}, in agricultural experiments and chemometrics and spectroscopy \parencite[e.g.][]{LeurgansRoss1992,Burdick1995}. Also, in signal and video processing where sensors produce multi-indexed data, e.g. over spatial, frequency, and temporal dimensions \parencite[e.g.][]{DeLathauwerCastaing2007,KofidisRegalia2005}, in telecommunications \parencite[e.g.][]{DeAlmeidaEtAl2007}. Other examples of multiway data include 3D images of the brain, where the modes are the 3 spatial dimensions, and spatio-temporal weather imaging data, a set of image sequences represented as 2 spatial modes and 1 temporal mode.
+Complex data are collected at different times and/or under several conditions often involving a large number of multi-indexed variables represented as tensor-valued data \parencite{KoldaBader2009}. They occur in large-scale longitudinal studies \parencite[e.g.][]{Hoff2015}, in agricultural experiments and chemometrics and spectroscopy \parencite[e.g.][]{LeurgansRoss1992,Burdick1995}, in signal and video processing where sensors produce multi-indexed data, e.g. over spatial, frequency, and temporal dimensions \parencite[e.g.][]{DeLathauwerCastaing2007,KofidisRegalia2005}, and in telecommunications \parencite[e.g.][]{DeAlmeidaEtAl2007}. Other examples of multiway data include 3D images of the brain, where the modes are the 3 spatial dimensions, and spatio-temporal weather imaging data, a set of image sequences represented as 2 spatial modes and 1 temporal mode.

-\begin{itemize}
-    \item Review \cite{ZhouLiZhu2013} and see how you compare with them. They focus on the forward regression model with a scalar response but they claim that "Exploit- ing the array structure in imaging data, the new method substantially reduces the dimensionality of imaging data, which leads to efficient estimation and prediction."
-    \item Read \cite{ZhouEtAl2023} to figure out the distribution they use for the tensor-valued predictors and briefly describe what they do.
-    \item Read \cite{RabusseauKadri2016} to figure out what they do. They seem to draw both the response and the predictors from tensor-normal with iid N(0,1) entries: "In order to leverage the tensor structure of the output data, we formulate the problem as the minimization of a least squares criterion subject to a multilinear rank constraint on the regression tensor. The rank constraint enforces the model to capture low-rank structure in the outputs and to explain dependencies between inputs and outputs in a low-dimensional multilinear subspace."
-    \item 
-\end{itemize}
+% \begin{itemize}
+%     \item Review \cite{ZhouLiZhu2013} and see how you compare with them. They focus on the forward regression model with a scalar response but they claim that "Exploiting the array structure in imaging data, the new method substantially reduces the dimensionality of imaging data, which leads to efficient estimation and prediction."
+%     \item Read \cite{ZhouEtAl2023} to figure out the distribution they use for the tensor-valued predictors and briefly describe what they do.
+%     \item Read \cite{RabusseauKadri2016} to figure out what they do. They seem to draw both the response and the predictors from tensor-normal with iid N(0,1) entries: "In order to leverage the tensor structure of the output data, we formulate the problem as the minimization of a least squares criterion subject to a multilinear rank constraint on the regression tensor. The rank constraint enforces the model to capture low-rank structure in the outputs and to explain dependencies between inputs and outputs in a low-dimensional multilinear subspace."
+% \end{itemize}
+
+% - RabusseauKadri2016  Y | x for tensor Y with vector x (HOLRR; Higher Order Low-Rank Regression)
+% - LiZhang2017         Y | x for tensoe Y with vector x (envelope model)
+% - ZhouEtAl2023        Y | x for tensor Y with vector x (sparse and partially observed)
+
+% - ZhouLiZhu2013       y\in R (GLM) for y | Z, X for tensor X
+% - HaoEtAl2021         y\in R for y | X for tensor X (sparse, element wise B-splines)
+
+% Tensor regression models have been proposed to exploit the special structure of tensor covariates, e.g.  \cite{HaoEtAl2021,ZhouLiZhu2013}, or tensor responses \cite{RabusseauKadri2016,LiZhang2017,ZhouEtAl2023} \cite{HaoEtAl2021} modeled a scalar response as a flexible nonparametric function of tensor covariates. \cite{ZhouLiZhu2013} assume the scalar response has a distribution in the exponential family given the tensor-valued predictors and model the link function as a multilinear function of the predictors. \cite{LiZhang2017} model the tensor-valued response as tensor normal. Rather than using $L_1$ type penalty functions to induce sparsity, they employ the envelope method (Cook, Li, and Chiaromonte Citation2010) to estimate the unknown regression coefficient. Moreover, the envelope method essentially identifies and uses the material information jointly. They develop an estimation algorithm and study the asymptotic properties of the estimator. the scalar response as  These models try to utilize the sparse and low-rank structures in the tensors -- either in the regression coefficient tensor or the response tensor -- to boost performance on the regression task by reducing the number of free parameters.
+
+Tensor regression models have been proposed to leverage the structure inherent in tensor valued data. For instance, \textcite{HaoEtAl2021,ZhouLiZhu2013} focus on tensor covariates, while \textcite{RabusseauKadri2016,LiZhang2017,ZhouLiZhu2013} focus on tensor responses, and \textcite{Hoff2015,Lock2018} consider  tensor on tensor regression. \textcite{HaoEtAl2021} modeled a scalar response as a flexible nonparametric function of tensor covariates. \textcite{ZhouLiZhu2013} assume the scalar response has a distribution in the exponential family given the tensor-valued predictors with the link modeled as a multilinear function of the predictors. \textcite{RabusseauKadri2016} model the tensor-valued response as a linear model with tensor valued regression coefficients subject to a multilinear rank constraint. \textcite{LiZhang2017} approach the problem with a similar linear model but instead of a low rank constraint the error term is assumed to have a separable Kronecker product structure while using a generalization of the envelope model \parencite{CookLiChiaromonte2010}. \textcite{ZhouEtAl2023} focus on partially observed tensor response given vector-valued predictors with mode-wise sparsity constraints in the regression coefficients. \textcite{Hoff2015} extends an existing bilinear regression model to a tensor on tensor of conformable modes and dimensions regression model based on a Tucker product. \textcite{Lock2018} uses a tensor contraction to build a penalized least squares model for a tensor with arbitrary number of modes and dimensions.
+
+Our approach considers the general regression problem of fitting a response of general form (univariate, multivariate, tensor-valued) on a tensor-value predictor. We operate in the context of sufficient dimension reduction \parencite[e.g.]{Cook1998,Li2018} based on inverse regression, which leads us to regressing the tensor-valued predictor on the response. In our setting, this necessitates transforming the response to tensor-valued functions, regardless of whether it is itself tensor-valued. Because of the setting, our method shares commonalities with the tensor regression models referred to above, yet the modeling and methodology are novel. 
+Specifically, our tensor-to-tensor regression model is a generalized multi-linear model similar to the generalized linear model of \cite{ZhouLiZhu2013}. % but with tensor valued response by applying (a known) tensor valued function to the response in an inverse regression setting, reversing the role of response and predictors. 
+To bypass the explosion of number of parameters to estimate, we assume the inverse regression error covariance has Kronecker product structure as do \textcite{LiZhang2017}. Our maximum likelihood-based estimation does not require any penalty terms in contrast to the least squares and/or sparse approaches \cite{????} . In the case of a tensor (multilinear) normal, given the tensor-valued function of the response, our model exhibits similarities to the multilinear modeling of \textcite{Hoff2015}, but we use a generalized multilinear model and estimate the parameters with maximum likelihood instead of least squares. Moreover, a common issue in multilinear tensor regression models is the unidentifiability of the parameters, which we address in a completely different manner. For example,  \cite{LiZhang2017} develop theory that is based on orthogonal projection matrices to uniquely identify a subspace, while our approach is more general as it uses manifold theory. 


-Tensor regression models have been proposed to exploit the special structure of tensor covariates, e.g.  \cite{HaoEtAl2021,ZhouLiZhu2013},  or tensor responses \cite{RabusseauKadri2016,LiZhang2017,ZhouEtAl2023} \cite{HaoEtAl2021} modeled a scalar response as a flexible nonparametric function of tensor covariates. \cite{ZhouLiZhu2013} assume the scalar response has a distribution in the exponential family given the tensor-valued predictors and model the link function as a multilinear function of the predictors. \cite{LiZhang2017} model the tensor-valued response as tensor normal. Rather than using $L_1$ type penalty functions to induce sparsity, they employ the envelope method (Cook, Li, and Chiaromonte Citation2010) to estimate the unknown regression coefficient. Moreover, the envelope method essentially identifies and uses the material information jointly. They develop an estimation algorithm and study the asymptotic properties of the estimator. the scalar response as  These models try to utilize the sparse and low-rank structures in the tensors – either in the regression coefficient tensor or the response tensor – to boost performance on the regression task by reducing the number of free parameters. 
+In this paper we present a model-based \emph{Sufficient Dimension Reduction} (SDR) method for tensor-valued data with distribution in the quadratic exponential family assuming a separable Kronecker product structure of the first and second moment. By generalizing the parameter space to embedded manifolds we obtain consistency and asymptotic normality results while allowing great modeling flexibility in the linear sufficient dimension reduction.
+
+The quadratic exponential family contains the tensor normal and the tensor Ising distributions, for continuous and binary tensor-valued random variables, respectively. 

 Multilinear tensor normal models have been used in various applications, including medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatio-temporal data analysis \parencite{GreenewaldHero2014}, regression analysis for longitudinal relational data \parencite{Hoff2015}. One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI) \parencite{OhlsonEtAl2013}. A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.

-In this paper we present a model-based \emph{Sufficient Dimension Reduction} (SDR) method for tensor-valued data with distribution in the quadratic exponential family assuming a Kronecker product structure of the first and second moment. By generalizing the parameter space to embedded manifolds we obtain consistency and asymtotic normality results while allowing great modeling flexibility in the linear sufficient dimension reduction.
+The Ising model for multivariate binary outcomes belongs to the class of discrete exponential families. Its defining feature is that the sufficient statistic involves a quadratic term to capture correlations arising from pairwise interactions.
+The tensor Ising model is a higher-order Ising model for tensor-valued binary outcomes. 
+%From \cite{MukherjeeEtAl2020}
+Higher-order Ising models arise naturally in the study of multi-atom interactions in lattice gas models, such as the square-lattice eight-vertex model, the Ashkin-Teller model, and Suzuki's pseudo-3D anisotropic model (cf. [6, 33, 36, 37, 49, 55, 56, 61, 62] and the references therein). More recently, higher-order spin systems have also been proposed for modeling peer-group effects in social networks [22]. \textcite{MukherjeeEtAl2020} proposed a maximum pseudo-likelihood estimation algorithm for a one-parameter tensor-Ising model. \efi{Daniel: comment on what these guys do and contrast with your setting} In our approach, the parameter is not constrained to be scalar
+We derive maximum likelihood estimates for all first and second order interactions and propose a gradient-based optimization algorithm. 
+
+Our results in the framework of the quadratic exponential family for tensor-valued variables; i.e., consistency and asymptotic normality, apply to both tensor normal and tensor Ising models. 

 The structure of this paper is as follows. In \cref{sec:notation} we introduce our notation. \Cref{sec:problem-formulation} decribes the exact problem and in \cref{sec:gmlm-model} we introduce our model. Continuing in \cref{sec:ml-estimation} we provide the basis for a general maximum likelihood estimation procedure and derive specialized methods for tensor normal as well as the tensor Ising distributions. \Cref{sec:manifolds} gives a short introduction into manifolds and provides the basis for applying the consistency and asymtotic normality results from \cref{sec:asymtotics}. Simulations for continuous and binary predicotrs are subject of \cref{sec:simulations}. Finally, in \cref{sec:data-analysis} we apply our model to EEG data and perform a prove of concept data analysis example where a chess board is interprated as a collection of binary $8\times 8$ matrices.

@ -359,7 +381,7 @@ where the vectorized quantities $\vec{\ten{X}}\in\mathbb{R}^p$ and $\vec\ten{F}(
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Problem Formulation}\label{sec:problem-formulation}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-Our goal is to infer the cumulative distribution function (cdf) $F$ of $Y\mid \ten{X}$, where $\ten{X}$ is assumed to admit $r$-tensor structure of dimension $p_1\times ... \times p_r$ with continuous or discrete entries and the response  $Y$ is unconstrained. The predictor $\ten{X}$ is a complex object; to simplify the problem we assume there exists a tensor valued  function of lower dimension $\ten{R}:\ten{X}\mapsto \ten{R}(\ten{X})$ such that 
+Our goal is to infer the cumulative distribution function (cdf) $F$ of $Y\mid \ten{X}$, where $\ten{X}$ is assumed to admit $r$-tensor structure of dimension $p_1\times ... \times p_r$ with continuous or discrete entries, and the response  $Y$ is unconstrained. The predictor $\ten{X}$ is a complex object; to simplify the problem we assume there exists a tensor-valued function of lower dimension $\ten{R}:\ten{X}\mapsto \ten{R}(\ten{X})$ such that 
 \begin{displaymath}
    F(Y\mid \ten{X}) = F(Y\mid \ten{R}(\ten{X})).
 \end{displaymath}
@ -379,7 +401,7 @@ f_{\mat{\eta}_y}(\ten{X}\mid Y = y)
    &= h(\ten{X})\exp(\t{\mat{\eta}_y}\mat{t}(\ten{X}) - b(\mat{\eta}_y)) \nonumber \\
    &= h(\ten{X})\exp(\langle \mat{t}_1(\ten{X}), \mat{\eta}_{1y} \rangle + \langle \mat{t}_2(\ten{X}), \mat{\eta}_{2y} \rangle - b(\mat{\eta}_{y})) \label{eq:quad-density}
 \end{align}
-where $\mat{t}_1(\ten{X})=\vec \ten{X}$ and $\mat{t}_2(\ten{X})$ is linear in $\ten{X}\circ\ten{X}$. The dependence of $\ten{X}$ on $Y$ is fully captured in the natural parameter $\mat{\eta}_y$. The function $h$ is non-negative real-valued and $b$ is assumed to be at least twice continuously differentiable and strictly convex. An important feature of the \emph{quadratic exponential family} is that the distribution of its members is fully characterized by their first two moments. Distributions within the quadratic exponential family include the \emph{tensor normal} (\cref{sec:tensor-normal-estimation}) and \emph{tensor Ising model} (\cref{sec:ising_estimation}, a generalization of the (inverse) Ising model which is multi-variate Bernoulli with up to second order interactions) and mixtures of these two. 
+where $\mat{t}_1(\ten{X})=\vec \ten{X}$ and $\mat{t}_2(\ten{X})$ is linear in $\ten{X}\circ\ten{X}$. The dependence of $\ten{X}$ on $Y$ is fully captured in the natural parameter $\mat{\eta}_y$. The function $h$ is non-negative real-valued and $b$ is assumed to be at least twice continuously differentiable and strictly convex. An important feature of the \emph{quadratic exponential family} is that the distribution of its members is fully characterized by their first two moments. Distributions within the quadratic exponential family include the \emph{tensor normal} (\cref{sec:tensor-normal-estimation}) and \emph{tensor Ising model} (\cref{sec:ising_estimation}, a generalization of the (inverse)\footnote{\todo{}} Ising model which is multi-variate Bernoulli with up to second order interactions) and mixtures of these two. 

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{The Generalized Multi-Linear Model}\label{sec:gmlm-model}
@ -440,7 +462,7 @@ Under the quadratic exponential family model \eqref{eq:quadratic-exp-fam},  a su

 The reduction in vectorized form is $\vec\ten{R}(\ten{X})=\t{\mat{B}}\vec(\ten{X} - \E\ten{X})$, where $\mat{B} = \bigotimes_{k = r}^{1}\mat{\beta}_k$ with $\Span(\mat{B}) = \Span(\{\mat{\eta}_{1y} - \E_{Y}\mat{\eta}_{1Y} : y\in\mathcal{S}_Y\})$, using $\mathcal{S}_Y$ to denote the set of values of the random variable $Y$.

-\cref{thm:sdr} obtains that the \emph{sufficient reduction} $\ten{R}(\ten{X})$ reduces $\ten{X}$ along each dimension linearly. The graph in \cref{fig:SDRvisual} is a visual depiction of the sufficient reduction.
+\cref{thm:sdr} obtains the \emph{sufficient reduction} $\ten{R}(\ten{X})$ reduces $\ten{X}$ along each dimension linearly. The graph in \cref{fig:SDRvisual} is a visual depiction of the sufficient reduction.

 \begin{figure}
    \centering
@ -464,7 +486,7 @@ The reduction in vectorized form is $\vec\ten{R}(\ten{X})=\t{\mat{B}}\vec(\ten{X
 \end{example}

 \begin{example}[Matrix valued $\mat{X}$ ($r = 2$)]
-    Assuming $\mat{X}$ to be matrix valued, that is $r = 2$, $\mat{\theta} = (\overline{\mat{\eta}}, \mat{\beta}_1, \mat{\beta}_2, \mat{\Omega}_1, \mat{\Omega}_2)$, where the intercept term $\overline{\mat{\eta}}\in\mathbb{R}^{p_1\times p_2}$ is now matrix valued. Similar to \cref{ex:vector-dist} with $\mat{F}_y\in\mathbb{R}^{q_1\times q_2}$ being matrix valued, the conditional density of $\mat{X}\mid Y = y$ reads
+    Assuming $\mat{X}$ to be matrix valued, that is $r = 2$, $\mat{\theta} = (\overline{\mat{\eta}}, \mat{\beta}_1, \mat{\beta}_2, \mat{\Omega}_1, \mat{\Omega}_2)$, where the intercept term $\overline{\mat{\eta}}\in\mathbb{R}^{p_1\times p_2}$ is now matrix valued. Similar to \cref{ex:vector-dist} with $\mat{F}_y\in\mathbb{R}^{q_1\times q_2}$ being matrix valued, the conditional density of $\mat{X}\mid Y = y$ is
    \begin{align*}
        f_{\mat{\theta}}(\mat{X}\mid Y = y)
            &= h(\mat{X})\exp(\langle\vec{\mat{X}}, \mat{\eta}_{1y}(\mat{\theta})\rangle + \langle\vec(\mat{X}\circ\mat{X}), \mat{\eta}_2(\mat{\theta})\rangle - b(\mat{\eta}_y(\mat{\theta}))) \\
@ -502,7 +524,8 @@ A straightforward and general method for parameter estimation is \emph{gradient
 If $\mat{T}_2$ is the identity matrix $\mat{I}_{p(p + 1) / 2}$, then $\ten{G}_2(\mat{\eta}_y) = \ten{g}_2(\mat{\eta}_y)$.
 \end{theorem}

-Although the general case of any GMLM model can be fitted via gradient descent using \cref{thm:grad}, this may be very inefficient. In \cref{thm:grad}, $\mat{T}_2$ can be used to introduce flexible second moment structures. For example, it allows modeling effects differently for predictor components, as described in \cref{sec:ising_estimation} after Eqn. \eqref{eq:ising-cond-prob}. In the remainder, we focus on  $\mat{T}_2$'s that are identity matrices. This approach simplifies the estimation algorithm and the speed of the numerical calculation in  the case of tensor normal predictors.  In the case of the tensor normal distribution, an iterative cyclic updating scheme is derived in \cref{sec:tensor-normal-estimation}, which  has much faster convergence, is stable and does not require hyper parameters, as will be discussed later. On the other hand, the Ising model does not allow such a scheme. There we need to use a gradient based method, which is the subject of \cref{sec:ising_estimation}.
+Although the general case of any GMLM model can be fitted via gradient descent using \cref{thm:grad}, this may be very inefficient. In \cref{thm:grad}, $\mat{T}_2$ can be used to introduce flexible second moment structures. For example, it allows modeling effects differently for predictor components, as described in \cref{sec:ising_estimation} after Eqn. \eqref{eq:ising-cond-prob}. In the remainder, we focus on  $\mat{T}_2$'s that are identity matrices. This approach simplifies the estimation algorithm and the speed of the numerical calculation in  the case of tensor normal predictors. % In the case of the tensor normal distribution, 
+An iterative cyclic updating scheme is derived in \cref{sec:tensor-normal-estimation}, which  has much faster convergence, is stable and does not require hyperparameters, as will be discussed later. On the other hand, the Ising model does not allow such a scheme. There we need to use a gradient-based method, which is the subject of \cref{sec:ising_estimation}.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Tensor Normal}\label{sec:tensor-normal-estimation}
@ -512,7 +535,14 @@ representing a vector of observations \parencite{OhlsonEtAl2013}. \textcite{Koll

 The defining feature of the matrix normal distribution, and its tensor extension, is the Kronecker product structure of its covariance. This formulation, where the covariates are multivariate normal with multiway covariance structure modeled as a Kronecker product of matrices of much lower dimension, aims to overcome the significant modeling and computational challenges arising from the high computational complexity of manipulating tensor representations \parencite[see, e.g.,][]{HillarLim2013,WangEtAl2022}. 

-% Multilinear tensor normal models have been used in various applications, including medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatio-temporal data analysis \parencite{GreenewaldHero2014}, regression analysis for longitudinal relational data \parencite{Hoff2015}. One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI) \parencite{OhlsonEtAl2013}. A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.
+Multilinear tensor normal %Kronecker-separable covariance 
+models have been used in various applications, including
+medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatio-temporal data analysis \parencite{GreenewaldHero2014}, regression analysis
+for longitudinal relational data \parencite{Hoff2015}.
+%, radar [AFJ10], and multiple-input-multiple-output (MIMO) communications [WJS08]. 
+One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI)  \parencite{OhlsonEtAl2013}.
+A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.
+%The first occurrence of the \textit{matrix normal} we found, even though not explicitly called as such, was in \textcite{SrivastavaKhatri1979}. 

 Suppose $\ten{X}\mid Y = y$ follows a tensor normal distribution  with mean $\ten{\mu}_y$ and covariance $\mat{\Sigma} = \bigkron_{k = r}^{1}\mat{\Sigma}_k$. We assume the distribution is non-degenerate which means that the covariances $\mat{\Sigma}_k$ are symmetric positive definite matrices. Its density is given by
 \begin{displaymath}
@ -528,7 +558,7 @@ where we used that $\overline{\ten{\eta}} = 0$ due to $0 = \E\ten{X} = \E\E[\ten
    \ten{g}_1(\mat{\eta}_y) = \E[\ten{X}\mid Y = y] = \ten{\mu}_y, \qquad
    \ten{G}_2(\mat{\eta}_y) = \ten{g}_2(\mat{\eta}_y) = \E[\ten{X}\circ\ten{X}\mid Y = y] \equiv \bigkron_{k = r}^1\mat{\Sigma}_k + (\vec{\ten{\mu}}_y)\t{(\vec{\ten{\mu}}_y)}.
 \end{displaymath}
-In practice, we assume we have a random sample of $n$ observations $(\ten{X}_i, \ten{F}_{y_i})$ from the joint distribution. We start the estimation process by demeaning them. Then, only the reduction matrices $\mat{\beta}_k$ and the scatter matrices $\mat{\Omega}_k$ need to be estimated. To solve the optimization problem \eqref{eq:mle}, with $\overline{\ten{\eta}} = 0$ we initialize  the parameters using a simple heuristic approach. % For initial estimates $\hat{\mat{\beta}}_k^{(0)}$ we  
+In practice, we assume we have a random sample of $n$ observations $(\ten{X}_i, \ten{F}_{y_i})$ from the joint distribution. We start the estimation process by demeaning the data. Then, only the reduction matrices $\mat{\beta}_k$ and the scatter matrices $\mat{\Omega}_k$ need to be estimated. To solve the optimization problem \eqref{eq:mle}, with $\overline{\ten{\eta}} = 0$ we initialize  the parameters using a simple heuristic approach. % For initial estimates $\hat{\mat{\beta}}_k^{(0)}$ we  
 First, we compute moment based mode-wise marginal covariance estimates $\widehat{\mat{\Sigma}}_k(\ten{X})$ and $\widehat{\mat{\Sigma}}_k(\ten{F}_Y)$ as
 \begin{displaymath}
    \widehat{\mat{\Sigma}}_k(\ten{X})   = \frac{1}{n}\sum_{i = 1}^{n} (\ten{X}_i)_{(k)}\t{(\ten{X}_i)_{(k)}}, \qquad
@ -538,7 +568,7 @@ Then, for every mode $k = 1, \ldots, r$, we compute the first $j = 1, \ldots, q_
 \begin{align*}
    \mat{U}_k &= (\mat{v}_1(\widehat{\mat{\Sigma}}_1(\ten{X})), \ldots, \mat{v}_{q_k}(\widehat{\mat{\Sigma}}_{q_k}(\ten{X}))), \\
    \mat{D}_k &= \diag(\mat{v}_1(\widehat{\mat{\Sigma}}_1(\ten{X}))\mat{v}_1(\widehat{\mat{\Sigma}}_1(\ten{F}_{Y})), \ldots, \mat{v}_{q_k}(\widehat{\mat{\Sigma}}_{q_k}(\ten{X}))\mat{v}_{q_k}(\widehat{\mat{\Sigma}}_k(\ten{F}_{Y}))), \\
-    \mat{V}_k &= (\mat{v}_1(\widehat{\mat{\Sigma}}_1(\ten{F}_Y), \ldots, \mat{v}_{q_k}(\widehat{\mat{\Sigma}}_{q_k}(\ten{F}_Y)). \\
+    \mat{V}_k &= (\mat{v}_1(\widehat{\mat{\Sigma}}_1(\ten{F}_Y), \ldots, \mat{v}_{q_k}(\widehat{\mat{\Sigma}}_{q_k}(\ten{F}_Y)). 
 \end{align*}
 The initial value of $\mat{\beta}_k$  is
 \begin{displaymath}
@ -561,7 +591,7 @@ Given  $\hat{\mat{\beta}}_1, \ldots, \hat{\mat{\beta}}_r, \hat{\mat{\Omega}}_1,
        \hat{\mat{\Omega}}_j.
 \end{equation}
 %For the scatter matrices $\mat{\Omega}_j$, we need to fudge a bit. 
-Equating the partial gradient of the $j$th scatter matrix $\mat{\Omega}_j$ in \cref{thm:grad} to zero ( $\nabla_{\mat{\Omega}_j}l_n = 0$) gives a quadratic matrix equation. This is due to the dependence of $\ten{\mu}_y$ on $\mat{\Omega}_j$. In practice though, it is faster, more stable,  and equally accurate to use mode-wise covariance estimates via the residuals
+Equating the partial gradient of the $j$th scatter matrix $\mat{\Omega}_j$ in \cref{thm:grad} to zero ( $\nabla_{\mat{\Omega}_j}l_n = 0$) gives a quadratic matrix equation due to the dependence of $\ten{\mu}_y$ on $\mat{\Omega}_j$. In practice though, it is faster, more stable,  and equally accurate to use mode-wise covariance estimates via the residuals
 \begin{displaymath}
    \hat{\ten{R}}_i = \ten{X}_i - \hat{\ten{\mu}}_{y_i} = \ten{X}_i - \ten{F}_{y_i}\mlm_{k = 1}^{r}\hat{\mat{\Omega}}_k^{-1}\hat{\mat{\beta}}_k.
 \end{displaymath} 
@ -569,7 +599,7 @@ The estimates are computed via
 \begin{displaymath}
    \tilde{\mat{\Sigma}}_j = \sum_{i = 1}^n (\hat{\ten{R}}_i)_{(j)} \t{(\hat{\ten{R}}_i)_{(j)}},
 \end{displaymath}
-where $\tilde{s}\tilde{\mat{\Sigma}}_j = \hat{\mat{\Omega}}_j^{-1}$. For scaling  we use that the mean squared error has to be equal to the trace of the covariance estimate,
+where $\tilde{s}\tilde{\mat{\Sigma}}_j = \hat{\mat{\Omega}}_j^{-1}$. To decide on the scaling factor $\tilde{s}$  we use that the mean squared error has to be equal to the trace of the covariance estimate,
 \begin{displaymath}
    \frac{1}{n}\sum_{i = 1}^n \langle \hat{\ten{R}}_i, \hat{\ten{R}}_i \rangle = \tr\bigkron_{k = r}^{1}\hat{\mat{\Omega}}_k^{-1} = \prod_{k = 1}^{r}\tr{\hat{\mat{\Omega}}_k^{-1}} = \tilde{s}^r\prod_{k = 1}^{r}\tr{\tilde{\mat{\Sigma}}_k},
 \end{displaymath}
@ -580,43 +610,23 @@ so that
 resulting in the estimates $\hat{\mat{\Omega}}_j = (\tilde{s}\tilde{\mat{\Sigma}}_j)^{-1}$.
 Estimation is then performed by updating  the estimates $\hat{\mat{\beta}}_j$ via \eqref{eq:tensor_normal_beta_solution} for $j = 1, \ldots, r$,  and then recompute the $\hat{\mat{\Omega}}_j$ estimates simultaneously keeping the $\hat{\mat{\beta}}_j$'s fixed. This procedure is repeated until convergence. % Convergence is very fast, experiments showed that convergence occures usualy in less than $10$ iterations.

-A technical detail for numerical stability is to ensure that the scaled values $\tilde{s}\tilde{\mat{\Sigma}}_j$, assumed to be symmetric and positive definite, are well conditioned. Thus, we estimate the condition number of $\tilde{s}\tilde{\mat{\Sigma}}_j$ prior to computing the inverse. In case of  ill- conditioning, we use the regularized $\hat{\mat{\Omega}}_j = (\tilde{s}\tilde{\mat{\Sigma}}_j + 0.2 \lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)\mat{I}_{p_j})^{-1}$ instead, where $\lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)$ is the first (maximum) eigenvalue. Experiments showed that this regularization is usually only required in the first few iterations.
+A technical detail for numerical stability is to ensure that the scaled values $\tilde{s}\tilde{\mat{\Sigma}}_j$, assumed to be symmetric and positive definite, are well conditioned. Thus, we estimate the condition number of $\tilde{s}\tilde{\mat{\Sigma}}_j$ before computing the inverse. In case of ill-conditioning, we use the regularized $\hat{\mat{\Omega}}_j = (\tilde{s}\tilde{\mat{\Sigma}}_j + 0.2 \lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)\mat{I}_{p_j})^{-1}$ instead, where $\lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)$ is the first (maximum) eigenvalue. Experiments showed that this regularization is usually only required in the first few iterations.

-Furthermore, if the parameter space follows a more general setting as in \cref{thm:param-manifold}, updating may produces estimates outside  the parameter space. A simple and efficient method is to project every updated estimate onto the corresponding manifold.
+Furthermore, if the parameter space follows a more general setting as in \cref{thm:param-manifold}, updating may produce estimates outside the parameter space. A simple and efficient method is to project every updated estimate onto the corresponding manifold.

 A standard method to compute the MLE of a Kronecker product is block-coordinate descent, also
 referred to as the ``flip-flop algorithm.'' This algorithm was proposed independently by \textcite{MardiaGoodall1993} and \textcite{Dutilleul1999} and was later called ``flip-flop'' algorithm by \textcite{LuZimmerman2005} for the computation of the maximum likelihood estimators of the components of a separable covariance matrix. \textcite{ManceurDutilleul2013} extended the ``flip-flop'' algorithm for the computation of the MLE of the separable covariance structure of a 3-way and 4-way normal distribution and obtained a lower bound for the sample size required for its existence. The same issue was also studied by \textcite{DrtonEtAl2020} in the case of a two-way array (matrix). Our algorithm uses a
 similar “flip-flop” approach by iteratively updating the $\mat{\beta}_k$'s and $\mat{\Omega}_k$'s, one after the other.


-\subsubsection{Matrix and Tensor Normal}
-
-The tensor normal model is the extension of the matrix normal to tensor-valued random variables and a member of the quadratic exponential family \eqref{eq:quadratic-exp-fam} under \eqref{eq:eta2}. \textcite{Dawid1981,Arnold1981} introduced the term matrix normal and, in particular, \textcite{Arnold1981} provided several theoretical results, such as its density, moments and conditional distributions of its components. The matrix normal distribution is  a bilinear normal distribution; a distribution of a two-way array, each component
-representing a vector of observations \parencite{OhlsonEtAl2013}. \textcite{KolloVonRosen2005,Hoff2011,OhlsonEtAl2013} presented the extension of the bilinear to the multilinear normal distribution, what we call tensor normal in this paper, using a parallel extension of bilinear matrices to multilinear tensors \parencite{Comon2009}. 
-
-The defining feature of the matrix normal distribution, and its tensor extension, is the Kronecker product structure of its covariance. This formulation, where the covariates are multivariate normal with multiway covariance structure modeled as a Kronecker
-product of matrices of much lower dimension, aims to overcome the significant modeling and computational challenges arising from the
-high computational complexity of manipulating tensor representations \parencite[see, e.g.,][]{HillarLim2013,WangEtAl2022}. 
-
-Multilinear tensor normal %Kronecker-separable covariance 
-models have been used in various applications, including
-medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatio-temporal data analysis \parencite{GreenewaldHero2014}, regression analysis
-for longitudinal relational data \parencite{Hoff2015}.
-%, radar [AFJ10], and multiple-input-multiple-output (MIMO) communications [WJS08]. 
-One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI)  \parencite{OhlsonEtAl2013}.
-A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.
-%The first occurrence of the \textit{matrix normal} we found, even though not explicitly called as such, was in \textcite{SrivastavaKhatri1979}. 
-
-
-
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Ising Model}\label{sec:ising_estimation}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-The Ising\footnote{Also known as the \emph{Lenz-Ising} model as the physical assumptions of the model where developed by both Lenz and Ising \parencite{Niss2005}. Ising gave a closed form solution for the 1-dimensional lattice, that is, a linear chain \parencite{Ising1925}.} model \parencite{Lenz1920,Ising1925,Niss2005} is a mathematical model originating in statistical physics to study ferromagnetism in a thermodynamic setting. It describes magentic dipoles (atomic ``spins'') which can take two states ($\pm 1$) while allowing two-way interactions between direct neighbours on a lattice, a discrete grid. The model assumes all elementary magnets to be the same, which translates to all having the same coupling strength (two-way interactions) governed by a single parameter relating to the temperature of the system. Nowadays, the Ising model, in its general form, allows for different coupling strength for every (symmetric) interaction as well as an external magnetic field acting on every magnetic dipole separately. A modern review is given by \textcite{NguyenEtAl2017}.
+The Ising\footnote{Also known as the \emph{Lenz-Ising} model as the physical assumptions of the model where developed by both Lenz and Ising \parencite{Niss2005}. Ising gave a closed form solution for the 1-dimensional lattice, that is, a linear chain \parencite{Ising1925}.} model \parencite{Lenz1920,Ising1925,Niss2005} is a mathematical model originating in statistical physics to study ferromagnetism in a thermodynamic setting. It describes magentic dipoles (atomic ``spins'') which can take two states ($\pm 1$) while allowing two-way interactions between direct neighbours on a lattice, a discrete grid. The model assumes all elementary magnets to be the same, which translates to all having the same coupling strength (two-way interactions) governed by a single parameter relating to the temperature of the system. Nowadays, the Ising model, in its general form, allows for different coupling strength for every (symmetric) interaction as well as an external magnetic field acting on every magnetic dipole separately. A review is given by \textcite{NguyenEtAl2017}.

 In statistics, the Ising model is used to model multivariate binary data. That is, the states are ${0, 1}$ instead of $\pm 1$. It is related to a multitude of other models; \emph{Graphical Models} and \emph{Markov Random Fields} to describe conditional dependence \parencite{Lauritzen1996,WainwrightJordan2008,LauritzenRichardson2002}, \emph{Potts models} \parencite{Besag1974,ChakrabortyEtAl2022} which generalize the Ising model to multiple states, the \emph{multivariate Bernoulli distribution} \parencite{Whittaker1990,JohnsonEtAl1997,DaiDingWahba2013} considering also interactions (tree-way and higher), to give the most prominent. 

-The $p$-dimensional Ising model is a discrete probability distribution on the set of $p$-dimensional binary vectors $\mat{x}\in\{0, 1\}^p$ with pmf given by
+The $p$-dimensional Ising model is a discrete probability distribution on the set of $p$-dimensional binary vectors $\mat{x}\in\{0, 1\}^p$ with probability mass function (pmf) given by
 \begin{displaymath}
    P_{\mat{\gamma}}(\mat{x}) = p_0(\mat{\gamma})\exp(\t{\vech(\mat{x}\t{\mat{x}})}\mat{\gamma}).
 \end{displaymath}
@ -624,20 +634,20 @@ The scaling factor $p_0(\mat{\gamma})\in\mathbb{R}_{+}$ ensures that $P_{\mat{\g
 \begin{equation}\label{eq:ising-partition-function}
    p_0(\mat{\gamma})^{-1} = \sum_{\mat{x}\in\{0, 1\}^p}\exp(\t{\vech(\mat{x}\t{\mat{x}})}\mat{\gamma}).
 \end{equation}
-By an abuse of notation, we let $\mat{\gamma}_{j l}$ denote the element of $\mat{\gamma}$ corresponding to $\mat{x}_j\mat{x}_l$ in $\vech(\mat{x}\t{\mat{x}})$\footnote{Specifically, the element $\mat{\gamma}_{j l}$ of $\mat{\gamma}$ is a short hand for $\mat{\gamma}_{\iota(j, l)}$ with $\iota(j, l) = (\min(j, l) - 1)(2 p - \min(j, l)) / 2 + \max(j, l)$ mapping the matrix row index $j$ and column index $l$ to the corresponding half vectorization indices $\iota(j, l)$.}. The ``diagonal'' parameter $\mat{\gamma}_{j j}$ expresses the conditional log odds of $X_j = 1\mid X_{-j} = \mat{0}$, where the negative subscript in $X_{-j}$ describes the $p - 1$ dimensional vector $X$ with the $j$th element removed. The off diagonal entries $\mat{\gamma}_{j l}$, $j\neq l$, are equal to the conditional log odds of simultaneous occurrence $X_j = 1, X_l = 1 \mid X_{-j, -l} = \mat{0}$. More precisely, the conditional probabilities and the natural parameters are related via
+Abusing notation, we let $\mat{\gamma}_{j l}$ denote the element of $\mat{\gamma}$ corresponding to $\mat{x}_j\mat{x}_l$ in $\vech(\mat{x}\t{\mat{x}})$\footnote{Specifically, the element $\mat{\gamma}_{j l}$ of $\mat{\gamma}$ is a short hand for $\mat{\gamma}_{\iota(j, l)}$ with $\iota(j, l) = (\min(j, l) - 1)(2 p - \min(j, l)) / 2 + \max(j, l)$ mapping the matrix row index $j$ and column index $l$ to the corresponding half vectorization indices $\iota(j, l)$.}. The ``diagonal'' parameter $\mat{\gamma}_{j j}$ expresses the conditional log odds of $X_j = 1\mid X_{-j} = \mat{0}$, where the negative subscript in $X_{-j}$ describes the $p - 1$ dimensional vector $X$ with the $j$th element removed. The off diagonal entries $\mat{\gamma}_{j l}$, $j\neq l$, are equal to the conditional log odds of simultaneous occurrence $X_j = 1, X_l = 1 \mid X_{-j, -l} = \mat{0}$. More precisely, the conditional probabilities and the natural parameters are related via
 \begin{align}
    \mat{\gamma}_{j j} &= \log\frac{P_{\mat{\gamma}}(X_j = 1\mid X_{-j} = \mat{0})}{1 - P_{\mat{\gamma}}(X_j = 1\mid X_{-j} = \mat{0})}, \nonumber \\
    \mat{\gamma}_{j l} &= \log\frac{1 - P_{\mat{\gamma}}(X_j = 1\mid X_{-j} = \mat{0})P_{\mat{\gamma}}(X_l = 1\mid X_{-l} = \mat{0})}{P_{\mat{\gamma}}(X_j = 1\mid X_{-j} = \mat{0})P_{\mat{\gamma}}(X_l = 1\mid X_{-l} = \mat{0})}\frac{P_{\mat{\gamma}}(X_j = 1, X_l = 1\mid X_{-j, -l} = \mat{0})}{1 - P_{\mat{\gamma}}(X_j = 1, X_l = 1\mid X_{-j, -l} = \mat{0})} \label{eq:ising-two-way-log-odds}.
 \end{align}
 Conditional Ising models, incorporating the information of covariates $Y$ into the model, were considered by \textcite{ChengEtAl2014,BuraEtAl2022}. The direct way is to parameterize $\mat{\gamma} = \mat{\gamma}_y$ by the covariate $Y = y$ to model a conditional distribution $P_{\mat{\gamma}_y}(\mat{x}\mid Y = y)$.

-We extend the conditional pmf by allowing the binary variables to be tensor valued; that is,  we set $\mat{x} = \vec{\ten{X}}$, with dimension $p = \prod_{k = 1}^{r}p_k$ for $\ten{X}\in\{ 0, 1 \}^{p_1\times\cdots\times p_r}$. The tensor structure of $\ten{X}$ is accommodated  by assuming Kronecker product constraints to the parameter vector $\mat{\gamma}_y$ in a similar fashion as for the tensor normal model. This means that we compare the pmf $P_{\mat{\gamma}_y}(\vec{\ten{X}} | Y = y)$ with the quadratic exponential family \eqref{eq:quadratic-exp-fam} with the natural parameters modeled by \eqref{eq:eta1} and \eqref{eq:eta2}. A detail to be considered is that the diagonal of $(\vec{\ten{X}})\t{(\vec{\ten{X}})}$ is equal to $\vec{\ten{X}}$. This gives the GMLM model as
+We extend the conditional pmf by allowing the binary variables to be tensor valued; that is,  we set $\mat{x} = \vec{\ten{X}}$, with dimension $p = \prod_{k = 1}^{r}p_k$ for $\ten{X}\in\{ 0, 1 \}^{p_1\times\cdots\times p_r}$. The tensor structure of $\ten{X}$ is accommodated  by assuming Kronecker product constraints to the parameter vector $\mat{\gamma}_y$ in a similar fashion as for the tensor normal model. This means that we compare the pmf $P_{\mat{\gamma}_y}(\vec{\ten{X}} | Y = y)$ with the quadratic exponential family \eqref{eq:quadratic-exp-fam} with the natural parameters modeled by \eqref{eq:eta1} and \eqref{eq:eta2}. A detail to be considered is that the diagonal of $(\vec{\ten{X}})\t{(\vec{\ten{X}})}$ is equal to $\vec{\ten{X}}$, which results in the GMLM being expressed as
 \begin{align}
    P_{\mat{\gamma}_y}(\ten{X} \mid Y = y)
        &= p_0(\mat{\gamma}_y)\exp(\t{\vech((\vec{\ten{X}})\t{(\vec{\ten{X}})})}\mat{\gamma}_y) \nonumber \\
        &= p_0(\mat{\gamma}_y)\exp\Bigl(\Bigl\langle \ten{X}, \ten{F}_y\mlm_{k = 1}^{r}\mat{\beta}_k \Bigr\rangle + \Bigl\langle\ten{X}\mlm_{k = 1}^{r}\mat{\Omega}_k, \ten{X}\Bigr\rangle\Bigr)\label{eq:ising-cond-prob}
 \end{align}
-where we set $\overline{\ten{\eta}} = 0$ and $\mat{T}_2$ to the identity. This imposes an additional constraint to the model, the reason is that the diagonal elements of $\mat{\Omega} = \bigkron_{k = r}^{1}\mat{\Omega}_k$ take the role of $\overline{\ten{\eta}}$, although not fully. Having the diagonal of $\mat{\Omega}$ and $\overline{\ten{\eta}}$ handling the self interaction effects might lead to interference in the optimization routine. Another approach would be to use the $\mat{T}_2$ matrix to set the corresponding diagonal elements of $\mat{\Omega}$ to zero and let $\overline{\ten{\eta}}$ handle the self interaction effect. All of those approaches, namely setting $\overline{\ten{\eta}} = 0$, keeping $\overline{\ten{\eta}}$ or using $\mat{T}_2$, are theoretically solid and compatible with \cref{thm:grad,thm:param-manifold,thm:asymptotic-normality-gmlm}, assuming all axis dimensions $p_k$ are non-degenerate, that is $p_k > 1$ for all $k = 1, \ldots, r$. Regardless, under our modeling choice the relation between the natural parameters $\mat{\gamma}_y$ of the conditional Ising model and the GMLM parameters $\mat{\beta}_k$ and $\mat{\Omega}_k$ is
+where we set $\overline{\ten{\eta}} = 0$ and $\mat{T}_2$ to the identity. This imposes an additional constraint to the model, the reason is that the diagonal elements of $\mat{\Omega} = \bigkron_{k = r}^{1}\mat{\Omega}_k$ take the role of $\overline{\ten{\eta}}$, although not fully. Having the diagonal of $\mat{\Omega}$ and $\overline{\ten{\eta}}$ handling the self interaction effects might lead to interference in the optimization routine. Another approach would be to use the $\mat{T}_2$ matrix to set the corresponding diagonal elements of $\mat{\Omega}$ to zero and let $\overline{\ten{\eta}}$ handle the self interaction effect. All of these approaches, namely setting $\overline{\ten{\eta}} = 0$, keeping $\overline{\ten{\eta}}$ or using $\mat{T}_2$, are theoretically solid and compatible with \cref{thm:grad,thm:param-manifold,thm:asymptotic-normality-gmlm}, assuming all axis dimensions $p_k$ are non-degenerate, that is $p_k > 1$ for all $k = 1, \ldots, r$. Regardless, under our modeling choice, the relation between the natural parameters $\mat{\gamma}_y$ of the conditional Ising model and the GMLM parameters $\mat{\beta}_k$ and $\mat{\Omega}_k$ is
 \begin{equation}\label{eq:ising-natural-params}
    % \t{\pinv{\mat{D}_p}}\mat{\gamma}_y
    %     = \vec(\mat{\Omega} + \diag(\mat{B}\vec{\ten{F}_y}))
@ -648,41 +658,41 @@ where we set $\overline{\ten{\eta}} = 0$ and $\mat{T}_2$ to the identity. This i
 \end{equation}
 In contract to the tensor normal GMLM, the matrices $\mat{\Omega}_k$ are only required to be symmetric. More specifically, we require $\mat{\Omega}_k$, for $k = 1, \ldots, r$, to be elements of an embedded submanifold of $\SymMat{p_k}$ (see: \cref{sec:kron-manifolds,sec:matrix-manifolds}). The mode wise reduction matrices $\mat{\beta}_k$ are elements of an embedded submanifold of $\mathbb{R}^{p_k\times q_k}$. Common choices are listed in \cref{sec:matrix-manifolds}.

-To solve the optimization problem \eqref{eq:mle}, given a data set $(\ten{X}_i, y_i)$, for $i = 1, \ldots, n$, we use a variation of gradient descent.
+To solve the optimization problem \eqref{eq:mle}, given a data set $(\ten{X}_i, y_i)$,  $i = 1, \ldots, n$, we use a variation of gradient descent.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsection{Initial Values}

-The first step is to get reasonable starting values. Experiments showed that a good starting value of $\mat{\beta}_k$, for $k = 1, \ldots, r$, it to use the tensor normal estimates from \cref{sec:tensor-normal-estimation}, considering $\ten{X}_i$ as continuous. For initial values of $\mat{\Omega}_k$, a different approach is required. Setting everything to the uninformed initial value, that is $\mat{\Omega}_k = \mat{0}$ as this corresponds to the conditional log odds to be $1:1$ for every component and pairwaide interaction. This is not possible, since $\mat{0}$ is a stationary point of the log-likelihood. This is directly observed by taking a look at the partial gradients of the log-likelihood in \cref{thm:grad}. Instead, we use a crude heuristic which threads every mode seperately and ignores any relation to the covariates. It is computationaly cheap and better than any of the alternatives we considered. For every $k = 1, \ldots, r$, let the $k$'th mode second moment estimate be
+The first step is to get reasonable starting values. Experiments showed that a good starting value of $\mat{\beta}_k$ is to use the tensor normal estimates from \cref{sec:tensor-normal-estimation} for $k = 1, \ldots, r$, considering $\ten{X}_i$ as continuous. For initial values of $\mat{\Omega}_k$, a different approach is required. Setting everything to the uninformed initial value, that is $\mat{\Omega}_k = \mat{0}$ as this corresponds to the conditional log odds to be $1:1$ for every component and pairwise interaction. This is not possible, since $\mat{0}$ is a stationary point of the log-likelihood. This is directly observed by considering the partial gradients of the log-likelihood in \cref{thm:grad}. Instead, we use a crude heuristic which threads every mode seperately and ignores any relation to the covariates. It is computationaly cheap and better than any of the alternatives we considered. For every $k = 1, \ldots, r$, let the $k$th mode second moment estimate be
 \begin{equation}\label{eq:ising-mode-moments}
    \hat{\mat{M}}_{2(k)} = \frac{p_k}{n p}\sum_{i = 1}^n (\ten{X}_i)_{(k)}\t{(\ten{X}_i)_{(k)}}
 \end{equation}
-which contains the $k$'th mode first moment estimate in its diagonal $\hat{\mat{M}}_{1(k)} = \diag\hat{\mat{M}}_{2(k)}$. Considering every column of the matricized observation $(\ten{X}_i)_{(k)}$ as a $p_k$ dimensional observation itself. The number of those artificially generated observations is $n \prod_{j\neq k}p_j$. Let $Z_k$ denote the random variable those artificial observations are realization of. Then, we can interpret the elements $(\hat{\mat{M}}_{1(k)})_{j}$ as the estimates of the probability $P((Z_k)_j = 1)$, that is the marginal probability of the $j$th element of $Z_k$ being $1$. Similar, for $l \neq j$ we have $(\hat{\mat{M}}_{2(k)})_{j l}$ estimating $P((Z_k)_j = 1, (Z_k)_l = 1)$, the marginal probability of two-way interactions. % Without any regard of accuracy ...
-Now, we set the diagonal elements of $\mat{\Omega}_k$ to zero. For the off diagonal elements of $\mat{\Omega}_k$, we equate the conditional probabilities $P((Z_k)_j = 1 \mid (Z_k)_{-j} = \mat{0})$ and $P((Z_k)_j = 1, (Z_k)_l = 1\mid (Z_k)_{-j, -l} = \mat{0})$ with the marginal probability estimates $(\hat{\mat{M}}_{1(k)})_{j}$ and $(\hat{\mat{M}}_{2(k)})_{j l}$, respectively. Use \eqref{eq:ising-two-way-log-odds} then gives the initial estimates $\hat{\mat{\Omega}}_k^{(0)}$, with $j \neq l$ component wise
+which contains the $k$th mode first moment estimate in its diagonal $\hat{\mat{M}}_{1(k)} = \diag\hat{\mat{M}}_{2(k)}$. Considering every column of the matricized observation $(\ten{X}_i)_{(k)}$ as a $p_k$ dimensional observation. The number of those artificially generated observations is $n \prod_{j\neq k}p_j$. Let $Z_k$ denote the random variable those artificial observations are realization of. Then, we can interpret the elements $(\hat{\mat{M}}_{1(k)})_{j}$ as the estimates of the probability $P((Z_k)_j = 1)$, that is the marginal probability of the $j$th element of $Z_k$ being $1$. Similar, for $l \neq j$ we have $(\hat{\mat{M}}_{2(k)})_{j l}$ estimating $P((Z_k)_j = 1, (Z_k)_l = 1)$, the marginal probability of two-way interactions. % Without any regard of accuracy ...
+Now, we set the diagonal elements of $\mat{\Omega}_k$ to zero. For the off diagonal elements of $\mat{\Omega}_k$, we equate the conditional probabilities $P((Z_k)_j = 1 \mid (Z_k)_{-j} = \mat{0})$ and $P((Z_k)_j = 1, (Z_k)_l = 1\mid (Z_k)_{-j, -l} = \mat{0})$ with the marginal probability estimates $(\hat{\mat{M}}_{1(k)})_{j}$ and $(\hat{\mat{M}}_{2(k)})_{j l}$, respectively. Applying \eqref{eq:ising-two-way-log-odds} then gives the initial component-wise estimates $\hat{\mat{\Omega}}_k^{(0)}$, 
 \begin{equation}\label{eq:ising-init-Omegas}
    (\hat{\mat{\Omega}}_k^{(0)})_{j j} = 0,
    \qquad
-    (\hat{\mat{\Omega}}_k^{(0)})_{j l} = \log\frac{1 - (\hat{\mat{M}}_{1(k)})_{j}(\hat{\mat{M}}_{1(k)})_{l}}{(\hat{\mat{M}}_{1(k)})_{j}(\hat{\mat{M}}_{1(k)})_{l}}\frac{(\hat{\mat{M}}_{2(k)})_{j l}}{1 - (\hat{\mat{M}}_{2(k)})_{j l}}.
+    (\hat{\mat{\Omega}}_k^{(0)})_{j l} = \log\frac{1 - (\hat{\mat{M}}_{1(k)})_{j}(\hat{\mat{M}}_{1(k)})_{l}}{(\hat{\mat{M}}_{1(k)})_{j}(\hat{\mat{M}}_{1(k)})_{l}}\frac{(\hat{\mat{M}}_{2(k)})_{j l}}{1 - (\hat{\mat{M}}_{2(k)})_{j l}}, \, j \neq l.
 \end{equation}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsection{Gradient Optimization}
-Given initial values, the gradients provided by \cref{thm:grad} can be evaluated for the Ising model. The first step therefore is to determine the values of the inverse link components $\ten{g}_1(\mat{\gamma}_y) = \E[\ten{X} \mid Y = y]$ and $\ten{G}_2(\mat{\gamma}_y) = \ten{g}_2(\mat{\gamma}_y) = \E[\ten{X}\circ\ten{X} \mid Y = y]$. An immediate simplification is that the first moment is a part of the second moment. Its values are determined via $\vec(\E[\ten{X} \mid Y = y]) = \diag(\E[\ten{X}\circ\ten{X} \mid Y = y]_{(1, \ldots, r)})$. This means only the second moment needs to be computed, or estimated (see: \cref{sec:ising-bigger-dim}) in the case of slightly bigger $p$. For the Ising model, the conditional second moment with parameters $\mat{\gamma}_y$ is given by the matricized relation
+Given initial values, the gradients derived in  \cref{thm:grad} can be evaluated for the Ising model. The first step therefore is to determine the values of the inverse link components $\ten{g}_1(\mat{\gamma}_y) = \E[\ten{X} \mid Y = y]$ and $\ten{G}_2(\mat{\gamma}_y) = \ten{g}_2(\mat{\gamma}_y) = \E[\ten{X}\circ\ten{X} \mid Y = y]$. An immediate simplification is that the first moment is a part of the second moment. Its values are determined via $\vec(\E[\ten{X} \mid Y = y]) = \diag(\E[\ten{X}\circ\ten{X} \mid Y = y]_{(1, \ldots, r)})$. This means only the second moment needs to be computed, or estimated (see: \cref{sec:ising-bigger-dim}) in the case of slightly bigger $p$. For the Ising model, the conditional second moment with parameters $\mat{\gamma}_y$ is given by the matricized relation
 \begin{equation}\label{eq:ising-m2}
-    \ten{g}_2(\ten{\gamma}_y)_{(1, \ldots, r)} = \E[(\vec{\ten{X}})\t{(\vec{\ten{X}})}\mid Y = y] = p_0(\mat{\gamma}_y)\sum_{\mat{x}\in\{0, 1\}^{p}}\mat{x}\t{\mat{x}}\exp(\t{\vech(\mat{x}\t{\mat{x}})}\mat{\gamma}_y).
+    \ten{g}_2(\ten{\gamma}_y)_{(1, \ldots, r)} = \E\left[(\vec{\ten{X}})\t{(\vec{\ten{X}})}\mid Y = y\right] = p_0(\mat{\gamma}_y)\sum_{\mat{x}\in\{0, 1\}^{p}}\mat{x}\t{\mat{x}}\exp(\t{\vech(\mat{x}\t{\mat{x}})}\mat{\gamma}_y).
 \end{equation}
-The natural parameter $\mat{\gamma}_y$ is evaluated via \eqref{eq:ising-natural-params} enabling us to compute the partial gradients of the log-likelihood $l_n$ \eqref{eq:log-likelihood} for the Ising model by \cref{thm:grad} for the GMLM parameters $\mat{\beta}_k$ and $\mat{\Omega}_k$, $k = 1, \ldots, r$, at the current iterate $\mat{\theta}^{(I)} = (\mat{\beta}_1^{(I)}, \ldots, \mat{\beta}_r^{(I)}, \mat{\Omega}_1^{(I)}, \ldots, \mat{\Omega}_r^{(I)})$. Using classic gradient ascent for maximizing the log-likelihood, we have to specify a learning rate $\lambda\in\mathbb{R}_{+}$, usualy something like $10^{-3}$. The update rule is
+The natural parameter $\mat{\gamma}_y$ is evaluated via \eqref{eq:ising-natural-params} enabling us to compute the partial gradients of the log-likelihood $l_n$ \eqref{eq:log-likelihood} for the Ising model by \cref{thm:grad} for the GMLM parameters $\mat{\beta}_k$ and $\mat{\Omega}_k$, $k = 1, \ldots, r$, at the current iterate $\mat{\theta}^{(I)} = (\mat{\beta}_1^{(I)}, \ldots, \mat{\beta}_r^{(I)}, \mat{\Omega}_1^{(I)}, \ldots, \mat{\Omega}_r^{(I)})$. Using classic gradient ascent for maximizing the log-likelihood, we have to specify a learning rate $\lambda\in\mathbb{R}_{+}$, usually a value close to $10^{-3}$. The update rule is
 \begin{displaymath}
    \mat{\theta}^{(I + 1)} = \mat{\theta}^{(I)} + \lambda\nabla_{\mat{\theta}} l_n(\mat{\theta})\bigr|_{\mat{\theta} = \mat{\theta}^{(I)}},
 \end{displaymath}
-which is iterated till convergence. In practice, iteration is performed until ether a maximum number of iterations is exhausted and/or some break condition is satisfied. A proper choice of the learning rate is needed as a too large learning rate $\lambda$ causes instabilities, while a too low learning rate requires an enormous amount of iterations. Generically, there are two approach against the need to determine a proper learning rate. First, \emph{line search methods} determin an appropriate step size for every iteration. This works great if the evaluation of the object function (the log-likelihood) is cheap. This is not the case in our setting, see \cref{sec:ising-bigger-dim}. The second approach is an \emph{adaptive learning rate}. The basic idea is to track specific statistics while optimizing and dynamically adapt the leaning rate via well tested heuristics using the gathered knowledge from past iterations. We opted to use an adaptive leaning rate approach, this not only removes the need to determine an appropriate leaning rate, but also  accelerates learning.
+which is iterated till convergence. In practice, iteration is performed until either a maximum number of iterations is exhausted and/or some break condition is satisfied. A proper choice of the learning rate is needed as a large learning rate $\lambda$ may cause instability, while a very low learning rate requires an enormous amount of iterations. Generically, there are two approaches to avoid the need to determine a proper learning rate. First, \emph{line search methods} determine an appropriate step size for every iteration. This works well if the evaluation of the object function (the log-likelihood) is cheap. This is not the case in our setting, see \cref{sec:ising-bigger-dim}. The second approach is an \emph{adaptive learning rate}, where one tracks specific statistics while optimizing and dynamically adapting the learning rate via well-tested heuristics using the gathered knowledge from past iterations. We opted to use an adaptive learning rate approach, which not only removes the need to determine an appropriate learning rate, but also accelerates learning.

-Our method of choice is RMSprop, which stands for \emph{root mean squared propagation} \textcite{Hinton2012}. This is a well known method in machine learning for training neural networks. It is a variation of gradient descent with a per scalar parameter adaptive learning rate. It tracks a moving average of the element wise squared gradient $\mat{g}_2^{(I)}$, which is then used to scale (element wise) the gradient in the update rule. See \textcite{Hinton2012,GoodfellowEtAl2016} among others. The update rule using RMSprop for maximization\footnote{Instead of the more common minimization, therefore $+$ in the update of $\mat{\theta}$.} is
+Our method of choice is \emph{root mean squared propagation} (RMSprop) \parencite{Hinton2012}. This is a well known method in machine learning for training neural networks. It is a variation of gradient descent with a per scalar parameter adaptive learning rate. It tracks a moving average of the element wise squared gradient $\mat{g}_2^{(I)}$, which is then used to scale (element wise) the gradient in the update rule. See \textcite{Hinton2012,GoodfellowEtAl2016} among others. The update rule using RMSprop for maximization\footnote{Instead of the more common minimization, therefore $+$ in the update of $\mat{\theta}$.} is
 \begin{align*}
    \mat{g}_2^{(I + 1)} &= \nu \mat{g}_2^{(I)} + (1 - \nu)\nabla l_n(\mat{\theta}^{(I)})\odot\nabla l_n(\mat{\theta}^{(I)}), \\
    \mat{\theta}^{(I + 1)} &= \mat{\theta}^{(I)} + \frac{\lambda}{\sqrt{\mat{g}_2^{(I + 1)}} + \epsilon}\odot\nabla l_n(\mat{\theta}^{(I)}).
 \end{align*}
-The parameters $\nu = 0.9$, $\lambda = 10^{-3}$ and $\epsilon\approx 1.49\cdot 10^{-8}$ are fixed. The initial value of $\mat{g}_2^{(0)} = \mat{0}$, the symbol $\odot$ denotes the Hadamard product, that is the element wise multiplication. The division and square root operation are performed element wise as well. According to our experiments, RMSprop requires in the range of $50$ till $1000$ iterations till convergence while gradient ascent with a learning rate of $10^{-3}$ is in the range of $1000$ till $10000$.
+The parameters $\nu = 0.9$, $\lambda = 10^{-3}$ and $\epsilon\approx 1.49\cdot 10^{-8}$ are fixed. The initial value of $\mat{g}_2^{(0)} = \mat{0}$, the symbol $\odot$ denotes the Hadamard product, or  element wise multiplication. The division and square root operation are performed element wise as well. According to our experiments, RMSprop requires iterations in the range of $50$ till $1000$ till convergence while gradient ascent with a learning rate of $10^{-3}$ is in the range of $1000$ till $10000$.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsection{Small Data Sets}\label{sec:ising-small-data-sets}
@ -1135,85 +1145,20 @@ Next, going over the PSQTs one by one, a few words about the preferred positions
 The results of our analysis in the previous paragraph agree with the configuration of the chess board most associated with observed chess game outcomes. This arrangement also aligns with the understanding of human chess players of an average configuration at any moment during the game.   

 \section{Discussion}
+We have addressed sufficient dimension reduction for tensor valued predictors for regression or classification problems. Proposing a generalized multilinear model modeling the inverse conditional distribution we provided a multilinear sufficient reduction with consistent and asymptotic normal parameters. Moreover, our ansatz for proving the asymptotic results required by leveraging manifolds as a basis for resolving the issue of unidentifiable parameters lead to an even more flexible modeling framework. This allows to build complex and potentially problem specific parameter spaces incorporating additional domain specific knownledge into the model.

+An additional powerful extension of our model involves considering a sum of separable Kronecker predictors. This is motivated by the equivalence of a Kronecker product to a rank 1 tensor. By allowing a sum of a few separable Kronecker predictors, we remove the implicit rank 1 constraint. However, if this extension is to be applied to the SDR setting, as in this paper, it is crucial to ensure that the sum of Kronecker products forms a parameter manifold to apply our theory. While we anticipate that this approach can lead to intriguing and powerful models, there are certain details that need to be resolved first.
+
+\todo{finish!}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\printbibliography[heading=bibintoc, title={References}]
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \appendix
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-\section{Examples}\label{app:examples}
-
-\begin{example}[Vectorization]\label{ex:vectorization}
-    Given a matrix
-    \begin{displaymath}
-        \mat{A} = \begin{pmatrix}
-            1 & 4 & 7 \\
-            2 & 5 & 8 \\
-            3 & 6 & 9
-        \end{pmatrix}
-    \end{displaymath}
-    its vectorization is $\vec{\mat{A}} = \t{(1, 2, 3, 4, 5, 6, 7, 8, 9)}$ and its half vectorization $\vech{\mat{A}} = \t{(1, 2, 3, 5, 6, 9)}$. Let a $\ten{A}$ be a tensor of dimension $3\times 3\times 3$ given by
-    \begin{displaymath}
-        \ten{A}_{:,:,1} = \begin{pmatrix}
-            1 & 4 & 7 \\
-            2 & 5 & 8 \\
-            3 & 6 & 9
-        \end{pmatrix},
-        \qquad
-        \ten{A}_{:,:,2} = \begin{pmatrix}
-            10 & 13 & 16 \\
-            11 & 14 & 17 \\
-            12 & 15 & 18
-        \end{pmatrix},
-        \qquad
-        \ten{A}_{:,:,3} = \begin{pmatrix}
-            19 & 22 & 25 \\
-            20 & 23 & 26 \\
-            21 & 24 & 27
-        \end{pmatrix}.
-    \end{displaymath}
-    Then the vectorization of $\ten{A}$ is given by
-    \begin{displaymath}
-        \vec{\ten{A}} = \t{(1, 2, 3, 4, ..., 26, 27)}\in\mathbb{R}^{27}.
-    \end{displaymath}
-\end{example}
-
-\begin{example}[Matricization]
-    Let $\ten{A}$ be the $3\times 4\times 2$ tensor given by
-    \begin{displaymath}
-        \ten{A}_{:,:,1} = \begin{pmatrix}
-            1 & 4 & 7 & 10 \\
-            2 & 5 & 8 & 11 \\
-            3 & 6 & 9 & 12
-        \end{pmatrix},
-        \ten{A}_{:,:,2} = \begin{pmatrix}
-            13 & 16 & 19 & 22 \\
-            14 & 17 & 20 & 23 \\
-            15 & 18 & 21 & 24
-        \end{pmatrix}.
-    \end{displaymath}
-    Its matricizations are then
-    \begin{gather*}
-        \ten{A}_{(1)} = \begin{pmatrix}
-            1 &  4 &  7 & 10 & 13 & 16 & 19 & 22 \\
-            2 &  5 &  8 & 11 & 14 & 17 & 20 & 23 \\
-            3 &  6 &  9 & 12 & 15 & 18 & 21 & 24
-        \end{pmatrix},
-        \qquad
-        \ten{A}_{(2)} = \begin{pmatrix}
-             1 &  2 &  3 & 13 & 14 & 15 \\
-             4 &  5 &  6 & 16 & 17 & 18 \\
-             7 &  8 &  9 & 19 & 20 & 21 \\
-            10 & 11 & 12 & 22 & 23 & 24
-        \end{pmatrix}, \\
-        \ten{A}_{(3)} = \begin{pmatrix}
-             1 &  2 &  3 &  4 &  5 &  6 &  7 &  8 &  9 &  10 &  11 &  12 \\
-            13 & 14 & 15 & 16 & 17 & 18 & 19 & 20 & 21 &  22 &  23 &  24
-        \end{pmatrix}.
-    \end{gather*}
-\end{example}
-
-
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Tensor Calculus and Multi Linear Algebra}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -1776,9 +1721,77 @@ The following is a technical Lemma used in the proof of \cref{thm:asymptotic-nor
 \end{proof}


+\section{Examples}\label{app:examples}
+
+\begin{example}[Vectorization]\label{ex:vectorization}
+    Given a matrix
+    \begin{displaymath}
+        \mat{A} = \begin{pmatrix}
+            1 & 4 & 7 \\
+            2 & 5 & 8 \\
+            3 & 6 & 9
+        \end{pmatrix}
+    \end{displaymath}
+    its vectorization is $\vec{\mat{A}} = \t{(1, 2, 3, 4, 5, 6, 7, 8, 9)}$ and its half vectorization $\vech{\mat{A}} = \t{(1, 2, 3, 5, 6, 9)}$. Let a $\ten{A}$ be a tensor of dimension $3\times 3\times 3$ given by
+    \begin{displaymath}
+        \ten{A}_{:,:,1} = \begin{pmatrix}
+            1 & 4 & 7 \\
+            2 & 5 & 8 \\
+            3 & 6 & 9
+        \end{pmatrix},
+        \qquad
+        \ten{A}_{:,:,2} = \begin{pmatrix}
+            10 & 13 & 16 \\
+            11 & 14 & 17 \\
+            12 & 15 & 18
+        \end{pmatrix},
+        \qquad
+        \ten{A}_{:,:,3} = \begin{pmatrix}
+            19 & 22 & 25 \\
+            20 & 23 & 26 \\
+            21 & 24 & 27
+        \end{pmatrix}.
+    \end{displaymath}
+    Then the vectorization of $\ten{A}$ is given by
+    \begin{displaymath}
+        \vec{\ten{A}} = \t{(1, 2, 3, 4, ..., 26, 27)}\in\mathbb{R}^{27}.
+    \end{displaymath}
+\end{example}
+
+\begin{example}[Matricization]
+    Let $\ten{A}$ be the $3\times 4\times 2$ tensor given by
+    \begin{displaymath}
+        \ten{A}_{:,:,1} = \begin{pmatrix}
+            1 & 4 & 7 & 10 \\
+            2 & 5 & 8 & 11 \\
+            3 & 6 & 9 & 12
+        \end{pmatrix},
+        \ten{A}_{:,:,2} = \begin{pmatrix}
+            13 & 16 & 19 & 22 \\
+            14 & 17 & 20 & 23 \\
+            15 & 18 & 21 & 24
+        \end{pmatrix}.
+    \end{displaymath}
+    Its matricizations are then
+    \begin{gather*}
+        \ten{A}_{(1)} = \begin{pmatrix}
+            1 &  4 &  7 & 10 & 13 & 16 & 19 & 22 \\
+            2 &  5 &  8 & 11 & 14 & 17 & 20 & 23 \\
+            3 &  6 &  9 & 12 & 15 & 18 & 21 & 24
+        \end{pmatrix},
+        \qquad
+        \ten{A}_{(2)} = \begin{pmatrix}
+             1 &  2 &  3 & 13 & 14 & 15 \\
+             4 &  5 &  6 & 16 & 17 & 18 \\
+             7 &  8 &  9 & 19 & 20 & 21 \\
+            10 & 11 & 12 & 22 & 23 & 24
+        \end{pmatrix}, \\
+        \ten{A}_{(3)} = \begin{pmatrix}
+             1 &  2 &  3 &  4 &  5 &  6 &  7 &  8 &  9 &  10 &  11 &  12 \\
+            13 & 14 & 15 & 16 & 17 & 18 & 19 & 20 & 21 &  22 &  23 &  24
+        \end{pmatrix}.
+    \end{gather*}
+\end{example}

-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\printbibliography[heading=bibintoc, title={References}]
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 \end{document}