effie
This commit is contained in:
parent
8fd80522f0
commit
7a68948d26
2069
LaTeX/main.bib
2069
LaTeX/main.bib
File diff suppressed because it is too large
Load Diff
281
LaTeX/paper.tex
281
LaTeX/paper.tex

@ 281,23 +281,45 @@


\section{Introduction}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%




In Statistics, tensors are a mathematical tool to represent data of complex structure. In this paper, \textit{tensors} are considered as a generalization of matrices to higher dimensions: A tensor is a multidimensional array of numbers. For example, a secondorder tensor can be represented as a matrix, while a thirdorder tensor can be represented as a cube of matrices.


Tensors are a mathematical tool to represent data of complex structure in statistics. \textit{Tensors} are considered as a generalization of matrices to higher dimensions: A tensor is a multidimensional array of numbers. For example, a secondorder tensor can be represented as a matrix, while a thirdorder tensor can be represented as a cube of matrices.




Complex data are collected at different times and/or under several conditions often involving a large number of multiindexed variables represented as tensorvalued data \parencite{KoldaBader2009}. They occur in largescale longitudinal studies \parencite[e.g.][]{Hoff2015}, in agricultural experiments and chemometrics and spectroscopy \parencite[e.g.][]{LeurgansRoss1992,Burdick1995}. Also, in signal and video processing where sensors produce multiindexed data, e.g. over spatial, frequency, and temporal dimensions \parencite[e.g.][]{DeLathauwerCastaing2007,KofidisRegalia2005}, in telecommunications \parencite[e.g.][]{DeAlmeidaEtAl2007}. Other examples of multiway data include 3D images of the brain, where the modes are the 3 spatial dimensions, and spatiotemporal weather imaging data, a set of image sequences represented as 2 spatial modes and 1 temporal mode.


Complex data are collected at different times and/or under several conditions often involving a large number of multiindexed variables represented as tensorvalued data \parencite{KoldaBader2009}. They occur in largescale longitudinal studies \parencite[e.g.][]{Hoff2015}, in agricultural experiments and chemometrics and spectroscopy \parencite[e.g.][]{LeurgansRoss1992,Burdick1995}, in signal and video processing where sensors produce multiindexed data, e.g. over spatial, frequency, and temporal dimensions \parencite[e.g.][]{DeLathauwerCastaing2007,KofidisRegalia2005}, and in telecommunications \parencite[e.g.][]{DeAlmeidaEtAl2007}. Other examples of multiway data include 3D images of the brain, where the modes are the 3 spatial dimensions, and spatiotemporal weather imaging data, a set of image sequences represented as 2 spatial modes and 1 temporal mode.




\begin{itemize}


\item Review \cite{ZhouLiZhu2013} and see how you compare with them. They focus on the forward regression model with a scalar response but they claim that "Exploit ing the array structure in imaging data, the new method substantially reduces the dimensionality of imaging data, which leads to efficient estimation and prediction."


\item Read \cite{ZhouEtAl2023} to figure out the distribution they use for the tensorvalued predictors and briefly describe what they do.


\item Read \cite{RabusseauKadri2016} to figure out what they do. They seem to draw both the response and the predictors from tensornormal with iid N(0,1) entries: "In order to leverage the tensor structure of the output data, we formulate the problem as the minimization of a least squares criterion subject to a multilinear rank constraint on the regression tensor. The rank constraint enforces the model to capture lowrank structure in the outputs and to explain dependencies between inputs and outputs in a lowdimensional multilinear subspace."


\item


\end{itemize}


% \begin{itemize}


% \item Review \cite{ZhouLiZhu2013} and see how you compare with them. They focus on the forward regression model with a scalar response but they claim that "Exploiting the array structure in imaging data, the new method substantially reduces the dimensionality of imaging data, which leads to efficient estimation and prediction."


% \item Read \cite{ZhouEtAl2023} to figure out the distribution they use for the tensorvalued predictors and briefly describe what they do.


% \item Read \cite{RabusseauKadri2016} to figure out what they do. They seem to draw both the response and the predictors from tensornormal with iid N(0,1) entries: "In order to leverage the tensor structure of the output data, we formulate the problem as the minimization of a least squares criterion subject to a multilinear rank constraint on the regression tensor. The rank constraint enforces the model to capture lowrank structure in the outputs and to explain dependencies between inputs and outputs in a lowdimensional multilinear subspace."


% \end{itemize}




%  RabusseauKadri2016 Y  x for tensor Y with vector x (HOLRR; Higher Order LowRank Regression)


%  LiZhang2017 Y  x for tensoe Y with vector x (envelope model)


%  ZhouEtAl2023 Y  x for tensor Y with vector x (sparse and partially observed)




%  ZhouLiZhu2013 y\in R (GLM) for y  Z, X for tensor X


%  HaoEtAl2021 y\in R for y  X for tensor X (sparse, element wise Bsplines)




% Tensor regression models have been proposed to exploit the special structure of tensor covariates, e.g. \cite{HaoEtAl2021,ZhouLiZhu2013}, or tensor responses \cite{RabusseauKadri2016,LiZhang2017,ZhouEtAl2023} \cite{HaoEtAl2021} modeled a scalar response as a flexible nonparametric function of tensor covariates. \cite{ZhouLiZhu2013} assume the scalar response has a distribution in the exponential family given the tensorvalued predictors and model the link function as a multilinear function of the predictors. \cite{LiZhang2017} model the tensorvalued response as tensor normal. Rather than using $L_1$ type penalty functions to induce sparsity, they employ the envelope method (Cook, Li, and Chiaromonte Citation2010) to estimate the unknown regression coefficient. Moreover, the envelope method essentially identifies and uses the material information jointly. They develop an estimation algorithm and study the asymptotic properties of the estimator. the scalar response as These models try to utilize the sparse and lowrank structures in the tensors  either in the regression coefficient tensor or the response tensor  to boost performance on the regression task by reducing the number of free parameters.




Tensor regression models have been proposed to leverage the structure inherent in tensor valued data. For instance, \textcite{HaoEtAl2021,ZhouLiZhu2013} focus on tensor covariates, while \textcite{RabusseauKadri2016,LiZhang2017,ZhouLiZhu2013} focus on tensor responses, and \textcite{Hoff2015,Lock2018} consider tensor on tensor regression. \textcite{HaoEtAl2021} modeled a scalar response as a flexible nonparametric function of tensor covariates. \textcite{ZhouLiZhu2013} assume the scalar response has a distribution in the exponential family given the tensorvalued predictors with the link modeled as a multilinear function of the predictors. \textcite{RabusseauKadri2016} model the tensorvalued response as a linear model with tensor valued regression coefficients subject to a multilinear rank constraint. \textcite{LiZhang2017} approach the problem with a similar linear model but instead of a low rank constraint the error term is assumed to have a separable Kronecker product structure while using a generalization of the envelope model \parencite{CookLiChiaromonte2010}. \textcite{ZhouEtAl2023} focus on partially observed tensor response given vectorvalued predictors with modewise sparsity constraints in the regression coefficients. \textcite{Hoff2015} extends an existing bilinear regression model to a tensor on tensor of conformable modes and dimensions regression model based on a Tucker product. \textcite{Lock2018} uses a tensor contraction to build a penalized least squares model for a tensor with arbitrary number of modes and dimensions.




Our approach considers the general regression problem of fitting a response of general form (univariate, multivariate, tensorvalued) on a tensorvalue predictor. We operate in the context of sufficient dimension reduction \parencite[e.g.]{Cook1998,Li2018} based on inverse regression, which leads us to regressing the tensorvalued predictor on the response. In our setting, this necessitates transforming the response to tensorvalued functions, regardless of whether it is itself tensorvalued. Because of the setting, our method shares commonalities with the tensor regression models referred to above, yet the modeling and methodology are novel.


Specifically, our tensortotensor regression model is a generalized multilinear model similar to the generalized linear model of \cite{ZhouLiZhu2013}. % but with tensor valued response by applying (a known) tensor valued function to the response in an inverse regression setting, reversing the role of response and predictors.


To bypass the explosion of number of parameters to estimate, we assume the inverse regression error covariance has Kronecker product structure as do \textcite{LiZhang2017}. Our maximum likelihoodbased estimation does not require any penalty terms in contrast to the least squares and/or sparse approaches \cite{????} . In the case of a tensor (multilinear) normal, given the tensorvalued function of the response, our model exhibits similarities to the multilinear modeling of \textcite{Hoff2015}, but we use a generalized multilinear model and estimate the parameters with maximum likelihood instead of least squares. Moreover, a common issue in multilinear tensor regression models is the unidentifiability of the parameters, which we address in a completely different manner. For example, \cite{LiZhang2017} develop theory that is based on orthogonal projection matrices to uniquely identify a subspace, while our approach is more general as it uses manifold theory.






Tensor regression models have been proposed to exploit the special structure of tensor covariates, e.g. \cite{HaoEtAl2021,ZhouLiZhu2013}, or tensor responses \cite{RabusseauKadri2016,LiZhang2017,ZhouEtAl2023} \cite{HaoEtAl2021} modeled a scalar response as a flexible nonparametric function of tensor covariates. \cite{ZhouLiZhu2013} assume the scalar response has a distribution in the exponential family given the tensorvalued predictors and model the link function as a multilinear function of the predictors. \cite{LiZhang2017} model the tensorvalued response as tensor normal. Rather than using $L_1$ type penalty functions to induce sparsity, they employ the envelope method (Cook, Li, and Chiaromonte Citation2010) to estimate the unknown regression coefficient. Moreover, the envelope method essentially identifies and uses the material information jointly. They develop an estimation algorithm and study the asymptotic properties of the estimator. the scalar response as These models try to utilize the sparse and lowrank structures in the tensors – either in the regression coefficient tensor or the response tensor – to boost performance on the regression task by reducing the number of free parameters.


In this paper we present a modelbased \emph{Sufficient Dimension Reduction} (SDR) method for tensorvalued data with distribution in the quadratic exponential family assuming a separable Kronecker product structure of the first and second moment. By generalizing the parameter space to embedded manifolds we obtain consistency and asymptotic normality results while allowing great modeling flexibility in the linear sufficient dimension reduction.




The quadratic exponential family contains the tensor normal and the tensor Ising distributions, for continuous and binary tensorvalued random variables, respectively.




Multilinear tensor normal models have been used in various applications, including medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatiotemporal data analysis \parencite{GreenewaldHero2014}, regression analysis for longitudinal relational data \parencite{Hoff2015}. One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI) \parencite{OhlsonEtAl2013}. A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.




In this paper we present a modelbased \emph{Sufficient Dimension Reduction} (SDR) method for tensorvalued data with distribution in the quadratic exponential family assuming a Kronecker product structure of the first and second moment. By generalizing the parameter space to embedded manifolds we obtain consistency and asymtotic normality results while allowing great modeling flexibility in the linear sufficient dimension reduction.


The Ising model for multivariate binary outcomes belongs to the class of discrete exponential families. Its defining feature is that the sufficient statistic involves a quadratic term to capture correlations arising from pairwise interactions.


The tensor Ising model is a higherorder Ising model for tensorvalued binary outcomes.


%From \cite{MukherjeeEtAl2020}


Higherorder Ising models arise naturally in the study of multiatom interactions in lattice gas models, such as the squarelattice eightvertex model, the AshkinTeller model, and Suzuki's pseudo3D anisotropic model (cf. [6, 33, 36, 37, 49, 55, 56, 61, 62] and the references therein). More recently, higherorder spin systems have also been proposed for modeling peergroup effects in social networks [22]. \textcite{MukherjeeEtAl2020} proposed a maximum pseudolikelihood estimation algorithm for a oneparameter tensorIsing model. \efi{Daniel: comment on what these guys do and contrast with your setting} In our approach, the parameter is not constrained to be scalar


We derive maximum likelihood estimates for all first and second order interactions and propose a gradientbased optimization algorithm.




Our results in the framework of the quadratic exponential family for tensorvalued variables; i.e., consistency and asymptotic normality, apply to both tensor normal and tensor Ising models.




The structure of this paper is as follows. In \cref{sec:notation} we introduce our notation. \Cref{sec:problemformulation} decribes the exact problem and in \cref{sec:gmlmmodel} we introduce our model. Continuing in \cref{sec:mlestimation} we provide the basis for a general maximum likelihood estimation procedure and derive specialized methods for tensor normal as well as the tensor Ising distributions. \Cref{sec:manifolds} gives a short introduction into manifolds and provides the basis for applying the consistency and asymtotic normality results from \cref{sec:asymtotics}. Simulations for continuous and binary predicotrs are subject of \cref{sec:simulations}. Finally, in \cref{sec:dataanalysis} we apply our model to EEG data and perform a prove of concept data analysis example where a chess board is interprated as a collection of binary $8\times 8$ matrices.





@ 359,7 +381,7 @@ where the vectorized quantities $\vec{\ten{X}}\in\mathbb{R}^p$ and $\vec\ten{F}(


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Problem Formulation}\label{sec:problemformulation}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Our goal is to infer the cumulative distribution function (cdf) $F$ of $Y\mid \ten{X}$, where $\ten{X}$ is assumed to admit $r$tensor structure of dimension $p_1\times ... \times p_r$ with continuous or discrete entries and the response $Y$ is unconstrained. The predictor $\ten{X}$ is a complex object; to simplify the problem we assume there exists a tensor valued function of lower dimension $\ten{R}:\ten{X}\mapsto \ten{R}(\ten{X})$ such that


Our goal is to infer the cumulative distribution function (cdf) $F$ of $Y\mid \ten{X}$, where $\ten{X}$ is assumed to admit $r$tensor structure of dimension $p_1\times ... \times p_r$ with continuous or discrete entries, and the response $Y$ is unconstrained. The predictor $\ten{X}$ is a complex object; to simplify the problem we assume there exists a tensorvalued function of lower dimension $\ten{R}:\ten{X}\mapsto \ten{R}(\ten{X})$ such that


\begin{displaymath}


F(Y\mid \ten{X}) = F(Y\mid \ten{R}(\ten{X})).


\end{displaymath}



@ 379,7 +401,7 @@ f_{\mat{\eta}_y}(\ten{X}\mid Y = y)


&= h(\ten{X})\exp(\t{\mat{\eta}_y}\mat{t}(\ten{X})  b(\mat{\eta}_y)) \nonumber \\


&= h(\ten{X})\exp(\langle \mat{t}_1(\ten{X}), \mat{\eta}_{1y} \rangle + \langle \mat{t}_2(\ten{X}), \mat{\eta}_{2y} \rangle  b(\mat{\eta}_{y})) \label{eq:quaddensity}


\end{align}


where $\mat{t}_1(\ten{X})=\vec \ten{X}$ and $\mat{t}_2(\ten{X})$ is linear in $\ten{X}\circ\ten{X}$. The dependence of $\ten{X}$ on $Y$ is fully captured in the natural parameter $\mat{\eta}_y$. The function $h$ is nonnegative realvalued and $b$ is assumed to be at least twice continuously differentiable and strictly convex. An important feature of the \emph{quadratic exponential family} is that the distribution of its members is fully characterized by their first two moments. Distributions within the quadratic exponential family include the \emph{tensor normal} (\cref{sec:tensornormalestimation}) and \emph{tensor Ising model} (\cref{sec:ising_estimation}, a generalization of the (inverse) Ising model which is multivariate Bernoulli with up to second order interactions) and mixtures of these two.


where $\mat{t}_1(\ten{X})=\vec \ten{X}$ and $\mat{t}_2(\ten{X})$ is linear in $\ten{X}\circ\ten{X}$. The dependence of $\ten{X}$ on $Y$ is fully captured in the natural parameter $\mat{\eta}_y$. The function $h$ is nonnegative realvalued and $b$ is assumed to be at least twice continuously differentiable and strictly convex. An important feature of the \emph{quadratic exponential family} is that the distribution of its members is fully characterized by their first two moments. Distributions within the quadratic exponential family include the \emph{tensor normal} (\cref{sec:tensornormalestimation}) and \emph{tensor Ising model} (\cref{sec:ising_estimation}, a generalization of the (inverse)\footnote{\todo{}} Ising model which is multivariate Bernoulli with up to second order interactions) and mixtures of these two.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{The Generalized MultiLinear Model}\label{sec:gmlmmodel}



@ 440,7 +462,7 @@ Under the quadratic exponential family model \eqref{eq:quadraticexpfam}, a su




The reduction in vectorized form is $\vec\ten{R}(\ten{X})=\t{\mat{B}}\vec(\ten{X}  \E\ten{X})$, where $\mat{B} = \bigotimes_{k = r}^{1}\mat{\beta}_k$ with $\Span(\mat{B}) = \Span(\{\mat{\eta}_{1y}  \E_{Y}\mat{\eta}_{1Y} : y\in\mathcal{S}_Y\})$, using $\mathcal{S}_Y$ to denote the set of values of the random variable $Y$.




\cref{thm:sdr} obtains that the \emph{sufficient reduction} $\ten{R}(\ten{X})$ reduces $\ten{X}$ along each dimension linearly. The graph in \cref{fig:SDRvisual} is a visual depiction of the sufficient reduction.


\cref{thm:sdr} obtains the \emph{sufficient reduction} $\ten{R}(\ten{X})$ reduces $\ten{X}$ along each dimension linearly. The graph in \cref{fig:SDRvisual} is a visual depiction of the sufficient reduction.




\begin{figure}


\centering



@ 464,7 +486,7 @@ The reduction in vectorized form is $\vec\ten{R}(\ten{X})=\t{\mat{B}}\vec(\ten{X


\end{example}




\begin{example}[Matrix valued $\mat{X}$ ($r = 2$)]


Assuming $\mat{X}$ to be matrix valued, that is $r = 2$, $\mat{\theta} = (\overline{\mat{\eta}}, \mat{\beta}_1, \mat{\beta}_2, \mat{\Omega}_1, \mat{\Omega}_2)$, where the intercept term $\overline{\mat{\eta}}\in\mathbb{R}^{p_1\times p_2}$ is now matrix valued. Similar to \cref{ex:vectordist} with $\mat{F}_y\in\mathbb{R}^{q_1\times q_2}$ being matrix valued, the conditional density of $\mat{X}\mid Y = y$ reads


Assuming $\mat{X}$ to be matrix valued, that is $r = 2$, $\mat{\theta} = (\overline{\mat{\eta}}, \mat{\beta}_1, \mat{\beta}_2, \mat{\Omega}_1, \mat{\Omega}_2)$, where the intercept term $\overline{\mat{\eta}}\in\mathbb{R}^{p_1\times p_2}$ is now matrix valued. Similar to \cref{ex:vectordist} with $\mat{F}_y\in\mathbb{R}^{q_1\times q_2}$ being matrix valued, the conditional density of $\mat{X}\mid Y = y$ is


\begin{align*}


f_{\mat{\theta}}(\mat{X}\mid Y = y)


&= h(\mat{X})\exp(\langle\vec{\mat{X}}, \mat{\eta}_{1y}(\mat{\theta})\rangle + \langle\vec(\mat{X}\circ\mat{X}), \mat{\eta}_2(\mat{\theta})\rangle  b(\mat{\eta}_y(\mat{\theta}))) \\



@ 502,7 +524,8 @@ A straightforward and general method for parameter estimation is \emph{gradient


If $\mat{T}_2$ is the identity matrix $\mat{I}_{p(p + 1) / 2}$, then $\ten{G}_2(\mat{\eta}_y) = \ten{g}_2(\mat{\eta}_y)$.


\end{theorem}




Although the general case of any GMLM model can be fitted via gradient descent using \cref{thm:grad}, this may be very inefficient. In \cref{thm:grad}, $\mat{T}_2$ can be used to introduce flexible second moment structures. For example, it allows modeling effects differently for predictor components, as described in \cref{sec:ising_estimation} after Eqn. \eqref{eq:isingcondprob}. In the remainder, we focus on $\mat{T}_2$'s that are identity matrices. This approach simplifies the estimation algorithm and the speed of the numerical calculation in the case of tensor normal predictors. In the case of the tensor normal distribution, an iterative cyclic updating scheme is derived in \cref{sec:tensornormalestimation}, which has much faster convergence, is stable and does not require hyper parameters, as will be discussed later. On the other hand, the Ising model does not allow such a scheme. There we need to use a gradient based method, which is the subject of \cref{sec:ising_estimation}.


Although the general case of any GMLM model can be fitted via gradient descent using \cref{thm:grad}, this may be very inefficient. In \cref{thm:grad}, $\mat{T}_2$ can be used to introduce flexible second moment structures. For example, it allows modeling effects differently for predictor components, as described in \cref{sec:ising_estimation} after Eqn. \eqref{eq:isingcondprob}. In the remainder, we focus on $\mat{T}_2$'s that are identity matrices. This approach simplifies the estimation algorithm and the speed of the numerical calculation in the case of tensor normal predictors. % In the case of the tensor normal distribution,


An iterative cyclic updating scheme is derived in \cref{sec:tensornormalestimation}, which has much faster convergence, is stable and does not require hyperparameters, as will be discussed later. On the other hand, the Ising model does not allow such a scheme. There we need to use a gradientbased method, which is the subject of \cref{sec:ising_estimation}.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsection{Tensor Normal}\label{sec:tensornormalestimation}



@ 512,7 +535,14 @@ representing a vector of observations \parencite{OhlsonEtAl2013}. \textcite{Koll




The defining feature of the matrix normal distribution, and its tensor extension, is the Kronecker product structure of its covariance. This formulation, where the covariates are multivariate normal with multiway covariance structure modeled as a Kronecker product of matrices of much lower dimension, aims to overcome the significant modeling and computational challenges arising from the high computational complexity of manipulating tensor representations \parencite[see, e.g.,][]{HillarLim2013,WangEtAl2022}.




% Multilinear tensor normal models have been used in various applications, including medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatiotemporal data analysis \parencite{GreenewaldHero2014}, regression analysis for longitudinal relational data \parencite{Hoff2015}. One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI) \parencite{OhlsonEtAl2013}. A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.


Multilinear tensor normal %Kroneckerseparable covariance


models have been used in various applications, including


medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatiotemporal data analysis \parencite{GreenewaldHero2014}, regression analysis


for longitudinal relational data \parencite{Hoff2015}.


%, radar [AFJ10], and multipleinputmultipleoutput (MIMO) communications [WJS08].


One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI) \parencite{OhlsonEtAl2013}.


A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.


%The first occurrence of the \textit{matrix normal} we found, even though not explicitly called as such, was in \textcite{SrivastavaKhatri1979}.




Suppose $\ten{X}\mid Y = y$ follows a tensor normal distribution with mean $\ten{\mu}_y$ and covariance $\mat{\Sigma} = \bigkron_{k = r}^{1}\mat{\Sigma}_k$. We assume the distribution is nondegenerate which means that the covariances $\mat{\Sigma}_k$ are symmetric positive definite matrices. Its density is given by


\begin{displaymath}



@ 528,7 +558,7 @@ where we used that $\overline{\ten{\eta}} = 0$ due to $0 = \E\ten{X} = \E\E[\ten


\ten{g}_1(\mat{\eta}_y) = \E[\ten{X}\mid Y = y] = \ten{\mu}_y, \qquad


\ten{G}_2(\mat{\eta}_y) = \ten{g}_2(\mat{\eta}_y) = \E[\ten{X}\circ\ten{X}\mid Y = y] \equiv \bigkron_{k = r}^1\mat{\Sigma}_k + (\vec{\ten{\mu}}_y)\t{(\vec{\ten{\mu}}_y)}.


\end{displaymath}


In practice, we assume we have a random sample of $n$ observations $(\ten{X}_i, \ten{F}_{y_i})$ from the joint distribution. We start the estimation process by demeaning them. Then, only the reduction matrices $\mat{\beta}_k$ and the scatter matrices $\mat{\Omega}_k$ need to be estimated. To solve the optimization problem \eqref{eq:mle}, with $\overline{\ten{\eta}} = 0$ we initialize the parameters using a simple heuristic approach. % For initial estimates $\hat{\mat{\beta}}_k^{(0)}$ we


In practice, we assume we have a random sample of $n$ observations $(\ten{X}_i, \ten{F}_{y_i})$ from the joint distribution. We start the estimation process by demeaning the data. Then, only the reduction matrices $\mat{\beta}_k$ and the scatter matrices $\mat{\Omega}_k$ need to be estimated. To solve the optimization problem \eqref{eq:mle}, with $\overline{\ten{\eta}} = 0$ we initialize the parameters using a simple heuristic approach. % For initial estimates $\hat{\mat{\beta}}_k^{(0)}$ we


First, we compute moment based modewise marginal covariance estimates $\widehat{\mat{\Sigma}}_k(\ten{X})$ and $\widehat{\mat{\Sigma}}_k(\ten{F}_Y)$ as


\begin{displaymath}


\widehat{\mat{\Sigma}}_k(\ten{X}) = \frac{1}{n}\sum_{i = 1}^{n} (\ten{X}_i)_{(k)}\t{(\ten{X}_i)_{(k)}}, \qquad



@ 538,7 +568,7 @@ Then, for every mode $k = 1, \ldots, r$, we compute the first $j = 1, \ldots, q_


\begin{align*}


\mat{U}_k &= (\mat{v}_1(\widehat{\mat{\Sigma}}_1(\ten{X})), \ldots, \mat{v}_{q_k}(\widehat{\mat{\Sigma}}_{q_k}(\ten{X}))), \\


\mat{D}_k &= \diag(\mat{v}_1(\widehat{\mat{\Sigma}}_1(\ten{X}))\mat{v}_1(\widehat{\mat{\Sigma}}_1(\ten{F}_{Y})), \ldots, \mat{v}_{q_k}(\widehat{\mat{\Sigma}}_{q_k}(\ten{X}))\mat{v}_{q_k}(\widehat{\mat{\Sigma}}_k(\ten{F}_{Y}))), \\


\mat{V}_k &= (\mat{v}_1(\widehat{\mat{\Sigma}}_1(\ten{F}_Y), \ldots, \mat{v}_{q_k}(\widehat{\mat{\Sigma}}_{q_k}(\ten{F}_Y)). \\


\mat{V}_k &= (\mat{v}_1(\widehat{\mat{\Sigma}}_1(\ten{F}_Y), \ldots, \mat{v}_{q_k}(\widehat{\mat{\Sigma}}_{q_k}(\ten{F}_Y)).


\end{align*}


The initial value of $\mat{\beta}_k$ is


\begin{displaymath}



@ 561,7 +591,7 @@ Given $\hat{\mat{\beta}}_1, \ldots, \hat{\mat{\beta}}_r, \hat{\mat{\Omega}}_1,


\hat{\mat{\Omega}}_j.


\end{equation}


%For the scatter matrices $\mat{\Omega}_j$, we need to fudge a bit.


Equating the partial gradient of the $j$th scatter matrix $\mat{\Omega}_j$ in \cref{thm:grad} to zero ( $\nabla_{\mat{\Omega}_j}l_n = 0$) gives a quadratic matrix equation. This is due to the dependence of $\ten{\mu}_y$ on $\mat{\Omega}_j$. In practice though, it is faster, more stable, and equally accurate to use modewise covariance estimates via the residuals


Equating the partial gradient of the $j$th scatter matrix $\mat{\Omega}_j$ in \cref{thm:grad} to zero ( $\nabla_{\mat{\Omega}_j}l_n = 0$) gives a quadratic matrix equation due to the dependence of $\ten{\mu}_y$ on $\mat{\Omega}_j$. In practice though, it is faster, more stable, and equally accurate to use modewise covariance estimates via the residuals


\begin{displaymath}


\hat{\ten{R}}_i = \ten{X}_i  \hat{\ten{\mu}}_{y_i} = \ten{X}_i  \ten{F}_{y_i}\mlm_{k = 1}^{r}\hat{\mat{\Omega}}_k^{1}\hat{\mat{\beta}}_k.


\end{displaymath}



@ 569,7 +599,7 @@ The estimates are computed via


\begin{displaymath}


\tilde{\mat{\Sigma}}_j = \sum_{i = 1}^n (\hat{\ten{R}}_i)_{(j)} \t{(\hat{\ten{R}}_i)_{(j)}},


\end{displaymath}


where $\tilde{s}\tilde{\mat{\Sigma}}_j = \hat{\mat{\Omega}}_j^{1}$. For scaling we use that the mean squared error has to be equal to the trace of the covariance estimate,


where $\tilde{s}\tilde{\mat{\Sigma}}_j = \hat{\mat{\Omega}}_j^{1}$. To decide on the scaling factor $\tilde{s}$ we use that the mean squared error has to be equal to the trace of the covariance estimate,


\begin{displaymath}


\frac{1}{n}\sum_{i = 1}^n \langle \hat{\ten{R}}_i, \hat{\ten{R}}_i \rangle = \tr\bigkron_{k = r}^{1}\hat{\mat{\Omega}}_k^{1} = \prod_{k = 1}^{r}\tr{\hat{\mat{\Omega}}_k^{1}} = \tilde{s}^r\prod_{k = 1}^{r}\tr{\tilde{\mat{\Sigma}}_k},


\end{displaymath}



@ 580,43 +610,23 @@ so that


resulting in the estimates $\hat{\mat{\Omega}}_j = (\tilde{s}\tilde{\mat{\Sigma}}_j)^{1}$.


Estimation is then performed by updating the estimates $\hat{\mat{\beta}}_j$ via \eqref{eq:tensor_normal_beta_solution} for $j = 1, \ldots, r$, and then recompute the $\hat{\mat{\Omega}}_j$ estimates simultaneously keeping the $\hat{\mat{\beta}}_j$'s fixed. This procedure is repeated until convergence. % Convergence is very fast, experiments showed that convergence occures usualy in less than $10$ iterations.




A technical detail for numerical stability is to ensure that the scaled values $\tilde{s}\tilde{\mat{\Sigma}}_j$, assumed to be symmetric and positive definite, are well conditioned. Thus, we estimate the condition number of $\tilde{s}\tilde{\mat{\Sigma}}_j$ prior to computing the inverse. In case of ill conditioning, we use the regularized $\hat{\mat{\Omega}}_j = (\tilde{s}\tilde{\mat{\Sigma}}_j + 0.2 \lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)\mat{I}_{p_j})^{1}$ instead, where $\lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)$ is the first (maximum) eigenvalue. Experiments showed that this regularization is usually only required in the first few iterations.


A technical detail for numerical stability is to ensure that the scaled values $\tilde{s}\tilde{\mat{\Sigma}}_j$, assumed to be symmetric and positive definite, are well conditioned. Thus, we estimate the condition number of $\tilde{s}\tilde{\mat{\Sigma}}_j$ before computing the inverse. In case of illconditioning, we use the regularized $\hat{\mat{\Omega}}_j = (\tilde{s}\tilde{\mat{\Sigma}}_j + 0.2 \lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)\mat{I}_{p_j})^{1}$ instead, where $\lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)$ is the first (maximum) eigenvalue. Experiments showed that this regularization is usually only required in the first few iterations.




Furthermore, if the parameter space follows a more general setting as in \cref{thm:parammanifold}, updating may produces estimates outside the parameter space. A simple and efficient method is to project every updated estimate onto the corresponding manifold.


Furthermore, if the parameter space follows a more general setting as in \cref{thm:parammanifold}, updating may produce estimates outside the parameter space. A simple and efficient method is to project every updated estimate onto the corresponding manifold.




A standard method to compute the MLE of a Kronecker product is blockcoordinate descent, also


referred to as the ``flipflop algorithm.'' This algorithm was proposed independently by \textcite{MardiaGoodall1993} and \textcite{Dutilleul1999} and was later called ``flipflop'' algorithm by \textcite{LuZimmerman2005} for the computation of the maximum likelihood estimators of the components of a separable covariance matrix. \textcite{ManceurDutilleul2013} extended the ``flipflop'' algorithm for the computation of the MLE of the separable covariance structure of a 3way and 4way normal distribution and obtained a lower bound for the sample size required for its existence. The same issue was also studied by \textcite{DrtonEtAl2020} in the case of a twoway array (matrix). Our algorithm uses a


similar “flipflop” approach by iteratively updating the $\mat{\beta}_k$'s and $\mat{\Omega}_k$'s, one after the other.






\subsubsection{Matrix and Tensor Normal}




The tensor normal model is the extension of the matrix normal to tensorvalued random variables and a member of the quadratic exponential family \eqref{eq:quadraticexpfam} under \eqref{eq:eta2}. \textcite{Dawid1981,Arnold1981} introduced the term matrix normal and, in particular, \textcite{Arnold1981} provided several theoretical results, such as its density, moments and conditional distributions of its components. The matrix normal distribution is a bilinear normal distribution; a distribution of a twoway array, each component


representing a vector of observations \parencite{OhlsonEtAl2013}. \textcite{KolloVonRosen2005,Hoff2011,OhlsonEtAl2013} presented the extension of the bilinear to the multilinear normal distribution, what we call tensor normal in this paper, using a parallel extension of bilinear matrices to multilinear tensors \parencite{Comon2009}.




The defining feature of the matrix normal distribution, and its tensor extension, is the Kronecker product structure of its covariance. This formulation, where the covariates are multivariate normal with multiway covariance structure modeled as a Kronecker


product of matrices of much lower dimension, aims to overcome the significant modeling and computational challenges arising from the


high computational complexity of manipulating tensor representations \parencite[see, e.g.,][]{HillarLim2013,WangEtAl2022}.




Multilinear tensor normal %Kroneckerseparable covariance


models have been used in various applications, including


medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatiotemporal data analysis \parencite{GreenewaldHero2014}, regression analysis


for longitudinal relational data \parencite{Hoff2015}.


%, radar [AFJ10], and multipleinputmultipleoutput (MIMO) communications [WJS08].


One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI) \parencite{OhlsonEtAl2013}.


A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.


%The first occurrence of the \textit{matrix normal} we found, even though not explicitly called as such, was in \textcite{SrivastavaKhatri1979}.








%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsection{Ising Model}\label{sec:ising_estimation}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


The Ising\footnote{Also known as the \emph{LenzIsing} model as the physical assumptions of the model where developed by both Lenz and Ising \parencite{Niss2005}. Ising gave a closed form solution for the 1dimensional lattice, that is, a linear chain \parencite{Ising1925}.} model \parencite{Lenz1920,Ising1925,Niss2005} is a mathematical model originating in statistical physics to study ferromagnetism in a thermodynamic setting. It describes magentic dipoles (atomic ``spins'') which can take two states ($\pm 1$) while allowing twoway interactions between direct neighbours on a lattice, a discrete grid. The model assumes all elementary magnets to be the same, which translates to all having the same coupling strength (twoway interactions) governed by a single parameter relating to the temperature of the system. Nowadays, the Ising model, in its general form, allows for different coupling strength for every (symmetric) interaction as well as an external magnetic field acting on every magnetic dipole separately. A modern review is given by \textcite{NguyenEtAl2017}.


The Ising\footnote{Also known as the \emph{LenzIsing} model as the physical assumptions of the model where developed by both Lenz and Ising \parencite{Niss2005}. Ising gave a closed form solution for the 1dimensional lattice, that is, a linear chain \parencite{Ising1925}.} model \parencite{Lenz1920,Ising1925,Niss2005} is a mathematical model originating in statistical physics to study ferromagnetism in a thermodynamic setting. It describes magentic dipoles (atomic ``spins'') which can take two states ($\pm 1$) while allowing twoway interactions between direct neighbours on a lattice, a discrete grid. The model assumes all elementary magnets to be the same, which translates to all having the same coupling strength (twoway interactions) governed by a single parameter relating to the temperature of the system. Nowadays, the Ising model, in its general form, allows for different coupling strength for every (symmetric) interaction as well as an external magnetic field acting on every magnetic dipole separately. A review is given by \textcite{NguyenEtAl2017}.




In statistics, the Ising model is used to model multivariate binary data. That is, the states are ${0, 1}$ instead of $\pm 1$. It is related to a multitude of other models; \emph{Graphical Models} and \emph{Markov Random Fields} to describe conditional dependence \parencite{Lauritzen1996,WainwrightJordan2008,LauritzenRichardson2002}, \emph{Potts models} \parencite{Besag1974,ChakrabortyEtAl2022} which generalize the Ising model to multiple states, the \emph{multivariate Bernoulli distribution} \parencite{Whittaker1990,JohnsonEtAl1997,DaiDingWahba2013} considering also interactions (treeway and higher), to give the most prominent.




The $p$dimensional Ising model is a discrete probability distribution on the set of $p$dimensional binary vectors $\mat{x}\in\{0, 1\}^p$ with pmf given by


The $p$dimensional Ising model is a discrete probability distribution on the set of $p$dimensional binary vectors $\mat{x}\in\{0, 1\}^p$ with probability mass function (pmf) given by


\begin{displaymath}


P_{\mat{\gamma}}(\mat{x}) = p_0(\mat{\gamma})\exp(\t{\vech(\mat{x}\t{\mat{x}})}\mat{\gamma}).


\end{displaymath}



@ 624,20 +634,20 @@ The scaling factor $p_0(\mat{\gamma})\in\mathbb{R}_{+}$ ensures that $P_{\mat{\g


\begin{equation}\label{eq:isingpartitionfunction}


p_0(\mat{\gamma})^{1} = \sum_{\mat{x}\in\{0, 1\}^p}\exp(\t{\vech(\mat{x}\t{\mat{x}})}\mat{\gamma}).


\end{equation}


By an abuse of notation, we let $\mat{\gamma}_{j l}$ denote the element of $\mat{\gamma}$ corresponding to $\mat{x}_j\mat{x}_l$ in $\vech(\mat{x}\t{\mat{x}})$\footnote{Specifically, the element $\mat{\gamma}_{j l}$ of $\mat{\gamma}$ is a short hand for $\mat{\gamma}_{\iota(j, l)}$ with $\iota(j, l) = (\min(j, l)  1)(2 p  \min(j, l)) / 2 + \max(j, l)$ mapping the matrix row index $j$ and column index $l$ to the corresponding half vectorization indices $\iota(j, l)$.}. The ``diagonal'' parameter $\mat{\gamma}_{j j}$ expresses the conditional log odds of $X_j = 1\mid X_{j} = \mat{0}$, where the negative subscript in $X_{j}$ describes the $p  1$ dimensional vector $X$ with the $j$th element removed. The off diagonal entries $\mat{\gamma}_{j l}$, $j\neq l$, are equal to the conditional log odds of simultaneous occurrence $X_j = 1, X_l = 1 \mid X_{j, l} = \mat{0}$. More precisely, the conditional probabilities and the natural parameters are related via


Abusing notation, we let $\mat{\gamma}_{j l}$ denote the element of $\mat{\gamma}$ corresponding to $\mat{x}_j\mat{x}_l$ in $\vech(\mat{x}\t{\mat{x}})$\footnote{Specifically, the element $\mat{\gamma}_{j l}$ of $\mat{\gamma}$ is a short hand for $\mat{\gamma}_{\iota(j, l)}$ with $\iota(j, l) = (\min(j, l)  1)(2 p  \min(j, l)) / 2 + \max(j, l)$ mapping the matrix row index $j$ and column index $l$ to the corresponding half vectorization indices $\iota(j, l)$.}. The ``diagonal'' parameter $\mat{\gamma}_{j j}$ expresses the conditional log odds of $X_j = 1\mid X_{j} = \mat{0}$, where the negative subscript in $X_{j}$ describes the $p  1$ dimensional vector $X$ with the $j$th element removed. The off diagonal entries $\mat{\gamma}_{j l}$, $j\neq l$, are equal to the conditional log odds of simultaneous occurrence $X_j = 1, X_l = 1 \mid X_{j, l} = \mat{0}$. More precisely, the conditional probabilities and the natural parameters are related via


\begin{align}


\mat{\gamma}_{j j} &= \log\frac{P_{\mat{\gamma}}(X_j = 1\mid X_{j} = \mat{0})}{1  P_{\mat{\gamma}}(X_j = 1\mid X_{j} = \mat{0})}, \nonumber \\


\mat{\gamma}_{j l} &= \log\frac{1  P_{\mat{\gamma}}(X_j = 1\mid X_{j} = \mat{0})P_{\mat{\gamma}}(X_l = 1\mid X_{l} = \mat{0})}{P_{\mat{\gamma}}(X_j = 1\mid X_{j} = \mat{0})P_{\mat{\gamma}}(X_l = 1\mid X_{l} = \mat{0})}\frac{P_{\mat{\gamma}}(X_j = 1, X_l = 1\mid X_{j, l} = \mat{0})}{1  P_{\mat{\gamma}}(X_j = 1, X_l = 1\mid X_{j, l} = \mat{0})} \label{eq:isingtwowaylogodds}.


\end{align}


Conditional Ising models, incorporating the information of covariates $Y$ into the model, were considered by \textcite{ChengEtAl2014,BuraEtAl2022}. The direct way is to parameterize $\mat{\gamma} = \mat{\gamma}_y$ by the covariate $Y = y$ to model a conditional distribution $P_{\mat{\gamma}_y}(\mat{x}\mid Y = y)$.




We extend the conditional pmf by allowing the binary variables to be tensor valued; that is, we set $\mat{x} = \vec{\ten{X}}$, with dimension $p = \prod_{k = 1}^{r}p_k$ for $\ten{X}\in\{ 0, 1 \}^{p_1\times\cdots\times p_r}$. The tensor structure of $\ten{X}$ is accommodated by assuming Kronecker product constraints to the parameter vector $\mat{\gamma}_y$ in a similar fashion as for the tensor normal model. This means that we compare the pmf $P_{\mat{\gamma}_y}(\vec{\ten{X}}  Y = y)$ with the quadratic exponential family \eqref{eq:quadraticexpfam} with the natural parameters modeled by \eqref{eq:eta1} and \eqref{eq:eta2}. A detail to be considered is that the diagonal of $(\vec{\ten{X}})\t{(\vec{\ten{X}})}$ is equal to $\vec{\ten{X}}$. This gives the GMLM model as


We extend the conditional pmf by allowing the binary variables to be tensor valued; that is, we set $\mat{x} = \vec{\ten{X}}$, with dimension $p = \prod_{k = 1}^{r}p_k$ for $\ten{X}\in\{ 0, 1 \}^{p_1\times\cdots\times p_r}$. The tensor structure of $\ten{X}$ is accommodated by assuming Kronecker product constraints to the parameter vector $\mat{\gamma}_y$ in a similar fashion as for the tensor normal model. This means that we compare the pmf $P_{\mat{\gamma}_y}(\vec{\ten{X}}  Y = y)$ with the quadratic exponential family \eqref{eq:quadraticexpfam} with the natural parameters modeled by \eqref{eq:eta1} and \eqref{eq:eta2}. A detail to be considered is that the diagonal of $(\vec{\ten{X}})\t{(\vec{\ten{X}})}$ is equal to $\vec{\ten{X}}$, which results in the GMLM being expressed as


\begin{align}


P_{\mat{\gamma}_y}(\ten{X} \mid Y = y)


&= p_0(\mat{\gamma}_y)\exp(\t{\vech((\vec{\ten{X}})\t{(\vec{\ten{X}})})}\mat{\gamma}_y) \nonumber \\


&= p_0(\mat{\gamma}_y)\exp\Bigl(\Bigl\langle \ten{X}, \ten{F}_y\mlm_{k = 1}^{r}\mat{\beta}_k \Bigr\rangle + \Bigl\langle\ten{X}\mlm_{k = 1}^{r}\mat{\Omega}_k, \ten{X}\Bigr\rangle\Bigr)\label{eq:isingcondprob}


\end{align}


where we set $\overline{\ten{\eta}} = 0$ and $\mat{T}_2$ to the identity. This imposes an additional constraint to the model, the reason is that the diagonal elements of $\mat{\Omega} = \bigkron_{k = r}^{1}\mat{\Omega}_k$ take the role of $\overline{\ten{\eta}}$, although not fully. Having the diagonal of $\mat{\Omega}$ and $\overline{\ten{\eta}}$ handling the self interaction effects might lead to interference in the optimization routine. Another approach would be to use the $\mat{T}_2$ matrix to set the corresponding diagonal elements of $\mat{\Omega}$ to zero and let $\overline{\ten{\eta}}$ handle the self interaction effect. All of those approaches, namely setting $\overline{\ten{\eta}} = 0$, keeping $\overline{\ten{\eta}}$ or using $\mat{T}_2$, are theoretically solid and compatible with \cref{thm:grad,thm:parammanifold,thm:asymptoticnormalitygmlm}, assuming all axis dimensions $p_k$ are nondegenerate, that is $p_k > 1$ for all $k = 1, \ldots, r$. Regardless, under our modeling choice the relation between the natural parameters $\mat{\gamma}_y$ of the conditional Ising model and the GMLM parameters $\mat{\beta}_k$ and $\mat{\Omega}_k$ is


where we set $\overline{\ten{\eta}} = 0$ and $\mat{T}_2$ to the identity. This imposes an additional constraint to the model, the reason is that the diagonal elements of $\mat{\Omega} = \bigkron_{k = r}^{1}\mat{\Omega}_k$ take the role of $\overline{\ten{\eta}}$, although not fully. Having the diagonal of $\mat{\Omega}$ and $\overline{\ten{\eta}}$ handling the self interaction effects might lead to interference in the optimization routine. Another approach would be to use the $\mat{T}_2$ matrix to set the corresponding diagonal elements of $\mat{\Omega}$ to zero and let $\overline{\ten{\eta}}$ handle the self interaction effect. All of these approaches, namely setting $\overline{\ten{\eta}} = 0$, keeping $\overline{\ten{\eta}}$ or using $\mat{T}_2$, are theoretically solid and compatible with \cref{thm:grad,thm:parammanifold,thm:asymptoticnormalitygmlm}, assuming all axis dimensions $p_k$ are nondegenerate, that is $p_k > 1$ for all $k = 1, \ldots, r$. Regardless, under our modeling choice, the relation between the natural parameters $\mat{\gamma}_y$ of the conditional Ising model and the GMLM parameters $\mat{\beta}_k$ and $\mat{\Omega}_k$ is


\begin{equation}\label{eq:isingnaturalparams}


% \t{\pinv{\mat{D}_p}}\mat{\gamma}_y


% = \vec(\mat{\Omega} + \diag(\mat{B}\vec{\ten{F}_y}))



@ 648,41 +658,41 @@ where we set $\overline{\ten{\eta}} = 0$ and $\mat{T}_2$ to the identity. This i


\end{equation}


In contract to the tensor normal GMLM, the matrices $\mat{\Omega}_k$ are only required to be symmetric. More specifically, we require $\mat{\Omega}_k$, for $k = 1, \ldots, r$, to be elements of an embedded submanifold of $\SymMat{p_k}$ (see: \cref{sec:kronmanifolds,sec:matrixmanifolds}). The mode wise reduction matrices $\mat{\beta}_k$ are elements of an embedded submanifold of $\mathbb{R}^{p_k\times q_k}$. Common choices are listed in \cref{sec:matrixmanifolds}.




To solve the optimization problem \eqref{eq:mle}, given a data set $(\ten{X}_i, y_i)$, for $i = 1, \ldots, n$, we use a variation of gradient descent.


To solve the optimization problem \eqref{eq:mle}, given a data set $(\ten{X}_i, y_i)$, $i = 1, \ldots, n$, we use a variation of gradient descent.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsubsection{Initial Values}




The first step is to get reasonable starting values. Experiments showed that a good starting value of $\mat{\beta}_k$, for $k = 1, \ldots, r$, it to use the tensor normal estimates from \cref{sec:tensornormalestimation}, considering $\ten{X}_i$ as continuous. For initial values of $\mat{\Omega}_k$, a different approach is required. Setting everything to the uninformed initial value, that is $\mat{\Omega}_k = \mat{0}$ as this corresponds to the conditional log odds to be $1:1$ for every component and pairwaide interaction. This is not possible, since $\mat{0}$ is a stationary point of the loglikelihood. This is directly observed by taking a look at the partial gradients of the loglikelihood in \cref{thm:grad}. Instead, we use a crude heuristic which threads every mode seperately and ignores any relation to the covariates. It is computationaly cheap and better than any of the alternatives we considered. For every $k = 1, \ldots, r$, let the $k$'th mode second moment estimate be


The first step is to get reasonable starting values. Experiments showed that a good starting value of $\mat{\beta}_k$ is to use the tensor normal estimates from \cref{sec:tensornormalestimation} for $k = 1, \ldots, r$, considering $\ten{X}_i$ as continuous. For initial values of $\mat{\Omega}_k$, a different approach is required. Setting everything to the uninformed initial value, that is $\mat{\Omega}_k = \mat{0}$ as this corresponds to the conditional log odds to be $1:1$ for every component and pairwise interaction. This is not possible, since $\mat{0}$ is a stationary point of the loglikelihood. This is directly observed by considering the partial gradients of the loglikelihood in \cref{thm:grad}. Instead, we use a crude heuristic which threads every mode seperately and ignores any relation to the covariates. It is computationaly cheap and better than any of the alternatives we considered. For every $k = 1, \ldots, r$, let the $k$th mode second moment estimate be


\begin{equation}\label{eq:isingmodemoments}


\hat{\mat{M}}_{2(k)} = \frac{p_k}{n p}\sum_{i = 1}^n (\ten{X}_i)_{(k)}\t{(\ten{X}_i)_{(k)}}


\end{equation}


which contains the $k$'th mode first moment estimate in its diagonal $\hat{\mat{M}}_{1(k)} = \diag\hat{\mat{M}}_{2(k)}$. Considering every column of the matricized observation $(\ten{X}_i)_{(k)}$ as a $p_k$ dimensional observation itself. The number of those artificially generated observations is $n \prod_{j\neq k}p_j$. Let $Z_k$ denote the random variable those artificial observations are realization of. Then, we can interpret the elements $(\hat{\mat{M}}_{1(k)})_{j}$ as the estimates of the probability $P((Z_k)_j = 1)$, that is the marginal probability of the $j$th element of $Z_k$ being $1$. Similar, for $l \neq j$ we have $(\hat{\mat{M}}_{2(k)})_{j l}$ estimating $P((Z_k)_j = 1, (Z_k)_l = 1)$, the marginal probability of twoway interactions. % Without any regard of accuracy ...


Now, we set the diagonal elements of $\mat{\Omega}_k$ to zero. For the off diagonal elements of $\mat{\Omega}_k$, we equate the conditional probabilities $P((Z_k)_j = 1 \mid (Z_k)_{j} = \mat{0})$ and $P((Z_k)_j = 1, (Z_k)_l = 1\mid (Z_k)_{j, l} = \mat{0})$ with the marginal probability estimates $(\hat{\mat{M}}_{1(k)})_{j}$ and $(\hat{\mat{M}}_{2(k)})_{j l}$, respectively. Use \eqref{eq:isingtwowaylogodds} then gives the initial estimates $\hat{\mat{\Omega}}_k^{(0)}$, with $j \neq l$ component wise


which contains the $k$th mode first moment estimate in its diagonal $\hat{\mat{M}}_{1(k)} = \diag\hat{\mat{M}}_{2(k)}$. Considering every column of the matricized observation $(\ten{X}_i)_{(k)}$ as a $p_k$ dimensional observation. The number of those artificially generated observations is $n \prod_{j\neq k}p_j$. Let $Z_k$ denote the random variable those artificial observations are realization of. Then, we can interpret the elements $(\hat{\mat{M}}_{1(k)})_{j}$ as the estimates of the probability $P((Z_k)_j = 1)$, that is the marginal probability of the $j$th element of $Z_k$ being $1$. Similar, for $l \neq j$ we have $(\hat{\mat{M}}_{2(k)})_{j l}$ estimating $P((Z_k)_j = 1, (Z_k)_l = 1)$, the marginal probability of twoway interactions. % Without any regard of accuracy ...


Now, we set the diagonal elements of $\mat{\Omega}_k$ to zero. For the off diagonal elements of $\mat{\Omega}_k$, we equate the conditional probabilities $P((Z_k)_j = 1 \mid (Z_k)_{j} = \mat{0})$ and $P((Z_k)_j = 1, (Z_k)_l = 1\mid (Z_k)_{j, l} = \mat{0})$ with the marginal probability estimates $(\hat{\mat{M}}_{1(k)})_{j}$ and $(\hat{\mat{M}}_{2(k)})_{j l}$, respectively. Applying \eqref{eq:isingtwowaylogodds} then gives the initial componentwise estimates $\hat{\mat{\Omega}}_k^{(0)}$,


\begin{equation}\label{eq:isinginitOmegas}


(\hat{\mat{\Omega}}_k^{(0)})_{j j} = 0,


\qquad


(\hat{\mat{\Omega}}_k^{(0)})_{j l} = \log\frac{1  (\hat{\mat{M}}_{1(k)})_{j}(\hat{\mat{M}}_{1(k)})_{l}}{(\hat{\mat{M}}_{1(k)})_{j}(\hat{\mat{M}}_{1(k)})_{l}}\frac{(\hat{\mat{M}}_{2(k)})_{j l}}{1  (\hat{\mat{M}}_{2(k)})_{j l}}.


(\hat{\mat{\Omega}}_k^{(0)})_{j l} = \log\frac{1  (\hat{\mat{M}}_{1(k)})_{j}(\hat{\mat{M}}_{1(k)})_{l}}{(\hat{\mat{M}}_{1(k)})_{j}(\hat{\mat{M}}_{1(k)})_{l}}\frac{(\hat{\mat{M}}_{2(k)})_{j l}}{1  (\hat{\mat{M}}_{2(k)})_{j l}}, \, j \neq l.


\end{equation}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsubsection{Gradient Optimization}


Given initial values, the gradients provided by \cref{thm:grad} can be evaluated for the Ising model. The first step therefore is to determine the values of the inverse link components $\ten{g}_1(\mat{\gamma}_y) = \E[\ten{X} \mid Y = y]$ and $\ten{G}_2(\mat{\gamma}_y) = \ten{g}_2(\mat{\gamma}_y) = \E[\ten{X}\circ\ten{X} \mid Y = y]$. An immediate simplification is that the first moment is a part of the second moment. Its values are determined via $\vec(\E[\ten{X} \mid Y = y]) = \diag(\E[\ten{X}\circ\ten{X} \mid Y = y]_{(1, \ldots, r)})$. This means only the second moment needs to be computed, or estimated (see: \cref{sec:isingbiggerdim}) in the case of slightly bigger $p$. For the Ising model, the conditional second moment with parameters $\mat{\gamma}_y$ is given by the matricized relation


Given initial values, the gradients derived in \cref{thm:grad} can be evaluated for the Ising model. The first step therefore is to determine the values of the inverse link components $\ten{g}_1(\mat{\gamma}_y) = \E[\ten{X} \mid Y = y]$ and $\ten{G}_2(\mat{\gamma}_y) = \ten{g}_2(\mat{\gamma}_y) = \E[\ten{X}\circ\ten{X} \mid Y = y]$. An immediate simplification is that the first moment is a part of the second moment. Its values are determined via $\vec(\E[\ten{X} \mid Y = y]) = \diag(\E[\ten{X}\circ\ten{X} \mid Y = y]_{(1, \ldots, r)})$. This means only the second moment needs to be computed, or estimated (see: \cref{sec:isingbiggerdim}) in the case of slightly bigger $p$. For the Ising model, the conditional second moment with parameters $\mat{\gamma}_y$ is given by the matricized relation


\begin{equation}\label{eq:isingm2}


\ten{g}_2(\ten{\gamma}_y)_{(1, \ldots, r)} = \E[(\vec{\ten{X}})\t{(\vec{\ten{X}})}\mid Y = y] = p_0(\mat{\gamma}_y)\sum_{\mat{x}\in\{0, 1\}^{p}}\mat{x}\t{\mat{x}}\exp(\t{\vech(\mat{x}\t{\mat{x}})}\mat{\gamma}_y).


\ten{g}_2(\ten{\gamma}_y)_{(1, \ldots, r)} = \E\left[(\vec{\ten{X}})\t{(\vec{\ten{X}})}\mid Y = y\right] = p_0(\mat{\gamma}_y)\sum_{\mat{x}\in\{0, 1\}^{p}}\mat{x}\t{\mat{x}}\exp(\t{\vech(\mat{x}\t{\mat{x}})}\mat{\gamma}_y).


\end{equation}


The natural parameter $\mat{\gamma}_y$ is evaluated via \eqref{eq:isingnaturalparams} enabling us to compute the partial gradients of the loglikelihood $l_n$ \eqref{eq:loglikelihood} for the Ising model by \cref{thm:grad} for the GMLM parameters $\mat{\beta}_k$ and $\mat{\Omega}_k$, $k = 1, \ldots, r$, at the current iterate $\mat{\theta}^{(I)} = (\mat{\beta}_1^{(I)}, \ldots, \mat{\beta}_r^{(I)}, \mat{\Omega}_1^{(I)}, \ldots, \mat{\Omega}_r^{(I)})$. Using classic gradient ascent for maximizing the loglikelihood, we have to specify a learning rate $\lambda\in\mathbb{R}_{+}$, usualy something like $10^{3}$. The update rule is


The natural parameter $\mat{\gamma}_y$ is evaluated via \eqref{eq:isingnaturalparams} enabling us to compute the partial gradients of the loglikelihood $l_n$ \eqref{eq:loglikelihood} for the Ising model by \cref{thm:grad} for the GMLM parameters $\mat{\beta}_k$ and $\mat{\Omega}_k$, $k = 1, \ldots, r$, at the current iterate $\mat{\theta}^{(I)} = (\mat{\beta}_1^{(I)}, \ldots, \mat{\beta}_r^{(I)}, \mat{\Omega}_1^{(I)}, \ldots, \mat{\Omega}_r^{(I)})$. Using classic gradient ascent for maximizing the loglikelihood, we have to specify a learning rate $\lambda\in\mathbb{R}_{+}$, usually a value close to $10^{3}$. The update rule is


\begin{displaymath}


\mat{\theta}^{(I + 1)} = \mat{\theta}^{(I)} + \lambda\nabla_{\mat{\theta}} l_n(\mat{\theta})\bigr_{\mat{\theta} = \mat{\theta}^{(I)}},


\end{displaymath}


which is iterated till convergence. In practice, iteration is performed until ether a maximum number of iterations is exhausted and/or some break condition is satisfied. A proper choice of the learning rate is needed as a too large learning rate $\lambda$ causes instabilities, while a too low learning rate requires an enormous amount of iterations. Generically, there are two approach against the need to determine a proper learning rate. First, \emph{line search methods} determin an appropriate step size for every iteration. This works great if the evaluation of the object function (the loglikelihood) is cheap. This is not the case in our setting, see \cref{sec:isingbiggerdim}. The second approach is an \emph{adaptive learning rate}. The basic idea is to track specific statistics while optimizing and dynamically adapt the leaning rate via well tested heuristics using the gathered knowledge from past iterations. We opted to use an adaptive leaning rate approach, this not only removes the need to determine an appropriate leaning rate, but also accelerates learning.


which is iterated till convergence. In practice, iteration is performed until either a maximum number of iterations is exhausted and/or some break condition is satisfied. A proper choice of the learning rate is needed as a large learning rate $\lambda$ may cause instability, while a very low learning rate requires an enormous amount of iterations. Generically, there are two approaches to avoid the need to determine a proper learning rate. First, \emph{line search methods} determine an appropriate step size for every iteration. This works well if the evaluation of the object function (the loglikelihood) is cheap. This is not the case in our setting, see \cref{sec:isingbiggerdim}. The second approach is an \emph{adaptive learning rate}, where one tracks specific statistics while optimizing and dynamically adapting the learning rate via welltested heuristics using the gathered knowledge from past iterations. We opted to use an adaptive learning rate approach, which not only removes the need to determine an appropriate learning rate, but also accelerates learning.




Our method of choice is RMSprop, which stands for \emph{root mean squared propagation} \textcite{Hinton2012}. This is a well known method in machine learning for training neural networks. It is a variation of gradient descent with a per scalar parameter adaptive learning rate. It tracks a moving average of the element wise squared gradient $\mat{g}_2^{(I)}$, which is then used to scale (element wise) the gradient in the update rule. See \textcite{Hinton2012,GoodfellowEtAl2016} among others. The update rule using RMSprop for maximization\footnote{Instead of the more common minimization, therefore $+$ in the update of $\mat{\theta}$.} is


Our method of choice is \emph{root mean squared propagation} (RMSprop) \parencite{Hinton2012}. This is a well known method in machine learning for training neural networks. It is a variation of gradient descent with a per scalar parameter adaptive learning rate. It tracks a moving average of the element wise squared gradient $\mat{g}_2^{(I)}$, which is then used to scale (element wise) the gradient in the update rule. See \textcite{Hinton2012,GoodfellowEtAl2016} among others. The update rule using RMSprop for maximization\footnote{Instead of the more common minimization, therefore $+$ in the update of $\mat{\theta}$.} is


\begin{align*}


\mat{g}_2^{(I + 1)} &= \nu \mat{g}_2^{(I)} + (1  \nu)\nabla l_n(\mat{\theta}^{(I)})\odot\nabla l_n(\mat{\theta}^{(I)}), \\


\mat{\theta}^{(I + 1)} &= \mat{\theta}^{(I)} + \frac{\lambda}{\sqrt{\mat{g}_2^{(I + 1)}} + \epsilon}\odot\nabla l_n(\mat{\theta}^{(I)}).


\end{align*}


The parameters $\nu = 0.9$, $\lambda = 10^{3}$ and $\epsilon\approx 1.49\cdot 10^{8}$ are fixed. The initial value of $\mat{g}_2^{(0)} = \mat{0}$, the symbol $\odot$ denotes the Hadamard product, that is the element wise multiplication. The division and square root operation are performed element wise as well. According to our experiments, RMSprop requires in the range of $50$ till $1000$ iterations till convergence while gradient ascent with a learning rate of $10^{3}$ is in the range of $1000$ till $10000$.


The parameters $\nu = 0.9$, $\lambda = 10^{3}$ and $\epsilon\approx 1.49\cdot 10^{8}$ are fixed. The initial value of $\mat{g}_2^{(0)} = \mat{0}$, the symbol $\odot$ denotes the Hadamard product, or element wise multiplication. The division and square root operation are performed element wise as well. According to our experiments, RMSprop requires iterations in the range of $50$ till $1000$ till convergence while gradient ascent with a learning rate of $10^{3}$ is in the range of $1000$ till $10000$.




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\subsubsection{Small Data Sets}\label{sec:isingsmalldatasets}



@ 1135,85 +1145,20 @@ Next, going over the PSQTs one by one, a few words about the preferred positions


The results of our analysis in the previous paragraph agree with the configuration of the chess board most associated with observed chess game outcomes. This arrangement also aligns with the understanding of human chess players of an average configuration at any moment during the game.




\section{Discussion}


We have addressed sufficient dimension reduction for tensor valued predictors for regression or classification problems. Proposing a generalized multilinear model modeling the inverse conditional distribution we provided a multilinear sufficient reduction with consistent and asymptotic normal parameters. Moreover, our ansatz for proving the asymptotic results required by leveraging manifolds as a basis for resolving the issue of unidentifiable parameters lead to an even more flexible modeling framework. This allows to build complex and potentially problem specific parameter spaces incorporating additional domain specific knownledge into the model.




An additional powerful extension of our model involves considering a sum of separable Kronecker predictors. This is motivated by the equivalence of a Kronecker product to a rank 1 tensor. By allowing a sum of a few separable Kronecker predictors, we remove the implicit rank 1 constraint. However, if this extension is to be applied to the SDR setting, as in this paper, it is crucial to ensure that the sum of Kronecker products forms a parameter manifold to apply our theory. While we anticipate that this approach can lead to intriguing and powerful models, there are certain details that need to be resolved first.




\todo{finish!}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\printbibliography[heading=bibintoc, title={References}]


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\appendix


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%




\section{Examples}\label{app:examples}




\begin{example}[Vectorization]\label{ex:vectorization}


Given a matrix


\begin{displaymath}


\mat{A} = \begin{pmatrix}


1 & 4 & 7 \\


2 & 5 & 8 \\


3 & 6 & 9


\end{pmatrix}


\end{displaymath}


its vectorization is $\vec{\mat{A}} = \t{(1, 2, 3, 4, 5, 6, 7, 8, 9)}$ and its half vectorization $\vech{\mat{A}} = \t{(1, 2, 3, 5, 6, 9)}$. Let a $\ten{A}$ be a tensor of dimension $3\times 3\times 3$ given by


\begin{displaymath}


\ten{A}_{:,:,1} = \begin{pmatrix}


1 & 4 & 7 \\


2 & 5 & 8 \\


3 & 6 & 9


\end{pmatrix},


\qquad


\ten{A}_{:,:,2} = \begin{pmatrix}


10 & 13 & 16 \\


11 & 14 & 17 \\


12 & 15 & 18


\end{pmatrix},


\qquad


\ten{A}_{:,:,3} = \begin{pmatrix}


19 & 22 & 25 \\


20 & 23 & 26 \\


21 & 24 & 27


\end{pmatrix}.


\end{displaymath}


Then the vectorization of $\ten{A}$ is given by


\begin{displaymath}


\vec{\ten{A}} = \t{(1, 2, 3, 4, ..., 26, 27)}\in\mathbb{R}^{27}.


\end{displaymath}


\end{example}




\begin{example}[Matricization]


Let $\ten{A}$ be the $3\times 4\times 2$ tensor given by


\begin{displaymath}


\ten{A}_{:,:,1} = \begin{pmatrix}


1 & 4 & 7 & 10 \\


2 & 5 & 8 & 11 \\


3 & 6 & 9 & 12


\end{pmatrix},


\ten{A}_{:,:,2} = \begin{pmatrix}


13 & 16 & 19 & 22 \\


14 & 17 & 20 & 23 \\


15 & 18 & 21 & 24


\end{pmatrix}.


\end{displaymath}


Its matricizations are then


\begin{gather*}


\ten{A}_{(1)} = \begin{pmatrix}


1 & 4 & 7 & 10 & 13 & 16 & 19 & 22 \\


2 & 5 & 8 & 11 & 14 & 17 & 20 & 23 \\


3 & 6 & 9 & 12 & 15 & 18 & 21 & 24


\end{pmatrix},


\qquad


\ten{A}_{(2)} = \begin{pmatrix}


1 & 2 & 3 & 13 & 14 & 15 \\


4 & 5 & 6 & 16 & 17 & 18 \\


7 & 8 & 9 & 19 & 20 & 21 \\


10 & 11 & 12 & 22 & 23 & 24


\end{pmatrix}, \\


\ten{A}_{(3)} = \begin{pmatrix}


1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\


13 & 14 & 15 & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 & 24


\end{pmatrix}.


\end{gather*}


\end{example}






%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Tensor Calculus and Multi Linear Algebra}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



@ 1776,9 +1721,77 @@ The following is a technical Lemma used in the proof of \cref{thm:asymptoticnor


\end{proof}






\section{Examples}\label{app:examples}




\begin{example}[Vectorization]\label{ex:vectorization}


Given a matrix


\begin{displaymath}


\mat{A} = \begin{pmatrix}


1 & 4 & 7 \\


2 & 5 & 8 \\


3 & 6 & 9


\end{pmatrix}


\end{displaymath}


its vectorization is $\vec{\mat{A}} = \t{(1, 2, 3, 4, 5, 6, 7, 8, 9)}$ and its half vectorization $\vech{\mat{A}} = \t{(1, 2, 3, 5, 6, 9)}$. Let a $\ten{A}$ be a tensor of dimension $3\times 3\times 3$ given by


\begin{displaymath}


\ten{A}_{:,:,1} = \begin{pmatrix}


1 & 4 & 7 \\


2 & 5 & 8 \\


3 & 6 & 9


\end{pmatrix},


\qquad


\ten{A}_{:,:,2} = \begin{pmatrix}


10 & 13 & 16 \\


11 & 14 & 17 \\


12 & 15 & 18


\end{pmatrix},


\qquad


\ten{A}_{:,:,3} = \begin{pmatrix}


19 & 22 & 25 \\


20 & 23 & 26 \\


21 & 24 & 27


\end{pmatrix}.


\end{displaymath}


Then the vectorization of $\ten{A}$ is given by


\begin{displaymath}


\vec{\ten{A}} = \t{(1, 2, 3, 4, ..., 26, 27)}\in\mathbb{R}^{27}.


\end{displaymath}


\end{example}




\begin{example}[Matricization]


Let $\ten{A}$ be the $3\times 4\times 2$ tensor given by


\begin{displaymath}


\ten{A}_{:,:,1} = \begin{pmatrix}


1 & 4 & 7 & 10 \\


2 & 5 & 8 & 11 \\


3 & 6 & 9 & 12


\end{pmatrix},


\ten{A}_{:,:,2} = \begin{pmatrix}


13 & 16 & 19 & 22 \\


14 & 17 & 20 & 23 \\


15 & 18 & 21 & 24


\end{pmatrix}.


\end{displaymath}


Its matricizations are then


\begin{gather*}


\ten{A}_{(1)} = \begin{pmatrix}


1 & 4 & 7 & 10 & 13 & 16 & 19 & 22 \\


2 & 5 & 8 & 11 & 14 & 17 & 20 & 23 \\


3 & 6 & 9 & 12 & 15 & 18 & 21 & 24


\end{pmatrix},


\qquad


\ten{A}_{(2)} = \begin{pmatrix}


1 & 2 & 3 & 13 & 14 & 15 \\


4 & 5 & 6 & 16 & 17 & 18 \\


7 & 8 & 9 & 19 & 20 & 21 \\


10 & 11 & 12 & 22 & 23 & 24


\end{pmatrix}, \\


\ten{A}_{(3)} = \begin{pmatrix}


1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 \\


13 & 14 & 15 & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 & 24


\end{pmatrix}.


\end{gather*}


\end{example}




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\printbibliography[heading=bibintoc, title={References}]


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%




\end{document}




Loading…
Reference in New Issue