sync

2024-03-28 14:17:40 +01:00 · 2024-03-28 14:17:40 +01:00 · 0cb7772132
commit 0cb7772132
parent db9c3f4794
1 changed files with 163 additions and 256 deletions
--- a/LaTeX/paper.tex
+++ b/LaTeX/paper.tex
@ -1,9 +1,11 @@
 \documentclass[a4paper, 10pt]{article}

+\usepackage{arxiv}              % Paper Style/Theme
+
 \usepackage[utf8]{inputenc}
 \usepackage[LSF, T1]{fontenc}
 \usepackage{chessfss}
-\usepackage{fullpage}
+% \usepackage{fullpage}
 \usepackage{amsmath, amssymb, amstext, amsthm, scalerel, bm, pifont}
 \usepackage[dvipsnames]{xcolor}         % dvipsnames loads more named colors
 \usepackage{graphicx}                   % colors, images and graphics
@ -40,18 +42,31 @@
 \setcounter{MaxMatrixCols}{20}  % Sets the max nr. of columns in AMSmath's matrix envs to 20

 % Document meta into
-\title{GMLM Paper}
-\author{Daniel Kapla}
+\title{Generalized Multi-Linear Models for Sufficient Dimension Reduction on Tensor Valued Predictors}
+\author{
+    \textbf{Daniel Kapla} \\ % \thanks{daniel.kapla@tuwien.ac.at} \\
+    Institute of Statistics and Mathematical Methods in Economics \\
+    Faculty of Mathematics and Geoinformation \\
+    TU Wien, Vienna, Austria
+    \And \\
+    \textbf{Efstathia Bura} \\ % \thanks{efstathia.bura@tuwien.ac.at} \\
+    \hspace{.2cm}
+    Institute of Statistics and Mathematical Methods in Economics \\
+    Faculty of Mathematics and Geoinformation \\
+    TU Wien, Vienna, Austria
+}
 \date{\today}
 % Set PDF title, author and creator.
 \AtBeginDocument{
    \hypersetup{
-        pdftitle = {GMLM Paper},
+        pdftitle = {GMLM SDR},
        pdfauthor = {Daniel Kapla},
        pdfcreator = {\pdftexbanner}
    }
 }

+
+
 % Bibliography resource(s)
 \addbibresource{main.bib}

@ -72,7 +87,7 @@
 \newtheorem{remark}{Remark}


-\crefalias{section}{appendix}   % ???
+\crefalias{section}{appendix}

 \crefname{condition}{Condition}{Conditions}
 \Crefname{condition}{Condition}{Conditions}
@ -146,8 +161,8 @@
 \newcommand{\SymPosDefMat}[1]{\mathrm{Sym}_{++}^{{#1}\times {#1}}}
 \newcommand{\SymDiagZeroMat}[1]{\mathrm{Sym}_{\diag=0}^{p\times p}}
 \newcommand{\SymBand}[2]{\mathrm{SymBand}^{{#1}\times {#1}}_{#2}}
-\newcommand{\UnitaryGrp}[1]{\mathrm{U}(#1)}
-\newcommand{\SpecialUnitaryGrp}[1]{\mathrm{SU}(#1)}
+\newcommand{\OrthogonalGrp}[1]{\mathrm{O}(#1)}
+\newcommand{\SpecialOrthogonalGrp}[1]{\mathrm{SO}(#1)}


 \newcommand{\todo}[1]{{\color{red}TODO: #1}}
@ -254,33 +269,33 @@

 \begin{document}

-%\tableofcontents
+\maketitle

-%------------------------------------------------------------------------------%
+\begin{abstract}
+    We consider regression or classification problems where the independent variable is matrix- or tensor valued. Modeling the inverse regression as a member of the quadratic exponential family, we derive a multilinear sufficient reduction for the regression or classification problem. Using manifold theory, we prove the consistency and asymptotic normality of the sufficient reduction. For continuous
+    tensor-valued predictors, we develop a computationally efficient estimation procedure of
+    their sufficient reductions, which is also applicable to situations where the dimension of
+    the reduction exceeds the available sample size. An estimation procedure for binary tensor-valued data is also provided. We conclude with simulations and real world data examples for both continuous and binary tensor-valued predictors.
+\end{abstract}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Introduction}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 In Statistics, tensors are a mathematical tool to represent data of complex structure. In this paper, \textit{tensors} are considered as a generalization of matrices to higher dimensions: A tensor is a multi-dimensional array of numbers. For example, a second-order tensor can be represented as a matrix, while a third-order tensor can be represented as a cube of matrices. 

-Complex data are collected at different times and/or under several conditions often involving a large number of multi-indexed variables represented as tensor-valued data \parencite{KoldaBader2009}. They occur in large-scale longitudinal studies \parencite[e.g.[]{Hoff2015}, in agricultural experiments and chemometrics and spectroscopy \parencite[e.g.[]{LeurgansRoss1992,Burdick1995}. Also, in signal and video processing where sensors produce multi-indexed data, e.g. over spatial, frequency, and temporal dimensions \parencite[e.g.]{DeLathauwerCastaing2007,KofidisRegalia2005}, in telecommunications \parencite{DeAlmeidaEtAl2007}.
-Examples of multiway data include 3D images of the brain, where the modes are the 3 spatial dimensions,
-and spatio-temporal weather imaging data, a set of image sequences represented as 2 spatial modes and 1 temporal mode.
+Complex data are collected at different times and/or under several conditions often involving a large number of multi-indexed variables represented as tensor-valued data \parencite{KoldaBader2009}. They occur in large-scale longitudinal studies \parencite[e.g.][]{Hoff2015}, in agricultural experiments and chemometrics and spectroscopy \parencite[e.g.][]{LeurgansRoss1992,Burdick1995}. Also, in signal and video processing where sensors produce multi-indexed data, e.g. over spatial, frequency, and temporal dimensions \parencite[e.g.][]{DeLathauwerCastaing2007,KofidisRegalia2005}, in telecommunications \parencite[e.g.][]{DeAlmeidaEtAl2007}. Other examples of multiway data include 3D images of the brain, where the modes are the 3 spatial dimensions, and spatio-temporal weather imaging data, a set of image sequences represented as 2 spatial modes and 1 temporal mode.

-%In fact, it will be shown that scientific research cannot do without multiway analysis, even though not everybody knows this yet.
-
-
-
-% A nice work, particularly focusing on the need to go from matrix-variate to tensor-based MLN distribution, is given in \textcite{BasserPajevic2003}. They genuinely argue why a vectorial treatment of a complex data set which actually needs a tensorial treatment and the application of multilinear normality, can lead to wrong or inefficient conclusions. For some more relevant work in the same direction, see \textcite{BasserPajevic2000,BasserPajevic2007,LeBihanEtAl2001}, and the references cited therein, whereas a Bayesian perspective is given in \textcite{MartinFernandez2004}; see also \textcite{Hoff2011}. Analysis of multilinear, particularly trilinear data, has a specific attraction in chemometrics and spectroscopy; see for example \textcite{LeurgansRoss1992}, \textcite{Burdick1995}. Other areas of applications include signal processing \textcite{KofidisRegalia2005}, morphometry \textcite{LeporeEtAl2008}, geostatistics \textcite{LiuKoike2007}, and statistical mechanics \textcite{Soize2008}, to mention a few. The extensive use of tensor variate analysis in these and other similar fields has generated a special tensorial nomenclature, for example diffusion tensor, dyadic tensor, stress and strain tensors etc. \textcite{McCullagh1987}. Similarly, special tensorial decompositions, for example PARAFAC and Tucker decompositions \textcite{Kroonenberg2008}, have been developed; for a general comprehensive review of tensor decompositions and their various applications, see \textcite{Comon2002,KoldaBader2009,ShashuaHazan2005}.
-
-%From \textcite{SongHero2023}:
-%Modern datasets and their associated probabilistic models often exhibit a multiway nature, involving a massive number of multi-indexed variables represented as a tensor \parencite{KoldaBader2009}. This structure poses significant challenges for data scientists due to the large number of variables (exponential with respect to the tensor order). For example, an order-3 tensor of size $d \times d \times d$ involves $d^3$ variables, and representing a real-valued tensor with $d \ge 10^4$ already requires more than 1TB of memory. As data tensors grow in size, storage and processing complexities for the large data arrays increase quadratically in relation to the data size.
+Multilinear tensor normal models have been used in various applications, including medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatio-temporal data analysis \parencite{GreenewaldHero2014}, regression analysis for longitudinal relational data \parencite{Hoff2015}. One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI) \parencite{OhlsonEtAl2013}. A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.

+In this paper we present a model-based \emph{Sufficient Dimension Reduction} (SDR) method for tensor-valued data with distribution in the quadratic exponential family assuming a Kronecker product structure of the first and second moment. By generalizing the parameter space to embedded manifolds we obtain consistency and asymtotic normality results while allowing great modeling flexibility in the linear sufficient dimension reduction.

+The structure of this paper is as follows. In \cref{sec:notation} we introduce our notation. \Cref{sec:problem-formulation} decribes the exact problem and in \cref{sec:gmlm-model} we introduce our model. Continuing in \cref{sec:ml-estimation} we provide the basis for a general maximum likelihood estimation procedure and derive specialized methods for tensor normal as well as the tensor Ising distributions. \Cref{sec:manifolds} gives a short introduction into manifolds and provides the basis for applying the consistency and asymtotic normality results from \cref{sec:asymtotics}. Simulations for continuous and binary predicotrs are subject of \cref{sec:simulations}. Finally, in \cref{sec:data-analysis} we apply our model to EEG data and perform a prove of concept data analysis example where a chess board is interprated as a collection of binary $8\times 8$ matrices.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Notation}
+\section{Notation}\label{sec:notation}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%We  introduce the notation throughout the paper. 
-$\ten{A} = (\ten{A}_{i_1, \ldots, i_r})\in\mathbb{R}^{q_1\times \ldots\times q_r}$ denotes an order\footnote{Also referred to as rank, therefore the variable name $r$, but this term is \emph{not} used as it leads to confusion with the concept of rank of a matrix.} $r$ tensor, where $r\in\mathbb{N}$ is the number of modes or axes (dimensions) of $\ten{A}$ and $\ten{A}_{i_1,...,i_r} \in \mathbb{R}$ is its $(i_1, \ldots, i_r)$th entry. For example, a $p \times q$ matrix  $\mat{B}$ has two modes, the rows and columns.  For matrices $\mat{B}_k\in\mathbb{R}^{p_k\times q_k}$, $k\in[r] = \{1, 2, \ldots, r\}$, the \emph{multi-linear multiplication}, or \emph{Tucker operator} \parencite{Kolda2006}, is defined element wise as
+Let $\ten{A}\in\mathbb{R}^{q_1\times \ldots\times q_r}$ denotes an order\footnote{Also referred to as rank, therefore the variable name $r$, but this term is \emph{not} used as it leads to confusion with the concept of rank of a matrix.} $r$ tensor, where $r\in\mathbb{N}$ is the number of modes or axes (dimensions) of $\ten{A}$ and $\ten{A}_{i_1,...,i_r} \in \mathbb{R}$ is its $(i_1, \ldots, i_r)$th entry. For example, a $p \times q$ matrix  $\mat{B}$ has two modes, the rows and columns.  For matrices $\mat{B}_k\in\mathbb{R}^{p_k\times q_k}$, $k\in[r] = \{1, 2, \ldots, r\}$, the \emph{multi-linear multiplication}, or \emph{Tucker operator} \parencite{Kolda2006}, is defined element wise as
 \begin{displaymath}
    (\ten{A}\times\{\mat{B}_1, \ldots, \mat{B}_r\})_{j_1, \ldots, j_r} = \sum_{i_1, \ldots, i_r = 1}^{q_1, \ldots, q_r} \ten{A}_{i_1, \ldots, i_r}(\mat{B}_{1})_{j_1, i_1} \cdots (\mat{B}_{r})_{j_r, i_r}
 \end{displaymath}
@ -288,48 +303,52 @@ which results in an order $r$ tensor of dimension $p_1\times ...\times p_k$. The
 \begin{displaymath}
    \ten{A}\times_k\mat{B}_k = \ten{A}\times\{\mat{I}_{q_1}, \ldots, \mat{I}_{q_{k-1}}, \mat{B}_{k}, \mat{I}_{q_{k+1}}, \ldots, \mat{I}_{q_r}\}.
 \end{displaymath}
-The notation $\ten{A}\mlm_{k\in S}\mat{B}_k$ is short hand for the iterative application of the mode product for all indices in $S\subseteq[r]$. For example $\ten{A}\times_{k\in\{2, 5\}}\mat{B}_k = \ten{A}\times_2\mat{B}_2\times_5\mat{B}_5$. By only allowing $S$ to be a set, this notation is unambiguous because the mode product commutes for different modes; i.e., $\ten{A}\times_j\mat{B}_j\times_k\mat{B}_k = \ten{A}\times_k\mat{B}_k\times_j\mat{B}_j$ for $j\neq k$.
+The notation $\ten{A}\mlm_{k\in S}\mat{B}_k$ is short hand for the iterative application of the mode product for all indices in $S\subseteq[r]$. For example $\ten{A}\mlm_{k\in\{2, 5\}}\mat{B}_k = \ten{A}\times_2\mat{B}_2\times_5\mat{B}_5$. By only allowing $S$ to be a set, this notation is unambiguous because the mode product commutes for different modes; i.e., $\ten{A}\times_j\mat{B}_j\times_k\mat{B}_k = \ten{A}\times_k\mat{B}_k\times_j\mat{B}_j$ for $j\neq k$. For example, let $\mat{A}, \mat{B}_1, \mat{B}_2$ be matrices (of matching dimensions). The bilinear mode-product and multi-linear multiplication relate to the well known matrix-matrix multiplications as
+\begin{displaymath}
+    \mat{A}\times_1\mat{B}_1 = \mat{B}_1\mat{A}, \qquad
+    \mat{A}\times_2\mat{B}_2 = \mat{A}\t{\mat{B}_2}, \qquad
+    \mat{A}\mlm_{k = 1}^2\mat{B}_k = \mat{A}\mlm_{k \in \{1, 2\}}\mat{B}_k = \mat{B}_1\mat{A}\t{\mat{B}_2}.
+\end{displaymath}

 %Matrices and tensors can be \emph{vectorized} by the \emph{vectorization} operator $\vec$. 
-The operator $\vec$ maps an array to a vector. Specifically,  $\vec(\mat{B})$ stands for the $pq \times 1$ vector of the $p \times q$ matrix $\mat{B}$ resulting from stacking the columns of $\mat{B}$ one after the other.  For a tensor $\ten{A}= (\ten{A}_{i_1,\ldots,i_r})$ of order $r$ and dimensions $q_1, \ldots, q_r$, $\vec(\ten{A})$ is the $q_1 q_2 \ldots q_r \times 1$ vector with the elements of $\ten{A}$ stacked one after the other in the order $r$ then $r-1$, and so on. For example, if $\ten{A}$ is a 3-dimensional array, $\vec(\ten{A})=(\vec(\ten{A}(:,:,1))^T,\vec(\ten{A}(:,:,2)^T,\ldots,\vec(\ten{A}(:,:,q_r)^T)^T$. We use the notation $\ten{A}\equiv \ten{B}$ for objects $\ten{A}, \ten{B}$ of any shape if and only if $\vec(\ten{A}) = \vec(\ten{B})$.
+The operator $\vec$ maps an array to a vector. Specifically, $\vec(\mat{B})$ stands for the $pq \times 1$ vector of the $p \times q$ matrix $\mat{B}$ resulting from stacking the columns of $\mat{B}$ one after the other. For a tensor $\ten{A}$ of order $r$ and dimensions $q_1, \ldots, q_r$, $\vec(\ten{A})$ is the $q_1 q_2 \ldots q_r \times 1$ vector with the elements of $\ten{A}$ stacked one after the other in the order $r$ then $r-1$, and so on. For example, if $\ten{A}$ is a 3-dimensional array, $\vec(\ten{A})=\t{(\t{\vec(\ten{A}_{:,:,1})},\t{\vec(\ten{A}_{:,:,2})},\ldots,\t{\vec(\ten{A}_{:,:,q_3})})}$. We use the notation $\ten{A}\equiv \ten{B}$ for objects $\ten{A}, \ten{B}$ of any shape if and only if $\vec(\ten{A}) = \vec(\ten{B})$.

 The \emph{inner product} between two tensors of the same order and dimensions is
 \begin{displaymath}
    \langle\ten{A}, \ten{B}\rangle = \sum_{i_1, \ldots, i_r} \ten{A}_{i_1, \ldots, i_r}\ten{B}_{i_1, \ldots, i_r}
 \end{displaymath}
-This leads to the definition of the \emph{Frobenius norm} for tensors, $\|\ten{A}\|_F = \sqrt{\langle\ten{A}, \ten{A}\rangle}$ and is the straightforward  extension of the Frobenius norm for matrices and vectors. %are also used for matrices while for a vector $\mat{a}$ the \emph{2 norm} is $\|\mat{a}\|_2 = \sqrt{\langle\mat{a}, \mat{a}\rangle}$.
-The \emph{outer product} between two tensors $\ten{A}$ of dimensions $q_1, \ldots, q_r$ and $\ten{B}$ of dimensions $p_1, \ldots, p_l$ is a tensor $\ten{A}\circ\ten{B}$ of order $r + l$ and dimensions $q_1, \ldots, q_r, p_1, \ldots, p_l$, such that
+This leads to the definition of the \emph{Frobenius norm} for tensors, $\|\ten{A}\|_F = \sqrt{\langle\ten{A}, \ten{A}\rangle}$ and is the straightforward  extension of the Frobenius norm for matrices and vectors. The \emph{outer product} between two tensors $\ten{A}$ of dimensions $q_1, \ldots, q_r$ and $\ten{B}$ of dimensions $p_1, \ldots, p_l$ is a tensor $\ten{A}\circ\ten{B}$ of order $r + l$ and dimensions $q_1, \ldots, q_r, p_1, \ldots, p_l$, such that
 \begin{displaymath}
    \ten{A}\circ\ten{B} \equiv (\vec\ten{A})\t{(\vec{\ten{B}})}.
 \end{displaymath}
-Let $\K : \mathbb{R}^{q_1, ..., q_{2 r}}\to\mathbb{R}^{q_1 q_{r + 1}, ..., q_r q_{2 r}}$ be defined element wise with indices $1\leq i_j + 1\leq q_j q_{r + j}$ for $j = 1, ..., r$ as
+Let $\ten{K} : \mathbb{R}^{q_1, ..., q_{2 r}}\to\mathbb{R}^{q_1 q_{r + 1}, ..., q_r q_{2 r}}$ be defined element wise with indices $1\leq i_j + 1\leq q_j q_{r + j}$ for $j = 1, ..., r$ as
 \begin{align*}
-    \K(\ten{A})_{i_1 + 1, ..., i_r + 1} &= \ten{A}_{\lfloor i_1 / q_{r + 1}\rfloor + 1, ..., \lfloor i_r / q_{2 r} \rfloor + 1, (i_1\operatorname{mod}q_{r + 1}) + 1, ..., (i_r\operatorname{mod}q_{2 r}) + 1}
+    \ten{K}(\ten{A})_{i_1 + 1, ..., i_r + 1} &= \ten{A}_{\lfloor i_1 / q_{r + 1}\rfloor + 1, ..., \lfloor i_r / q_{2 r} \rfloor + 1, (i_1\operatorname{mod}q_{r + 1}) + 1, ..., (i_r\operatorname{mod}q_{2 r}) + 1}
 \end{align*}
-where $\lfloor\,.\,\rfloor$ is the floor operation and $a\operatorname{mod}b$ is the integer division remainder of $a / b$. The mapping $\K$ is a linear operation and maps an order $2 r$ tensor to an order $r$ tensor by reshaping and permuting its elements. This operation allows  defining a generalization of the \emph{Kronecker product} to tensors, which we define as $\ten{A}\otimes\ten{B} = \K(\ten{A}\circ\ten{B})$.
+where $\lfloor\,.\,\rfloor$ is the floor operation and $a\operatorname{mod}b$ is the integer division remainder of $a / b$. The mapping $\ten{K}$ is a linear operation and maps an order $2 r$ tensor to an order $r$ tensor by reshaping and permuting its elements. This operation allows  defining a generalization of the \emph{Kronecker product} to tensors, which we define as $\ten{A}\otimes\ten{B} = \ten{K}(\ten{A}\circ\ten{B})$. The iterated outer and iterated Kronecker products are written as
+\begin{displaymath}
+    \bigouter_{k = 1}^{r} \mat{A}_k = \mat{A}_1\circ\ldots \circ\mat{A}_r,
+    \qquad
+    \bigkron_{k = 1}^{r} \mat{A}_k = \mat{A}_1\otimes\ldots \otimes\mat{A}_r
+\end{displaymath}
+with the order of iteration being important.

-For tensors of order at least $2$, the \emph{flattening} (or \emph{unfolding} or \emph{matricization}) is a reshaping of the tensor into a matrix along a particular mode. For a tensor $\ten{A}$ of order $r$ and dimensions $q_1, \ldots, q_r$, the $k$-mode unfolding $\ten{A}_{(k)}$ is a $q_k\times \prod_{l=1, l\neq k}q_l$ matrix with %For the tensor $\ten{A} = (a_{i_1,\ldots,i_r})\in\mathbb{R}^{q_1, \ldots, q_r}$ the 
-elements %of the $k$ unfolded tensor $\ten{A}_{(k)}$ are
+For tensors of order at least $2$, the \emph{flattening} (or \emph{unfolding} or \emph{matricization}) is a reshaping of the tensor into a matrix along a particular mode. For a tensor $\ten{A}$ of order $r$ and dimensions $q_1, \ldots, q_r$, the $k$-mode unfolding $\ten{A}_{(k)}$ is a $q_k\times \prod_{l=1, l\neq k}q_l$ matrix with
 \begin{displaymath}
    (\ten{A}_{(k)})_{i_k, j} = \ten{A}_{i_1, \ldots, i_r}\quad\text{ with }\quad j = 1 + \sum_{\substack{l = 1\\l \neq k}}^r (i_l - 1) \prod_{\substack{m = 1\\m\neq k}}^{l - 1}q_m.
 \end{displaymath}
-Illustration examples of vectorization and matricization are given in Appendix \ref{app:examples}. 
-% The rank of a tensor $\ten{A}$ of dimensions $q_1\times ...\times q_r$ is vector-valued; that is, $\rank(\ten{A}) = (a_1, \ldots, a_r)\in[q_1]\times...\times[q_r]$, where $a_k = \rank(\ten{A}_{(k)})$ is the usual matrix rank of the $k$ unfolded tensor.
+Illustration examples of vectorization and matricization are given in Appendix \ref{app:examples}.

 The gradient of a function $\ten{F}(\ten{X})$ of any shape, univariate, multivariate or tensor valued, with argument $\ten{X}$ of any shape is defined as the $p\times q$ matrix 
 \begin{displaymath}
    \nabla_{\ten{X}}\ten{F} = \frac{\partial\t{(\vec\ten{F}(\ten{X}))}}{\partial(\vec\ten{X})},
 \end{displaymath}
 where the vectorized quantities $\vec{\ten{X}}\in\mathbb{R}^p$ and $\vec\ten{F}(\ten{X})\in\mathbb{R}^q$. This is consistent with the gradient of a real-valued function $f(\mat{x})$ where $\mat{x}\in\mathbb{R}^p$ as $\nabla_{\mat{x}}f\in\mathbb{R}^{p\times 1}$ \parencite[ch. 15]{Harville1997}. 
-%\todo{Maybe reference magnus and abadir, magnus and neudecker?!}
-
-
-%\todo{I think the following examples are a good idea for the appendix.}



 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Problem Formulation}
+\section{Problem Formulation}\label{sec:problem-formulation}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Our goal is to infer the cumulative distribution function (cdf) $F$ of $Y\mid \ten{X}$, where $\ten{X}$ is assumed to admit $r$-tensor structure of dimension $p_1\times ... \times p_r$ with continuous or discrete entries and the response  $Y$ is unconstrained. The predictor $\ten{X}$ is a complex object; to simplify the problem we assume there exists a tensor valued  function of lower dimension $\ten{R}:\ten{X}\mapsto \ten{R}(\ten{X})$ such that 
 \begin{displaymath}
@ -351,12 +370,10 @@ f_{\mat{\eta}_y}(\ten{X}\mid Y = y)
    &= h(\ten{X})\exp(\t{\mat{\eta}_y}\mat{t}(\ten{X}) - b(\mat{\eta}_y)) \nonumber \\
    &= h(\ten{X})\exp(\langle \mat{t}_1(\ten{X}), \mat{\eta}_{1y} \rangle + \langle \mat{t}_2(\ten{X}), \mat{\eta}_{2y} \rangle - b(\mat{\eta}_{y})) \label{eq:quad-density}
 \end{align}
-where $\mat{t}_1(\ten{X})=\vec \ten{X}$ and $\mat{t}_2(\ten{X})$ is linear in $\ten{X}\circ\ten{X}$. The dependence of $\ten{X}$ on $Y$ is fully captured in the natural parameter $\mat{\eta}_y$. The function $h$ is non-negative real-valued and $b$ is assumed to be at least twice continuously differentiable and strictly convex. An important feature of the \emph{quadratic exponential family} is that the distribution of its members is fully characterized by their first two moments. Distributions within the quadratic exponential family include the \emph{tensor normal} and \emph{tensor Ising model} \todo{cite} (a generalization of the (inverse) Ising model which is multi-variate Bernoulli with up to second order interactions) and mixtures of these two. 
-
-The tensor normal distribution using the Tucker operator convention has appeared in several papers \parencite{ManceurDutilleul2013,Hoff2011,OhlsonEtAl2013} representation of tensors appeared in the doctoral thesis of Dutilleul 
+where $\mat{t}_1(\ten{X})=\vec \ten{X}$ and $\mat{t}_2(\ten{X})$ is linear in $\ten{X}\circ\ten{X}$. The dependence of $\ten{X}$ on $Y$ is fully captured in the natural parameter $\mat{\eta}_y$. The function $h$ is non-negative real-valued and $b$ is assumed to be at least twice continuously differentiable and strictly convex. An important feature of the \emph{quadratic exponential family} is that the distribution of its members is fully characterized by their first two moments. Distributions within the quadratic exponential family include the \emph{tensor normal} (\cref{sec:tensor-normal-estimation}) and \emph{tensor Ising model} (\cref{sec:ising_estimation}, a generalization of the (inverse) Ising model which is multi-variate Bernoulli with up to second order interactions) and mixtures of these two. 

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{The Generalized Multi-Linear Model}
+\section{The Generalized Multi-Linear Model}\label{sec:gmlm-model}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 In model \eqref{eq:quad-density}, the dependence of $\ten{X}$ and $Y$ is absorbed in $\mat{\eta}_y$, and $\mat{t}(\ten{X})$ is the minimal sufficient statistic for the \textit{pseudo}-parameter\footnote{$\mat{\eta}_y$ is a function of the response $Y$, thus it is not a parameter in the formal statistical sense. It is considered as a parameter when using the equivalence in \eqref{eq:inverse-regression-sdr} and view  $Y$ as a parameter as a device to derive the sufficient reduction from the inverse regression.} $\mat{\eta}_y = (\mat{\eta}_{1y}, \mat{\eta}_{2y})$ with
@ -396,8 +413,6 @@ where $\mat{\Omega}_j\in\mathbb{R}^{p_j\times p_j}$ are symmetric positive defin

 Equation \eqref{eq:eta2} is underdetermined since $\t{(\mat{T}_2\pinv{\mat{D}_p})}$ has full column rank $d < p^2$ (with a non-strict inequality if $\ten{X}$ is univariate) but $\mat{\eta}_2$ is uniquely determined given any $\mat{\Omega}$ as $\t{(\pinv{(\mat{T}_2\pinv{\mat{D}_p})})}$ has full row rank. We let $\mat{\xi} = (\vec{\overline{\ten{\eta}}}, \vec{\mat{B}}, \vech{\mat{\Omega}})$ be the unconstrained $p(p + 2 q + 3) / 2$-parameter vector and $\mat{\theta} = (\vec{\overline{\ten{\eta}}}, \vec{\mat{B}}, \vech{\mat{\Omega}})$ be the constrained parameter vector, where $\mat{B}=\bigotimes_{j = r}^{1}\mat{\beta}_j$ and $\mat{\Omega} = \bigotimes_{j = r}^{1}\mat{\Omega}_j$. We also let $\Xi$ and $\Theta$ denote the unconstrained and constrained parameter spaces, with $\mat{\xi}$ and $\mat{\theta}$  varying in $\Xi$ and $\Theta$, respectively. The parameter space $\Xi$ is an open subset of $\mathbb{R}^{p(p + 2 q + 3) / 2}$ so that \eqref{eq:quadratic-exp-fam} is a proper density. We relax the assumptions for $\mat{\beta}_k$ and $\mat{\Omega}_k$ later as a consequence of \cref{thm:param-manifold} in  \cref{sec:kron-manifolds}.

-% \todo{Maybe already here introduce the ``constraint'' set of $\Omega$'s allowed as $\{ \Omega\in\mathbb{R}_{++}^{p\times p} : \vec{\Omega} = \t{(\pinv{(\mat{T}_2\pinv{\mat{D}_p})}\mat{T}_2\pinv{\mat{D}_p})}\vec{\Omega} \}$}
-
 In a classical \emph{generalized linear model} (GLM), the link function connecting the natural parameters to the expectation of the sufficient statistic $\mat{\eta}_y = \mat{g}(\E[\mat{t}(\ten{X}) \mid Y = y])$ is invertible. Such a link may not exist in our setting, but for our purpose what we call the ``inverse'' link suffices. The ``inverse'' link $\widetilde{\mat{g}}$ exists as the natural parameters fully describe the distribution. As in the non-minimal formulation \eqref{eq:quadratic-exp-fam}, we define the ``inverse'' link through its tensor valued components
 \begin{align}
    \ten{g}_1(\mat{\eta}_y) &= \E[\ten{X} \mid Y = y], \label{eq:inv-link1}\\
@ -409,7 +424,6 @@ Under the quadratic exponential family model \eqref{eq:quadratic-exp-fam},  a su
    A sufficient reduction for the regression $Y\mid \ten{X}$ under the quadratic exponential family inverse regression model \eqref{eq:quadratic-exp-fam} with natural parameters \eqref{eq:eta1} and \eqref{eq:eta2} is given by
    \begin{align}\label{eq:sdr}
        \ten{R}(\ten{X})
-            % &= (\ten{X} - \E\ten{X})\times\{ \t{\mat{\beta}_1}, \ldots, \t{\mat{\beta}_r} \}.
            &= (\ten{X} - \E\ten{X})\mlm_{k = 1}^{r}\t{\mat{\beta}_j}.
    \end{align}
    The reduction \eqref{eq:sdr} is minimal if $\mat{\beta}_j$ are full rank for all $j=1,\ldots,r$.
@ -450,7 +464,7 @@ The reduction in vectorized form is $\vec\ten{R}(\ten{X})=\t{\mat{B}}\vec(\ten{X
 \end{example}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Maximum Likelihood Estimation}
+\section{Maximum Likelihood Estimation}\label{sec:ml-estimation}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Suppose $(\ten{X}_i, Y_i)$ are independently and identically distributed with joint cdf $F(\ten{X}, Y)$, for $i = 1, \ldots, n$. The empirical log-likelihood function of \eqref{eq:quadratic-exp-fam} under \eqref{eq:eta1} and \eqref{eq:eta2}, ignoring terms not depending on the parameters, is
 \begin{equation}\label{eq:log-likelihood}
@ -479,11 +493,17 @@ A straightforward and general method for parameter estimation is \emph{gradient
 If $\mat{T}_2$ is the identity matrix $\mat{I}_{p(p + 1) / 2}$, then $\ten{G}_2(\mat{\eta}_y) = \ten{g}_2(\mat{\eta}_y)$.
 \end{theorem}

-Although the general case of any GMLM model can be fitted via gradient descent using \cref{thm:grad}, this may be very inefficient. In \cref{thm:grad}, $\mat{T}_2$ can be used to introduce flexible second moment structures. For example, it allows modeling effects differently for predictor components, as described in \cref{sec:ising_estimation} after Eqn. \eqref{eq:ising-cond-prob}. In the remainder, we focus on  $\mat{T}_2$'s that are identity matrices. This approach simplifies the estimation algorithm and the speed of the numerical calculation in  the case of tensor normal predictors.  In the case of the tensor normal distribution, an iterative cyclic updating scheme is derived in \cref{sec:tensor_normal_estimation}, which  has much faster convergence, is stable and does not require hyper parameters, as will be discussed later. On the other hand, the Ising model does not allow such a scheme. There we need to use a gradient based method, which is the subject of \cref{sec:ising_estimation}.
+Although the general case of any GMLM model can be fitted via gradient descent using \cref{thm:grad}, this may be very inefficient. In \cref{thm:grad}, $\mat{T}_2$ can be used to introduce flexible second moment structures. For example, it allows modeling effects differently for predictor components, as described in \cref{sec:ising_estimation} after Eqn. \eqref{eq:ising-cond-prob}. In the remainder, we focus on  $\mat{T}_2$'s that are identity matrices. This approach simplifies the estimation algorithm and the speed of the numerical calculation in  the case of tensor normal predictors.  In the case of the tensor normal distribution, an iterative cyclic updating scheme is derived in \cref{sec:tensor-normal-estimation}, which  has much faster convergence, is stable and does not require hyper parameters, as will be discussed later. On the other hand, the Ising model does not allow such a scheme. There we need to use a gradient based method, which is the subject of \cref{sec:ising_estimation}.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{Tensor Normal}\label{sec:tensor_normal_estimation}
+\subsection{Tensor Normal}\label{sec:tensor-normal-estimation}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+The tensor normal, also known as the \emph{multilinear normal}, is the extension of the matrix normal to tensor-valued random variables and a member of the quadratic exponential family \eqref{eq:quadratic-exp-fam} under \eqref{eq:eta2}. \textcite{Dawid1981,Arnold1981} introduced the term matrix normal and, in particular, \textcite{Arnold1981} provided several theoretical results, such as its density, moments and conditional distributions of its components. The matrix normal distribution is  a bilinear normal distribution; a distribution of a two-way array, each component
+representing a vector of observations \parencite{OhlsonEtAl2013}. \textcite{KolloVonRosen2005,Hoff2011,OhlsonEtAl2013} presented the extension of the bilinear to the multilinear normal distribution, what we call tensor normal, using a parallel extension of bilinear matrices to multilinear tensors \parencite{Comon2009}.
+
+The defining feature of the matrix normal distribution, and its tensor extension, is the Kronecker product structure of its covariance. This formulation, where the covariates are multivariate normal with multiway covariance structure modeled as a Kronecker product of matrices of much lower dimension, aims to overcome the significant modeling and computational challenges arising from the high computational complexity of manipulating tensor representations \parencite[see, e.g.,][]{HillarLim2013,WangEtAl2022}. 
+
+% Multilinear tensor normal models have been used in various applications, including medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatio-temporal data analysis \parencite{GreenewaldHero2014}, regression analysis for longitudinal relational data \parencite{Hoff2015}. One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI) \parencite{OhlsonEtAl2013}. A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.

 Suppose $\ten{X}\mid Y = y$ follows a tensor normal distribution  with mean $\ten{\mu}_y$ and covariance $\mat{\Sigma} = \bigkron_{k = r}^{1}\mat{\Sigma}_k$. We assume the distribution is non-degenerate which means that the covariances $\mat{\Sigma}_k$ are symmetric positive definite matrices. Its density is given by
 \begin{displaymath}
@ -556,11 +576,8 @@ A technical detail for numerical stability is to ensure that the scaled values $
 Furthermore, if the parameter space follows a more general setting as in \cref{thm:param-manifold}, updating may produces estimates outside  the parameter space. A simple and efficient method is to project every updated estimate onto the corresponding manifold.

 A standard method to compute the MLE of a Kronecker product is block-coordinate descent, also
-referred to as the ``flip-flop algorithm.'' This algorithm was proposed independently by \textcite{MardiaGoodall1993} and \textcite{Dutilleul1999} and was later called ``flip-flop'' algorithm by \textcite{LuZimmerman2005} for the computation of the maximum likelihood estimators of the components of a separable covariance matrix. \textcite{ManceurDutilleul2013} extended the ``flip-flop'' algorithm for the computation of the MLE of the separable covariance structure of a 3-way and 4-way normal distribution and obtained a lower bound for the sample size required for its existence. The same issue was also studied by \textcite{DrtonEtAl2020} in the case of a two-way array (matrix).
-
-\efi{here you need to explain the difference of yours with the flip-flop}
-This is (basically) the same algorithm as I use for the tensor normal except that I estimate the precision matrices $\mat{\Omega}_k = \mat{\Sigma}_k^{-1}$. 
-
+referred to as the ``flip-flop algorithm.'' This algorithm was proposed independently by \textcite{MardiaGoodall1993} and \textcite{Dutilleul1999} and was later called ``flip-flop'' algorithm by \textcite{LuZimmerman2005} for the computation of the maximum likelihood estimators of the components of a separable covariance matrix. \textcite{ManceurDutilleul2013} extended the ``flip-flop'' algorithm for the computation of the MLE of the separable covariance structure of a 3-way and 4-way normal distribution and obtained a lower bound for the sample size required for its existence. The same issue was also studied by \textcite{DrtonEtAl2020} in the case of a two-way array (matrix). Our algorithm uses a
+similar “flip-flop” approach by iteratively updating the $\mat{\beta}_k$'s and $\mat{\Omega}_k$'s, one after the other.


 \subsubsection{Matrix and Tensor Normal}
@ -627,7 +644,7 @@ To solve the optimization problem \eqref{eq:mle}, given a data set $(\ten{X}_i,
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsection{Initial Values}

-The first step is to get reasonable starting values. Experiments showed that a good starting value of $\mat{\beta}_k$, for $k = 1, \ldots, r$, it to use the tensor normal estimates from \cref{sec:tensor_normal_estimation}, considering $\ten{X}_i$ as continuous. For initial values of $\mat{\Omega}_k$, a different approach is required. Setting everything to the uninformed initial value, that is $\mat{\Omega}_k = \mat{0}$ as this corresponds to the conditional log odds to be $1:1$ for every component and pairwaide interaction. This is not possible, since $\mat{0}$ is a stationary point of the log-likelihood. This is directly observed by taking a look at the partial gradients of the log-likelihood in \cref{thm:grad}. Instead, we use a crude heuristic which threads every mode seperately and ignores any relation to the covariates. It is computationaly cheap and better than any of the alternatives we considered. For every $k = 1, \ldots, r$, let the $k$'th mode second moment estimate be
+The first step is to get reasonable starting values. Experiments showed that a good starting value of $\mat{\beta}_k$, for $k = 1, \ldots, r$, it to use the tensor normal estimates from \cref{sec:tensor-normal-estimation}, considering $\ten{X}_i$ as continuous. For initial values of $\mat{\Omega}_k$, a different approach is required. Setting everything to the uninformed initial value, that is $\mat{\Omega}_k = \mat{0}$ as this corresponds to the conditional log odds to be $1:1$ for every component and pairwaide interaction. This is not possible, since $\mat{0}$ is a stationary point of the log-likelihood. This is directly observed by taking a look at the partial gradients of the log-likelihood in \cref{thm:grad}. Instead, we use a crude heuristic which threads every mode seperately and ignores any relation to the covariates. It is computationaly cheap and better than any of the alternatives we considered. For every $k = 1, \ldots, r$, let the $k$'th mode second moment estimate be
 \begin{equation}\label{eq:ising-mode-moments}
    \hat{\mat{M}}_{2(k)} = \frac{p_k}{n p}\sum_{i = 1}^n (\ten{X}_i)_{(k)}\t{(\ten{X}_i)_{(k)}}
 \end{equation}
@ -664,7 +681,7 @@ In case of a finite number of observations, specifically in data sets with a sma

 The first situation where this needs to be addressed is in \eqref{eq:ising-init-Omegas}, where we set initial estimates for $\mat{\Omega}_k$. To avoid divition by zero as well as evaluating the log of zero, we addapt \eqref{eq:ising-mode-moments}, the mode wise moment estimates $\hat{\mat{M}}_{2(k)}$. A simple method is to replace the ``degenerate'' components, that are entries with value $0$ or $1$, with the smallest positive estimate of exactly one occurrence $p_k / n p$, or all but one occurrence $1 - p_k / n p$, respectively.

-The same problem is present in gradient optimization. Therefore, before starting the optimization, we detect degenerate combinations. We compute upper and lower bounds for the ``degenerate'' element in the Kronecker product $\hat{\mat{\Omega}} = \bigkron_{k = r}^{1}\hat{\mat{\Omega}}_k$. After every gradient update, we check if any of the ``degenerate'' elements fall outside of the bounds. In that case, we adjust all the elements of the Kronecker component estimates $\hat{\mat{\Omega}}_k$, corresponding to the ``degenerate'' element of their Kronecker product, to fall inside the precomputed bounds. While doing so, we try to alter every component as little as possible to ensure that the non-degenerate elements in $\hat{\mat{\Omega}}$, effected by this change due to its Kronecker structure, are altered as little as possible. The exact details are technically cumbersome while providing little insight. \todo{For more details we refer the reader to the source code prodived with the supplementary material.}
+The same problem is present in gradient optimization. Therefore, before starting the optimization, we detect degenerate combinations. We compute upper and lower bounds for the ``degenerate'' element in the Kronecker product $\hat{\mat{\Omega}} = \bigkron_{k = r}^{1}\hat{\mat{\Omega}}_k$. After every gradient update, we check if any of the ``degenerate'' elements fall outside of the bounds. In that case, we adjust all the elements of the Kronecker component estimates $\hat{\mat{\Omega}}_k$, corresponding to the ``degenerate'' element of their Kronecker product, to fall inside the precomputed bounds. While doing so, we try to alter every component as little as possible to ensure that the non-degenerate elements in $\hat{\mat{\Omega}}$, effected by this change due to its Kronecker structure, are altered as little as possible. The exact details are technically cumbersome while providing little insight.



@ -682,11 +699,8 @@ For estimation of dimensions $p$ bigger than $20$, we use a Monte-Carlo method t
 \end{figure}


-\newpage
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Statistical Properties}
-
-\subsection{Kronecker Product Manifolds}\label{sec:kron-manifolds}
+\section{Manifolds}\label{sec:manifolds}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 \cref{thm:sdr} identifies the sufficient reduction for the regression of $Y$ on $\ten{X}$ in the population. Any estimation of the sufficient reduction requires application of some optimality criterion. As we operate within the framework of the exponential family, we opted for maximum likelihood estimation (MLE). For the unconstrained problem, where the parameters are simply $\mat{B}$ and $\mat{\Omega}$ in \eqref{eq:eta1-manifold}, maximizing the likelihood of $\ten{X} \mid Y$ is straightforward and yields well-defined MLEs of both parameters. Our setting, though, requires the constrained optimization of the $\ten{X} \mid Y$ likelihood subject to  $\mat{B} = \bigotimes_{j = r}^{1}\mat{\beta}_j$ and $\mat{\Omega}=\bigkron_{j = r}^{1}\mat{\Omega}_j$. \Cref{thm:kron-manifolds,thm:param-manifold} provide the setting for which the MLE of the constrained parameter $\mat{\theta}$ is well-defined, which in turn leads to the derivation of its asymptotic normality.
@ -720,6 +734,8 @@ We also need the concept of a \emph{tangent space} to formulate asymptotic norma



+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Kronecker Product Manifolds}\label{sec:kron-manifolds}

 As a basis to ensure that the constrained parameter space $\Theta$ is a manifold, which is a requirement  of \cref{thm:param-manifold}, we need \cref{thm:kron-manifolds}. Therefore, we need the notion of a \emph{spherical} set, which is a set $\manifold{A}$, on which the Frobenius norm is constant. That is, $\|\,.\,\|_F:\manifold{A}\to\mathbb{R}$ is constant. Forthermore, we call a scale invariant set $\manifold{A}$ a \emph{cone}, that is $\manifold{A} = \{ c \mat{A} : \mat{A}\in\manifold{A} \}$ for all $c > 0$.

@ -748,6 +764,8 @@ As a basis to ensure that the constrained parameter space $\Theta$ is a manifold
    then the constrained parameter space $\Theta = \mathbb{R}^p \times \manifold{K}_{\mat{B}}\times\manifold{CK}_{\mat{\Omega}}\subset\mathbb{R}^{p(p + 2 q + 3) / 2}$ is a smooth embedded manifold.
 \end{theorem}

+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Matrix Manifolds}\label{sec:matrix-manifolds}
 A powerful side effect of \cref{thm:param-manifold} is the modeling flexibinity it provides. For example, we can perform low rank regression. Or, we may constrain two-way interactions between direct axis neighbors by using band matrices for the $\mat{\Omega}_k$'s, among others.

@ -765,9 +783,9 @@ This flexibility derives from many different matrix manifolds that can be used a
            \xmark     & \checkmark & $p q - q (q + 1) / 2$  \\ \hline
        $\mathcal{S}^{p - 1}$ & Unit sphere in $\mathbb{R}^p$, special case $\Stiefel{p}{1}$ &
            \xmark     & \checkmark & $p - 1$                \\ \hline
-        $\UnitaryGrp{p}$ & Unitary Group, special case $\Stiefel{p}{p}$ &
+        $\OrthogonalGrp{p}$ & Orthogonal Group, special case $\Stiefel{p}{p}$ &
            \xmark     & \checkmark & $p (p - 1) / 2$        \\ \hline
-        $\SpecialUnitaryGrp{p}$ & Special Unitary Group $\{ \mat{U}\in U(p) : \det{\mat{U}} = 1 \}$ &
+        $\SpecialOrthogonalGrp{p}$ & Special Orthogonal Group $\{ \mat{U}\in U(p) : \det{\mat{U}} = 1 \}$ &
            \xmark     & \checkmark & $p (p - 1) / 2$        \\ \hline
        $\mathbb{R}_{r}^{p\times q}$ & Matrices of known rank $r > 0$, generalizes $\StiefelNonCompact{p}{q}$ &
            \checkmark & \xmark     & $r(p + q - r)$         \\ \hline
@ -791,6 +809,8 @@ This flexibility derives from many different matrix manifolds that can be used a
 \end{remark}


+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Statistical Properties}
 \subsection{Asymptotics}\label{sec:asymtotics}

 Let $Z$ be a random variable distributed according to a parameterized probability distribution with density $f_{\mat{\theta_0}}\in\{ f_{\mat{\theta}} : \mat{\theta}\in\Theta \}$ where $\Theta$ is a subset of a Euclidean space. We want to estimate the parameter ${\mat{\theta}}_0$ using $n$ i.i.d. (independent and identically distributed) copies of $Z$. We assume a known, real-valued and measurable function $z\mapsto m_{\mat{\theta}}(z)$ for every $\mat{\theta}\in\Theta$ and that  ${\mat{\theta}}_0$ is the unique maximizer of the map $\mat{\theta}\mapsto M(\mat{\theta}) = \E m_{\mat{\theta}}(Z)$. For the estimation we maximize the empirical version
@ -861,13 +881,11 @@ for every non-empty compact $K\subset\Xi$.     Then, there exists a strong M-est
    \cref{thm:M-estimator-asym-normal-on-manifolds} has as special case Theorem~5.23 in \textcite{vanderVaart1998}, when $\Theta$ is an open subset of a Euclidean space, which is the simplest form of an embedded manifold.
 \end{remark}

-%%% I decided its fine to just NOT state those posibilities. Just gets comfusing without any additional insigt!
-% \todo{The assumptions of the following can be a bit weakened, is this necessary? For example the Hessian can be singular but is non-singular constrained to the tangent space. We can also only define $\mat{\theta}\mapsto m_{\mat{\theta}}$ only on the manifold which makes the statement much more technical, but way more general while we need to ensure that every smooth local extension of $\mat{\theta}\mapsto m_{\mat{\theta}}$ yields the same statement, which it does, but well, then it gets more complicated! Maybe add these things as a remark? The even more general formulation for Riemannian Manifolds is definitely over the top!}

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Simulations}
+\section{Simulations}\label{sec:simulations}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-In this section we report simulation results for the tensor normal and the Ising model where different aspects of the GMLM model are compared against other methods. The comparison methods are Tensor Sliced Inverse Regression (TSIR) \parencite{DingCook2015}, MGCCA \todo{\parencite{ChenEtAl2021,GirkaEtAl2024}} and the Tucker decomposition that is a higher-order form of principal component analysis (HOPCA) \textcite{KoldaBader2009}, for both continuous and binary data. For the latter, the binary values are  treated as continuous. As a base line we also include classic PCA on vectorized observations. In case of the Ising model, we also compare with LPCA (Logistic PCA) and CLPCA (Convex Logistic PCA), both introduced in \textcite{LandgrafLee2020}. All experiments are performed at sample size $n = 100, 200, 300, 500$ and $750$. Every experiment is repeated $100$ times.
+In this section we report simulation results for the tensor normal and the Ising model where different aspects of the GMLM model are compared against other methods. The comparison methods are Tensor Sliced Inverse Regression (TSIR) \parencite{DingCook2015}, Multiway Generalized Canonical Correlation Analysis (MGCCA) \parencite{ChenEtAl2021,GirkaEtAl2024} and the Tucker decomposition that is a higher-order form of principal component analysis (HOPCA) \textcite{KoldaBader2009}, for both continuous and binary data. For the latter, the binary values are  treated as continuous. As a base line we also include classic PCA on vectorized observations. In case of the Ising model, we also compare with LPCA (Logistic PCA) and CLPCA (Convex Logistic PCA), both introduced in \textcite{LandgrafLee2020}. All experiments are performed at sample size $n = 100, 200, 300, 500$ and $750$. Every experiment is repeated $100$ times.

 We are interested in the quality of the estimate of the true sufficient reduction $\ten{R}(\ten{X})$ from \cref{thm:sdr}. Therefore, we compare with the true vectorized reduction matrix $\mat{B} = \bigkron_{k = r}^{1}\mat{\beta}_k$, as it is compatible with any linear reduction method. The distance $d(\mat{B}, \hat{\mat{B}})$ between $\mat{B}\in\mathbb{R}^{p\times q}$ and an estimate $\hat{\mat{B}}\in\mathbb{R}^{p\times \tilde{q}}$ is the \emph{subspace distance} which is proportional to
 \begin{displaymath}
@ -917,70 +935,6 @@ An investigation into this behaviour revealed the problem in the estimation of t
 Simulation 1d), incorporating information about the covariance structure behaves similar to 1b), except that GMLM gains a statistically significant lead in estimation performance. The last simulation, 1e), where the model was misspecified for GMLM. GMLM, TSIR as well as MGCCA are on par where GMLM has a sligh lead in the small sample size setting and MGCCA overtakes in higher sample scenarios. The PCA and HOPCA methods both still outperformed. A wrong assumption about the relation to the response is still better than no relation at all.


-% \todo{How to describe that? I mean, sure, but what to write?}
-% The results of 1c) are surprising. The GMLM model behaves as expected, clearly being the best. The first surprise is that PCA, HOPCA and MGCCA are visually indistinguishable. This is explained by a high signal to noise ration in this particular example. But the biggest surprise is the failure of TSIR. It turns out that TSIR is usually well equipped to handle those specific low rank problems (given the true rank of the problem which is the case for all methods in every simulation), but by pure coincidence we picked a case where TSIR fails. Intending to pinpoint the specific problem we made another simulation where we change the tensor order $r$ from $2$ till $4$. Furthermore, we altered the coefficient $\rho$ for the auto regression type matrices $(\mat{\Omega}_k)_{i j} = \rho^{|i - j|}$. We let $\ten{F}_y$ be the $r$ times iterated outer product of the vector $(1, y)$. In the case of $r = 3$ this given the same $\ten{F}_y$ as in 1c). Then, we setup two scenarios where the definition of the true $\mat{\beta}_k$'s of rank $1$, for $k = 1, \ldots, r$, are different. The rest is identical to simulation 1c).
-% \begin{itemize}
-%    \item[V1)] The first version sets all $\mat{\beta}_k$'s identical to
-%   \begin{displaymath}
-%        \mat{\beta}_k = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}
-%    \end{displaymath}
-%    which gives the true vectorized reduction matrix $\mat{B} = \bigkron_{k = r}^{1}\mat{\beta}_k$ equal to a $2^r\times 2^r$ rank $1$ matrix with the first row all ones and the rest filled with zeros. The minimal true reduction is the $2^r$ dimensional first unit vector $\mat{e}_1$. In this setting the vectorized expected value is given by $\E[\vec{\ten{X}} \mid Y = y] = (1 + y)^r \bigkron_{k = r}^{1}\mat{\Omega}_k^{-1}$
-%
-%    \begin{displaymath}
-%        \mat{\Omega}_k = \begin{pmatrix}
-%            1 & \rho \\ \rho & 1
-%        \end{pmatrix}
-%    \end{displaymath}
-%    \begin{displaymath}
-%        \mat{\Sigma}_k = \mat{\Omega}_k^{-1} = \frac{1}{1 - \rho^2}\begin{pmatrix}
-%            1 & -\rho \\ -\rho & 1
-%        \end{pmatrix}
-%    \end{displaymath}
-%
-%    \begin{displaymath}
-%        \E[\ten{X} \mid Y = y] = \frac{(1 + y)^r}{(1 - \rho^2)^r}\bigouter_{k = 1}^{r}\begin{pmatrix}
-%            1 \\ -\rho
-%        \end{pmatrix}
-%    \end{displaymath}
-%    \begin{displaymath}
-%        \E[\vec{\ten{X}} \mid Y = y] = \frac{(1 + y)^r}{(1 - \rho^2)^r}\bigkron_{k = r}^{1}\begin{pmatrix}
-%            1 \\ -\rho
-%        \end{pmatrix}
-%    \end{displaymath}
-%
-%    In this setting only the first component of $\ten{X}$, that is $(\vec{\ten{X}})_1$, depends directly on $Y$ via $\E[(\vec{\ten{X}})_1\mid Y = y] = (1 + y)^r$. All other components contain information about $Y$ through the correlation structure only. \todo{check this!}
-%    \item[V2)] Similar to the $\mat{\beta}_k$'s in 1c), we set all $\mat{\beta}_k$'s identical to
-%    \begin{displaymath}
-%        \mat{\beta}_k = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}
-%    \end{displaymath}
-%
-%    \begin{displaymath}
-%        \E[\vec{\ten{X}}\mid Y = y] = \frac{(1 - y)^r}{(1 - \rho)^r}\bigkron_{k = r}^{1}\begin{pmatrix}
-%            1 \\ -1
-%        \end{pmatrix}
-%    \end{displaymath}
-%\end{itemize}
-
-%\begin{figure}[ht!]
-%    \centering
-%    \includegraphics[width = \textwidth]{plots/sim-tsir.pdf}
-%    \caption{\label{fig:sim-tsir}Simulation to investiage the unexpected failure of TSIR in simulation 1c.}
-%\end{figure}
-
-
-
-
-% simplified the simulation such that $p_k = q_k = 2$ for $k = 1, \ldots, r$. We let the functions $\ten{F}_y = \bigcirc_{k = 1}^{r}(1, y)$ 
-
-% Then, we simulate with $100$ replications per case where every case 
-
-
-
-% The setup which is ídentical in both cases is as follows. The response $Y$ is i.i.d. standard normal and the response tensor $\ten{F}_y$ consists of monomials with max order $r$. Its elements are equal to $(\ten{F}_y)_{\mat{i}} = y^{|\mat{i}| - r}$ where $\mat{i}$ is a multi-index of length $r$ and $|mat{i}|$ is the sum of th elements of $\mat{i}$.
-
-
-
-
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Ising Model}\label{sec:sim-ising}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -1041,7 +995,7 @@ if not mentioned otherwise in a specific simulation setup given next.
    \caption{\label{fig:sim-ising}Visualization of the simulation results for Ising GMLM. Sample size on the $x$-axis and the mean of subspace distance $d(\mat{B}, \hat{\mat{B}})$ over $100$ replications on the $y$-axis. Described in \cref{sec:sim-ising}.}
 \end{figure}

-The simulation results are visualized in \cref{fig:sim-ising}. Regardless of the simulation setting 2a-d), the comparative results are similar. We observe that PCA and HOPCA, both treating the response $\ten{X}$ as continuous, perform poorly. Not much better are LPCA and CLPCA. Similar to PCA and HOPCA they do not consider the relation to the response, but they are specifically created for binary predictors. Next we have MGCCA which is the first method considering the relation to the response $y$, clearly out-performing all the PCA variants. Even better is TSIR, regardless of the treatment of the predictors $\ten{X}$ as continuous, achieving very god results. Finally, the Ising GMLM model is the best in all the simulations although TSIR gets very close in some settings.
+The simulation results are visualized in \cref{fig:sim-ising}. Regardless of the simulation setting 2a-d), the comparative results are similar. We observe that PCA and HOPCA, both treating the response $\ten{X}$ as continuous, perform poorly. Not much better are LPCA and CLPCA. Similar to PCA and HOPCA they do not consider the relation to the response, but they are specifically created for binary predictors. Next we have MGCCA which is the first method considering the relation to the response $y$, clearly out-performing all the PCA variants. Even better is TSIR, regardless of the treatment of the predictors $\ten{X}$ as continuous, achieving very good results. Finally, the Ising GMLM model is the best in all the simulations although TSIR gets very close in some settings.

 % Due to the surprisingly good result of TSIR, we also applied the tensor normal GMLM model to the exact same simulation, simply treating the response $\ten{X}$ as continuous.
 % The raw linear reduction estimates of both the Ising GMLM and the tensor normal GMLM are basically indistinguishable, similar to the very similar results of the different PCA variants. The main reason for this specific
@ -1062,9 +1016,9 @@ The simulation results are visualized in \cref{fig:sim-ising}. Regardless of the


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Data Analysis}
+\section{Data Analysis}\label{sec:data-analysis}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-In this section we perform two applications of the GMLM model on real data. First example is the tensor normal model applied to EEG data. Next, we perform a prove of concept data analysis example for chess. % The main purpose of choosing chess is two fold. First, we can illustrate an explicit use case for the (till now ignored) linear constraint matrix $\mat{T}_2$\todo{check!}. Second, its a personal interest of one of the authors.\todo{???}
+In this section we perform two applications of the GMLM model on real data. First example is the tensor normal model applied to EEG data. Next, we perform a prove of concept data analysis example for chess.


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -1073,9 +1027,7 @@ The EEG data (\url{http://kdd.ics.uci.edu/databases/eeg/eeg.data.html}) is a sma

 For a comparison we reproduced the leave-one-out cross-validation EEG data analysis \textcite[Sec. 7]{PfeifferKaplaBura2021} for the classification task. In this data set, $p= p_1 p_2 = 16384$ is much larger than $n=122$. To deal with this issue, \textcite{PfeifferKaplaBura2021} used two approaches. In the first, pre-screening via (2D)$^2$PCA \parencite{ZhangZhou2005} reduced the dimensions to $(p_1, p_2) = (3, 4)$, $(15, 15)$ and $(20, 30)$. In the second, simultaneous dimension reductions and variable selection was carried out using the fast POI-C algorithm of \textcite{JungEtAl2019} (due to high computational high burden, only a 10-fold cross-validation was performed for fast POI-C).

-In contrast to \textcite{PfeifferKaplaBura2021},  our GMLM model can be applied directly to the  raw data of dimension $(256, 64)$ without pre-screening or variable selection. This was not possible for K-PIR as the time axis alone was in the large $p$ small $n$ regime with the $p_1 = 256 > n = 122$ leading to a singular time axis covariance. The same issue is present in the GMLM model, but the regularization trick used for numerical stability, as described in \cref{sec:tensor_normal_estimation}, resolves this without any change to the estimation procedure.
-In general, the sample size does not need to be large  for maximum likelihood estimation in the tensor normal model. \todo{In matrix normal models in particular,  \textcite{DrtonEtAl2020} proved that very small
-sample sizes are sufficient to obtain unique MLEs for  Kronecker covariance structures.}
+In contrast to \textcite{PfeifferKaplaBura2021}, our GMLM model can be applied directly to the raw data of dimension $(256, 64)$ without pre-screening or variable selection. This was not possible for K-PIR as the time axis alone was in the large $p$ small $n$ regime with the $p_1 = 256 > n = 122$ leading to a singular time axis covariance. The same issue is present in the GMLM model, but the regularization trick used for numerical stability, as described in \cref{sec:tensor-normal-estimation}, resolves this without any change to the estimation procedure. In general, the sample size does not need to be large for maximum likelihood estimation in the tensor normal model. In matrix normal models in particular, \cite{DrtonEtAl2020} proved that very small sample sizes, as little as $3$,\footnote{The required minimum sample size depends on a non-trivial algebraic relations between the mode dimensions, while the magnitude of the dimensions has no specific role.} are sufficient to obtain unique MLEs for Kronecker covariance structures.

 We use leave-one-out cross-validation to obtain unbiased AUC estimates. Then, we compare the GMLM model to the best performing methods from \textcite{PfeifferKaplaBura2021}, namely K-PIR (ls) and LSIR from \textcite{PfeifferForzaniBura2012} for $(p_1, p_2) = (3, 4)$, $(15, 15)$ and $(20, 30)$.

@ -1133,7 +1085,7 @@ which gives the complete $48$ dimensional vectorized reduction by stacking the p
        = (\vec{\ten{R}(\ten{X}_{\text{white pawn}})}, \ldots, \vec{\ten{R}(\ten{X}_{\text{black king}})})
        = \t{\mat{B}}\vec(\ten{X} - \E\ten{X}).
 \end{displaymath}
-The second line encodes all the piece wise reductions in a block diagonal full reduction matrix $\mat{B}$ of dimension $768\times 48$ which is applied to the vectorized 3D tensor $\ten{X}$ combining all the piece components $\ten{X}_{\mathrm{piece}}$ into a single tensor of dimension $8\times 8\times 12$. This is a reduction to $6.25\%$ of the original dimension. The $R^2$ statistic of the GAM fitted on $10^5$ new reduced samples is $R^2_{\mathrm{gam}}\approx 46\%$. A linear model on the reduced data achieves $R^2_{\mathrm{lm}}\approx 26\%$ which clearly shows the non-linear relation. On the other hand, the static evaluation of the \emph{Schach H\"ornchen}\footnote{Main authors personal chess engine, used code from the engine is provided in the supplementary material as the \texttt{Rchess} R-package in the \todo{supplementary material}.} engine, given the full position (\emph{not} reduced), achieves an $R^2_{\mathrm{hce}}\approx 52\%$. The $42\%$ are reasonably well compared to $51\%$ of the engine static evaluation which gets the original position and uses chess specific expect knowledge. Features the static evaluation includes, which are expected to be learned by the GMLM mixture model, are; \emph{material} (piece values) and \emph{piece square tables} (PSQT, preferred piece type positions). In addition, the static evaluation includes chess specific features like \emph{king safety}, \emph{pawn structure} or \emph{rooks on open files}. This lets us conclude that the reduction captures most of the relevant features possible, given the oversimplified modeling we performed.
+The second line encodes all the piece wise reductions in a block diagonal full reduction matrix $\mat{B}$ of dimension $768\times 48$ which is applied to the vectorized 3D tensor $\ten{X}$ combining all the piece components $\ten{X}_{\mathrm{piece}}$ into a single tensor of dimension $8\times 8\times 12$. This is a reduction to $6.25\%$ of the original dimension. The $R^2$ statistic of the GAM fitted on $10^5$ new reduced samples is $R^2_{\mathrm{gam}}\approx 46\%$. A linear model on the reduced data achieves $R^2_{\mathrm{lm}}\approx 26\%$ which clearly shows the non-linear relation. On the other hand, the static evaluation of the \emph{Schach H\"ornchen}\footnote{Main authors personal chess engine.} engine, given the full position (\emph{not} reduced), achieves an $R^2_{\mathrm{hce}}\approx 52\%$. The $42\%$ are reasonably well compared to $51\%$ of the engine static evaluation which gets the original position and uses chess specific expect knowledge. Features the static evaluation includes, which are expected to be learned by the GMLM mixture model, are; \emph{material} (piece values) and \emph{piece square tables} (PSQT, preferred piece type positions). In addition, the static evaluation includes chess specific features like \emph{king safety}, \emph{pawn structure} or \emph{rooks on open files}. This lets us conclude that the reduction captures most of the relevant features possible, given the oversimplified modeling we performed.

 \paragraph{Interpretation:} For a compact interpretation of the estimated reduction we construct PSQTs. To do so we use the linear model from the validation section. Then, we rewrite the combined linear reduction and linear model in terms of PSQTs. Let $\mat{B}$ be the $768\times 48$ full vectorized linear reduction. This is the block diagonal matrix with the $64\times 4$ dimensional per piece reductions $\mat{B}_{\mathrm{piece}} = \mat{\beta}^{\mathrm{piece}}_2\otimes\mat{\beta}^{\mathrm{piece}}_1$. Then, the linear model with coefficients $\mat{b}$ and intercept $a$ on the reduced data is given by
 \begin{equation}\label{eq:chess-lm}
@ -1177,6 +1129,79 @@ Next, goint one by one through the PSQTs, a few words about the prefered positio
 \appendix
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

+\section{Examples}\label{app:examples}
+
+\begin{example}[Vectorization]\label{ex:vectorization}
+    Given a matrix
+    \begin{displaymath}
+        \mat{A} = \begin{pmatrix}
+            1 & 4 & 7 \\
+            2 & 5 & 8 \\
+            3 & 6 & 9
+        \end{pmatrix}
+    \end{displaymath}
+    its vectorization is $\vec{\mat{A}} = \t{(1, 2, 3, 4, 5, 6, 7, 8, 9)}$ and its half vectorization $\vech{\mat{A}} = \t{(1, 2, 3, 5, 6, 9)}$. Let a $\ten{A}$ be a tensor of dimension $3\times 3\times 3$ given by
+    \begin{displaymath}
+        \ten{A}_{:,:,1} = \begin{pmatrix}
+            1 & 4 & 7 \\
+            2 & 5 & 8 \\
+            3 & 6 & 9
+        \end{pmatrix},
+        \qquad
+        \ten{A}_{:,:,2} = \begin{pmatrix}
+            10 & 13 & 16 \\
+            11 & 14 & 17 \\
+            12 & 15 & 18
+        \end{pmatrix},
+        \qquad
+        \ten{A}_{:,:,3} = \begin{pmatrix}
+            19 & 22 & 25 \\
+            20 & 23 & 26 \\
+            21 & 24 & 27
+        \end{pmatrix}.
+    \end{displaymath}
+    Then the vectorization of $\ten{A}$ is given by
+    \begin{displaymath}
+        \vec{\ten{A}} = \t{(1, 2, 3, 4, ..., 26, 27)}\in\mathbb{R}^{27}.
+    \end{displaymath}
+\end{example}
+
+\begin{example}[Matricization]
+    Let $\ten{A}$ be the $3\times 4\times 2$ tensor given by
+    \begin{displaymath}
+        \ten{A}_{:,:,1} = \begin{pmatrix}
+            1 & 4 & 7 & 10 \\
+            2 & 5 & 8 & 11 \\
+            3 & 6 & 9 & 12
+        \end{pmatrix},
+        \ten{A}_{:,:,2} = \begin{pmatrix}
+            13 & 16 & 19 & 22 \\
+            14 & 17 & 20 & 23 \\
+            15 & 18 & 21 & 24
+        \end{pmatrix}.
+    \end{displaymath}
+    Its matricizations are then
+    \begin{gather*}
+        \ten{A}_{(1)} = \begin{pmatrix}
+            1 &  4 &  7 & 10 & 13 & 16 & 19 & 22 \\
+            2 &  5 &  8 & 11 & 14 & 17 & 20 & 23 \\
+            3 &  6 &  9 & 12 & 15 & 18 & 21 & 24
+        \end{pmatrix},
+        \qquad
+        \ten{A}_{(2)} = \begin{pmatrix}
+             1 &  2 &  3 & 13 & 14 & 15 \\
+             4 &  5 &  6 & 16 & 17 & 18 \\
+             7 &  8 &  9 & 19 & 20 & 21 \\
+            10 & 11 & 12 & 22 & 23 & 24
+        \end{pmatrix}, \\
+        \ten{A}_{(3)} = \begin{pmatrix}
+             1 &  2 &  3 &  4 &  5 &  6 &  7 &  8 &  9 &  10 &  11 &  12 \\
+            13 & 14 & 15 & 16 & 17 & 18 & 19 & 20 & 21 &  22 &  23 &  24
+        \end{pmatrix}.
+    \end{gather*}
+\end{example}
+
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Tensor Calculus and Multi Linear Algebra}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -1284,10 +1309,6 @@ as well as for any tensor $\ten{A}$ of even order $2 r$ and matching square matr
    Equality $(a)$ uses the relation $\vec(\mat{C}\mat{a}\t{\mat{b}}) = (\mat{I}_{\dim(\mat{b})}\otimes\mat{C})\vec(\mat{a}\t{\mat{b}})$ for a matrix $\mat{C}$ and vectors $\mat{a}, \mat{b}$.
 \end{proof}

-% \begin{remark}
-%     \todo{simplification of $\mat{S}$ if all dimensions are equal, that is if $p_k = p_j$ and $p_k = q_k$ for all $k, j$?!}
-% \end{remark}
-
 \begin{remark}
    The permutation matrix $\mat{K}_{p, q}$ represents a perfect outer $p$-shuffle of $p q$ elements.
 \end{remark}
@ -1334,7 +1355,6 @@ which obtains that
 is also a sufficient reduction, though not necessarily minimal,     using $\mat{B} = \bigkron_{k = 1}^{r}\mat{\beta}_k$. When the exponential family is full rank, which in our setting amounts to all $\mat{\beta}_j$ being full rank matrices, $j=1,\ldots,r$, then \textcite[Thm~1]{BuraDuarteForzani2016} also obtains the minimality of the reduction.  
 \end{proof}

-\todo{check proof of Thm 2}
 \begin{proof}[Proof of \cref{thm:grad}]\label{proof:grad}
    We first note that for any exponential family with density \eqref{eq:quad-density} the term $b(\mat{\eta}_{y_i})$ differentiated with respect to the natural parameter $\mat{\eta}_{y_i}$ is the expectation of the statistic $\mat{t}(\ten{X})$ given $Y = y_i$. In our case we get $\nabla_{\mat{\eta}_{y_i}}b = (\nabla_{\mat{\eta}_{1{y_i}}}b, \nabla_{\mat{\eta}_2}b)$ with components
    \begin{displaymath}
@ -1405,7 +1425,7 @@ is also a sufficient reduction, though not necessarily minimal,     using $\mat{
 \end{proof}

 \begin{proof}[Proof of \cref{thm:kron-manifolds}]\label{proof:kron-manifolds}
-    We start by considering the first case and assume that $\manifold{B}$ is spherical with radius $1$ w.l.o.g. We equip $\manifold{K} = \{ \mat{A}\otimes \mat{B} : \mat{A}\in\manifold{A}, \mat{B}\in\manifold{B} \}\subset\mathbb{R}^{p_1 p_2\times q_1 q_2}$ with the subspace topology \efi{cite{?}}. Define the hemispheres $H_i^{+} = \{ \mat{B}\in\manifold{B} : (\vec{\mat{B}})_i > 0 \}$ and $H_i^{-} = \{ \mat{B}\in\manifold{B} : (\vec{\mat{B}})_i < 0 \}$ for $i = 1, ..., p_2 q_2$. The hemispheres are an open cover of $\manifold{B}$ with respect to the subspace topology. Define for every $H_i^{\pm}$, where $\pm$ is a placeholder for ether $+$ or $-$, the function
+    We start by considering the first case and assume that $\manifold{B}$ is spherical with radius $1$ w.l.o.g. We equip $\manifold{K} = \{ \mat{A}\otimes \mat{B} : \mat{A}\in\manifold{A}, \mat{B}\in\manifold{B} \}\subset\mathbb{R}^{p_1 p_2\times q_1 q_2}$ with the subspace topology\footnote{Given a topological space $(\manifold{M}, \mathcal{T})$ and a subset $\mathcal{S}\subseteq\manifold{M}$ the \emph{subspace topology} on $\manifold{S}$ induced by $\manifold{M}$ consists of all open sets in $\manifold{M}$ intersected with $\manifold{S}$, that is $\{ O\cap\manifold{S} : O\in\mathcal{T} \}$ \parencite{Lee2012,,Lee2018,Kaltenbaeck2021}.}. Define the hemispheres $H_i^{+} = \{ \mat{B}\in\manifold{B} : (\vec{\mat{B}})_i > 0 \}$ and $H_i^{-} = \{ \mat{B}\in\manifold{B} : (\vec{\mat{B}})_i < 0 \}$ for $i = 1, ..., p_2 q_2$. The hemispheres are an open cover of $\manifold{B}$ with respect to the subspace topology. Define for every $H_i^{\pm}$, where $\pm$ is a placeholder for ether $+$ or $-$, the function
    \begin{displaymath}
        f_{H_i^{\pm}} : \manifold{A}\times H_i^{\pm}\to\mathbb{R}^{p_1 p_2\times q_1 q_2}
            : (\mat{A}, \mat{B})\mapsto \mat{A}\otimes \mat{B}
@ -1749,117 +1769,4 @@ The following is a technical Lemma used in the proof of \cref{thm:asymptotic-nor
 \printbibliography[heading=bibintoc, title={References}]
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-\appendix
-\section{Examples}\label{app:examples}
-
-\begin{example}[Vectorization]\label{ex:vectorization}
-    Given a matrix
-    \begin{displaymath}
-        \mat{A} = \begin{pmatrix}
-            1 & 4 & 7 \\
-            2 & 5 & 8 \\
-            3 & 6 & 9
-        \end{pmatrix}
-    \end{displaymath}
-    its vectorization is $\vec{\mat{A}} = \t{(1, 2, 3, 4, 5, 6, 7, 8, 9)}$ and its half vectorization $\vech{\mat{A}} = \t{(1, 2, 3, 5, 6, 9)}$. Let a $\ten{A}$ be a tensor of dimension $3\times 3\times 3$ given by
-    \begin{displaymath}
-        \ten{A}_{:,:,1} = \begin{pmatrix}
-            1 & 4 & 7 \\
-            2 & 5 & 8 \\
-            3 & 6 & 9
-        \end{pmatrix},
-        \qquad
-        \ten{A}_{:,:,2} = \begin{pmatrix}
-            10 & 13 & 16 \\
-            11 & 14 & 17 \\
-            12 & 15 & 18
-        \end{pmatrix},
-        \qquad
-        \ten{A}_{:,:,3} = \begin{pmatrix}
-            19 & 22 & 25 \\
-            20 & 23 & 26 \\
-            21 & 24 & 27
-        \end{pmatrix}.
-    \end{displaymath}
-    Then the vectorization of $\ten{A}$ is given by
-    \begin{displaymath}
-        \vec{\ten{A}} = \t{(1, 2, 3, 4, ..., 26, 27)}\in\mathbb{R}^{27}
-    \end{displaymath}
-    while the half vectorization is
-    \begin{displaymath}
-        \vech{\ten{A}} = \t{(1, 2, 3, 5, 6, 9, 11, 12, 15, 21)}\in\mathbb{R}^{10}.
-    \end{displaymath}
-\end{example}
-
-\begin{example}[Matricization]
-    Let $\ten{A}$ be the $3\times 4\times 2$ tensor given by
-    \begin{displaymath}
-        \ten{A}_{:,:,1} = \begin{pmatrix}
-            1 & 4 & 7 & 10 \\
-            2 & 5 & 8 & 11 \\
-            3 & 6 & 9 & 12
-        \end{pmatrix},
-        \ten{A}_{:,:,2} = \begin{pmatrix}
-            13 & 16 & 19 & 22 \\
-            14 & 17 & 20 & 23 \\
-            15 & 18 & 21 & 24
-        \end{pmatrix}.
-    \end{displaymath}
-    Its matricizations are then
-    \begin{gather*}
-        \ten{A}_{(1)} = \begin{pmatrix}
-            1 &  4 &  7 & 10 & 13 & 16 & 19 & 22 \\
-            2 &  5 &  8 & 11 & 14 & 17 & 20 & 23 \\
-            3 &  6 &  9 & 12 & 15 & 18 & 21 & 24
-        \end{pmatrix},
-        \qquad
-        \ten{A}_{(2)} = \begin{pmatrix}
-             1 &  2 &  3 & 13 & 14 & 15 \\
-             4 &  5 &  6 & 16 & 17 & 18 \\
-             7 &  8 &  9 & 19 & 20 & 21 \\
-            10 & 11 & 12 & 22 & 23 & 24
-        \end{pmatrix}, \\
-        \ten{A}_{(3)} = \begin{pmatrix}
-             1 &  2 &  3 &  4 &  5 &  6 &  7 &  8 &  9 &  10 &  11 &  12 \\
-            13 & 14 & 15 & 16 & 17 & 18 & 19 & 20 & 21 &  22 &  23 &  24
-        \end{pmatrix}.
-    \end{gather*}
-\end{example}
-
-% \begin{example}[Symmetrization]
-%     Let $\ten{A}$ be the $3\times 3\times 3$ tensor from \cref{ex:vectorization}, then the symmetrization of $\ten{A}$ is
-%     \begin{align*}
-%         (\sym{\ten{A}})_{:,:,1} &= \frac{1}{6}\begin{pmatrix}
-%              6 & 32 &  58 \\
-%             32 & 58 &  84 \\
-%             58 & 84 & 110
-%         \end{pmatrix}, \\
-%         (\sym{\ten{A}})_{:,:,2} &= \frac{1}{6}\begin{pmatrix}
-%             32 &  58 &  84 \\
-%             58 &  84 & 110 \\
-%             84 & 110 & 136
-%         \end{pmatrix}, \\
-%         (\sym{\ten{A}})_{:,:,3} &= \frac{1}{6}\begin{pmatrix}
-%              58 &  84 & 110 \\
-%              84 & 110 & 136 \\
-%             110 & 136 & 162
-%         \end{pmatrix}.
-%     \end{align*}
-% \end{example}
-
-% \begin{example}[Half Vectorization]
-%     The half vectorization of a square matrix
-%     \begin{displaymath}
-%         \mat{A} = \begin{pmatrix}
-%             1 & 4 & 7 \\
-%             2 & 5 & 8 \\
-%             3 & 6 & 9
-%         \end{pmatrix}
-%     \end{displaymath}
-%     is
-%     \begin{displaymath}
-%         \vech{\mat{A}} = (1, 2, 3, 5, 6, 9).
-%     \end{displaymath}
-% \end{example}
-
 \end{document}