We start by introducing the notation we use throughout the paper. Let $\ten{A}=(\ten{A}_{i_1, \ldots, i_r})\in\mathbb{R}^{q_1\times\ldots\times q_r}$ be an order\footnote{Also referred to as rank, hence the variable name $r$; this term is \emph{not} used here as it invites confusion with the rank of a matrix.} $r$ tensor, where $r\in\mathbb{N}$ is the number of modes or axes (dimensions) of $\ten{A}$ and $\ten{A}_{i_1,\ldots,i_r}\in\mathbb{R}$ is its $(i_1, \ldots, i_r)$th entry. For example, a $p \times q$ matrix $\mat{B}$ has two modes, the rows and the columns. For matrices $\mat{B}_k\in\mathbb{R}^{p_k\times q_k}$, $k\in[r]=\{1, 2, \ldots, r\}$, the \emph{multi-linear multiplication}, or \emph{Tucker operator} \parencite{MultilinearOperators-Kolda2006}, is defined element-wise as
which results in an order $r$ tensor of dimension $p_1\times ...\times p_r$. A special case is the \emph{$k$-mode product} of the tensor $\ten{A}$ with a single matrix $\mat{B}_k$,
Furthermore, the notation $\ten{A}\mlm_{k\in S}\mat{B}_k$ is shorthand for the iterative application of the mode product for all indices in $S\subseteq[r]$. For example, $\ten{A}\mlm_{k\in\{2, 5\}}\mat{B}_k =\ten{A}\times_2\mat{B}_2\times_5\mat{B}_5$. Because $S$ is a set, this notation is unambiguous, as the mode product commutes for different modes: $j\neq k\Rightarrow\ten{A}\times_j\mat{B}_j\times_k\mat{B}_k =\ten{A}\times_k\mat{B}_k\times_j\mat{B}_j$.
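The following is a minimal numerical sketch of the mode product and the multi-linear multiplication in Python/NumPy, assuming the standard element-wise definition (modes are $0$-indexed in the code, while the text uses $1$-based modes).
\begin{verbatim}
import numpy as np

def mode_product(A, B, k):
    # k-mode product A x_k B: contracts mode k of A with the columns of B;
    # B has shape (p_k, q_k) and mode k of A has size q_k
    return np.moveaxis(np.tensordot(B, np.moveaxis(A, k, 0), axes=1), 0, k)

def multilinear(A, Bs):
    # Tucker operator over the modes in Bs; the order of application does
    # not matter since mode products for different modes commute
    for k, B in Bs.items():
        A = mode_product(A, B, k)
    return A

A = np.random.randn(2, 3, 4)
B1, B3 = np.random.randn(5, 2), np.random.randn(6, 4)
print(multilinear(A, {0: B1, 2: B3}).shape)   # (5, 3, 6)
\end{verbatim}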
The operator $\vec$ maps an array to a vector. For example, $\vec(\mat{B})$ stands for the $pq \times1$ vector obtained from the $p \times q$ matrix $\mat{B}$ by stacking its columns one after the other. For a tensor $\ten{A}=(a_{i_1,\ldots,i_r})$ of order $r$ and dimensions $q_1, \ldots, q_r$, $\vec(\ten{A})$ is the $q_1 q_2\ldots q_r \times1$ vector with the elements of $\ten{A}$ stacked one after the other, with the index of the first mode varying fastest and the index of mode $r$ varying slowest. For example, if $\ten{A}$ is a 3-dimensional array, then $\vec(\ten{A})=(\vec(\ten{A}(:,:,1))^T,\vec(\ten{A}(:,:,2))^T,\ldots,\vec(\ten{A}(:,:,q_3))^T)^T$. We use the notation $\ten{A}\equiv\ten{B}$ for objects $\ten{A}, \ten{B}$ of any shape if and only if $\vec(\ten{A})=\vec(\ten{B})$.
The \emph{inner product} between two tensors of the same order and dimensions is
which leads to the definition of the \emph{Frobenius norm} for tensors, $\|\ten{A}\|_F =\sqrt{\langle\ten{A}, \ten{A}\rangle}$, the straightforward extension of the Frobenius norm for matrices and vectors. %are also used for matrices while for a vector $\mat{a}$ the \emph{2 norm} is $\|\mat{a}\|_2 = \sqrt{\langle\mat{a}, \mat{a}\rangle}$.
The \emph{outer product} between two tensors $\ten{A}$ of dimensions $q_1, \ldots, q_r$ and $\ten{B}$ of dimensions $p_1, \ldots, p_l$ is a tensor $\ten{A}\circ\ten{B}$ of order $r + l$ and dimensions $q_1, \ldots, q_r, p_1, \ldots, p_l$ such that
Let $\K : \mathbb{R}^{q_1, ..., q_{2 r}}\to\mathbb{R}^{q_1 q_{r +1}, ..., q_r q_{2 r}}$ be defined element-wise with indices $1\leq i_j +1\leq q_j q_{r + j}$ for $j =1, ..., r$ as
where $\lfloor\,.\,\rfloor$ is the floor operation and $a\operatorname{mod}b$ is the integer division remainder of $a / b$. The mapping $\K$ is a linear operation that maps an order $2 r$ tensor to an order $r$ tensor by reshaping and permuting its elements. This operation allows us to define a generalization of the \emph{Kronecker product} to tensors as $\ten{A}\otimes\ten{B}=\K(\ten{A}\circ\ten{B})$.
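As a sanity check of this construction, the following NumPy sketch verifies, for matrices (order $r =2$), that reshaping and permuting the outer product as $\K$ does reproduces the usual matrix Kronecker product; the axis permutation used in the code is assumed to match the element-wise definition above.
\begin{verbatim}
import numpy as np

p1, q1, p2, q2 = 2, 3, 4, 5
A, B = np.random.randn(p1, q1), np.random.randn(p2, q2)

# outer product A o B: order-4 tensor with (A o B)[i,j,k,l] = A[i,j] * B[k,l]
outer = np.einsum('ij,kl->ijkl', A, B)

# group the row indices (i,k) and the column indices (j,l), then collapse
# each pair into a single mode; this is the reshaping/permutation K
K_of_outer = outer.transpose(0, 2, 1, 3).reshape(p1 * p2, q1 * q2)

print(np.allclose(K_of_outer, np.kron(A, B)))   # True
\end{verbatim}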
For tensors of order at least $2$, the \emph{flattening} (or \emph{unfolding} or \emph{matricization}) is a reshaping of the tensor into a matrix along a particular mode. For a tensor $\ten{A}$ of order $r$ and dimensions $q_1, \ldots, q_r$, the $k$-mode unfolding $\ten{A}_{(k)}$ is a $q_k\times\prod_{l=1, l\neq k}q_l$ matrix with %For the tensor $\ten{A} = (a_{i_1,\ldots,i_r})\in\mathbb{R}^{q_1, \ldots, q_r}$ the
elements %of the $k$ unfolded tensor $\ten{A}_{(k)}$ are
% The rank of a tensor $\ten{A}$ of dimensions $q_1\times ...\times q_r$ is vector-valued; that is, $\rank(\ten{A}) = (a_1, \ldots, a_r)\in[q_1]\times...\times[q_r]$, where $a_k = \rank(\ten{A}_{(k)})$ is the usual matrix rank of the $k$ unfolded tensor.
The gradient of a function $\ten{F}(\ten{X})$ of any shape, univariate, multivariate or tensor valued, with argument $\ten{X}$ of any shape is defined as
which is a matrix of dimension $p\times q$, where the vectorized quantities are $\vec{\ten{X}}\in\mathbb{R}^p$ and $\vec\ten{F}(\ten{X})\in\mathbb{R}^q$. This is consistent with the gradient $\nabla_{\mat{x}}f\in\mathbb{R}^{p\times1}$ of a real-valued function $f(\mat{x})$ with $\mat{x}\in\mathbb{R}^p$. \todo{Maybe reference magnus and abadir, magnus and neudecker?!}
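As a short illustration of this convention, consider the linear map $\ten{F}(\mat{x})=\mat{C}\mat{x}$ with $\mat{C}\in\mathbb{R}^{q\times p}$ and $\mat{x}\in\mathbb{R}^p$; then
\begin{displaymath}
\nabla_{\mat{x}}\ten{F} = \t{\mat{C}}\in\mathbb{R}^{p\times q},
\end{displaymath}
and for the real-valued special case $f(\mat{x})=\t{\mat{a}}\mat{x}$ this reduces to $\nabla_{\mat{x}}f =\mat{a}\in\mathbb{R}^{p\times 1}$.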
\todo{$\vech\ten{A}$ the half vectorization! Define via the vector containing the tensor elements with indices in \emph{reflected lexicographic order}? Also don't forget to figure out how to (if even) to define the half vectorization of a tensor (with all modes of the same dimensions)}
\todo{$\sym{\ten{A}}$ tensor needed?!}
\todo{I think the following examples are a good idea for the appendix.}
its vectorization is $\vec{\mat{A}}=\t{(1, 2, 3, 4, 5, 6, 7, 8, 9)}$ and its half vectorization $\vech{\mat{A}}=\t{(1, 2, 3, 5, 6, 9)}$. Let $\ten{A}$ be a tensor of dimension $3\times3\times3$ given by
Our goal is to identify the cumulative distribution function (cdf) $F$ of $Y\mid\ten{X}$, where $\ten{X}$ is assumed to have an order $r$ tensor structure of dimension $p_1\times ... \times p_r$ with continuous or discrete entries, and there is no constraint on the form of $Y$. The predictor $\ten{X}$ is a complex object; to simplify the problem we assume there exists a function $\ten{R}:\ten{X}\mapsto\ten{R}(\ten{X})$ such that $\ten{R}(\ten{X})$ is tensor valued of lower dimension so that
\begin{displaymath}
F(Y\mid\ten{X}) = F(Y\mid\ten{R}(\ten{X})).
\end{displaymath}
Since $\ten{R}(\ten{X})$ replaces the predictors without any effect on the conditional cdf of $Y\mid\ten{X}$, it is a \emph{sufficient reduction} for the regression $Y\mid\ten{X}$. We assume the tensor valued $\ten{R}(\ten{X})$ has dimension $q_1\times...\times q_r$ with $q_j\leq p_j$, $j =1, ..., r$, which represents a dimension reduction along every mode of $\ten{X}$. This formulation is flexible as it allows, for example, selecting ``important'' modes by reducing ``unimportant'' modes to dimension $1$.
To find such a reduction $\ten{R}$, we leverage the equivalence pointed out in \textcite{FisherLectures-Cook2007},
which means that a \textit{sufficient statistic} $\ten{R}(\ten{X})$ for $Y$ in the inverse regression $\ten{X}\mid Y$, where $Y$ is considered as a parameter indexing the model, is also a sufficient reduction for $\ten{X}$ in the forward regression $Y\mid\ten{X}$. The equivalent inverse regression in \eqref{eq:inverse-regression-sdr} provides an exhaustive characterization of $\ten{R}(\ten{X})$.
The usual tool to identify sufficient statistics is the factorization theorem, which requires a distributional model. Here we assume the distribution of $\ten{X}\mid Y$ belongs to the \emph{quadratic exponential family} in order to (a) simplify modeling and (b) keep estimation feasible. An important feature of the \emph{quadratic exponential family} is that its members are characterized by their first two moments. Specifically, we assume that the distribution of $\ten{X}\mid Y$ is a full rank member of the quadratic exponential family with density
where $\mat{t}_1(\ten{X})=\vec\ten{X}$ and $\mat{t}_2(\ten{X})$ is linear in $\ten{X}\circ\ten{X}$. The dependence of $\ten{X}$ on $Y$ is fully captured in the natural parameter $\mat{\eta}_y$. The function $h$ is non-negative and real-valued. The function $b$ is assumed to be at least twice continuously differentiable and strictly convex.
Distributions within the quadratic exponential family include the \emph{tensor normal}\todo{cite, if can be found} and the \emph{tensor Ising model}\todo{cite} (a generalization of the (inverse) Ising model, which is a multivariate Bernoulli model with up to second order interactions) and mixtures of these two.
In model \eqref{eq:quad-density}, the relationship of $\ten{X}$ and $Y$ is absorbed in $\mat{\eta}_y$, and $\mat{t}(\ten{X})$ is the minimal sufficient statistic for the \textit{pseudo}-parameter\footnote{$\mat{\eta}_y$ is a function of the response $Y$, thus it is not a parameter in the formal statistical sense. It is treated as a parameter when using the equivalence in \eqref{eq:inverse-regression-sdr}, where viewing $Y$ as a parameter is a device to derive the sufficient reduction from the inverse regression.} $\mat{\eta}_y =(\mat{\eta}_{1y}, \mat{\eta}_{2y})$ with
where the $d\times p(p +1)/2$ dimensional matrix $\mat{T}_2$ with $p =\prod_{i =1}^r p_i$ ensures that $\mat{\eta}_{2y}$ is of minimal dimension $d$. The matrix $\mat{T}_2$ is of full rank $d$ and is unique to specific members of the quadratic exponential family.
We can reexpress the exponent in \eqref{eq:quad-density} as
where $\mat{D}_p$ is the \emph{duplication matrix} from \textcite[Ch.~11]{MatrixAlgebra-AbadirMagnus2005}, defined so that $\mat{D}_p\vech\mat{A}=\vec\mat{A}$ for every symmetric $p\times p$ matrix $\mat{A}$, and $\pinv{\mat{D}_p}$ is its Moore-Penrose pseudo inverse. The first natural parameter component, $\mat{\eta}_{1y}$, captures the first order, and $\mat{\eta}_{2y}$, the second order relationship of $Y$ and $\ten{X}$. The quadratic exponential density of $\ten{X}\mid Y$ can then be expressed as
The exponential family in \eqref{eq:quadratic-exp-fam} is easily generalizable to any order. This, though, would result in the number of parameters becoming prohibitive to estimate. This is also the reason why we opted for the second order exponential family in our formulation.
By the equivalence in \eqref{eq:inverse-regression-sdr}, in order to find the sufficient reduction $\ten{R}(\ten{X})$ we need to infer $\mat{\eta}_{1y}$, and $\mat{\eta}_{2y}$. This is reminiscent of generalized linear modeling, which we extend to a multi-linear formulation next.
Suppose $\ten{F}_y$ is a known mapping of $y$%with values in $\mathbb{R}^{q_1\times ...\times q_r}$
with zero expectation $\E_Y\ten{F}_Y =0$. We assume the dependence of $\ten{X}$ and $Y$ is reflected only in the first parameter and let
where $\overline{\ten{\eta}}\in\mathbb{R}^{p_1\times\ldots\times p_r}$, $\mat{\Omega}\in\mathbb{R}^{p \times p}$ is positive definite with $p =\prod_{j =1}^{r} p_j$, and $c\in\mathbb{R}$ is a known constant determined by the distribution to ease modeling. That is, we assume that only $\mat{\eta}_{1y}$ depends on $Y$ through $\mat{B}$. The second parameter $\mat{\eta}_2$ captures the second order interaction structure of $\ten{X}$, which we assume not to depend on the response $Y$. In order to relate individual modes of $\ten{X}$ to the response, allowing flexibility in modeling, we assume $\ten{F}_y$ takes values in $\mathbb{R}^{q_1\times ...\times q_r}$; that is, $\ten{F}_y$ is a tensor valued independent variable in the inverse regression setting. This, in turn, leads to imposing corresponding tensor structure on the regression parameter $\mat{B}$. Thus, \eqref{eq:eta1-manifold} becomes
where $\mat{B}=\bigotimes_{j = r}^{1}\mat{\beta}_j$ and the component matrices $\mat{\beta}_j\in\mathbb{R}^{p_j\times q_j}$ are of known rank for $j =1, \ldots, r$. Given that $p$ is potentially very large, we further assume that
where $\mat{\Omega}_j\in\mathbb{R}^{p_j\times p_j}$ are symmetric positive definite matrices for $j =1, \ldots, r$. That is, we require $\mat{\Omega}=\bigotimes_{j = r}^{1}\mat{\Omega}_j$, which substantially reduces the number of parameters to estimate in $\mat{\Omega}$. Assuming the $\mat{\Omega}_j$'s to be positive definite is possible due to the constant $c$.
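To illustrate the extent of this reduction, consider the free parameter counts: an unconstrained $\mat{B}\in\mathbb{R}^{p\times q}$ has $p q =\prod_{j =1}^{r} p_j q_j$ entries, while the Kronecker structured $\mat{B}=\bigotimes_{j = r}^{1}\mat{\beta}_j$ is determined by the $\sum_{j =1}^{r} p_j q_j$ entries of its factors; likewise, $\mat{\Omega}=\bigotimes_{j = r}^{1}\mat{\Omega}_j$ requires only $\sum_{j =1}^{r} p_j (p_j +1)/2$ instead of $p(p +1)/2$ values. For example, with $r =2$, $p_1 = p_2 =10$ and $q_1 = q_2 =2$, this amounts to $40$ instead of $400$ entries for $\mat{B}$ and $110$ instead of $5050$ for $\mat{\Omega}$.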
Equation \eqref{eq:eta2} is underdetermined since $\t{(\mat{T}_2\pinv{\mat{D}_p})}$ has full column rank $d < p^2$ (with a non-strict inequality if $\ten{X}$ is univariate) but $\mat{\eta}_2$ is uniquely determined given any $\mat{\Omega}$ as $\t{(\pinv{(\mat{T}_2\pinv{\mat{D}_p})})}$ has full row rank. We let $\mat{\xi}=(\vec{\overline{\ten{\eta}}}, \vec{\mat{B}}, \vech{\mat{\Omega}})$ be a $p(p +2 q +3)/2$-parameter vector and $\mat{\theta}=(\vec{\overline{\ten{\eta}}}, \vec{\mat{B}}, \vech{\mat{\Omega}})$ be the constrained parameter vector, where $\mat{B}=\bigotimes_{j = r}^{1}\mat{\beta}_j$ and $\mat{\Omega}=\bigotimes_{j = r}^{1}\mat{\Omega}_j$. We also let $\Xi$ and $\Theta$ denote the unconstrained and constrained parameter spaces, with $\mat{\xi}$ and $\mat{\theta}$ varying in $\Xi$ and $\Theta$, respectively. The parameter space $\Xi$ is an open subset of $\mathbb{R}^{p(p +2 q +3)/2}$ so that \eqref{eq:quadratic-exp-fam} is a proper density.
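The dimension of $\mat{\xi}$ follows from counting the components $\vec{\overline{\ten{\eta}}}\in\mathbb{R}^{p}$, $\vec{\mat{B}}\in\mathbb{R}^{p q}$ with $q =\prod_{j =1}^{r} q_j$, and $\vech{\mat{\Omega}}\in\mathbb{R}^{p(p +1)/2}$, giving
\begin{displaymath}
p + p q + \frac{p(p +1)}{2} = \frac{p(p +2 q +3)}{2}.
\end{displaymath}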
\todo{Maybe already here introduce the ``constraint'' set of $\Omega$'s allowed as $\{\Omega\in\mathbb{R}_{++}^{p\times p} : \vec{\Omega}=\t{(\pinv{(\mat{T}_2\pinv{\mat{D}_p})}\mat{T}_2\pinv{\mat{D}_p})}\vec{\Omega}\}$}
In a classical \emph{generalized linear model} (GLM), the link function connecting the natural parameters to the expectation of the sufficient statistic $\mat{\eta}_y =\mat{g}(\E[\mat{t}(\ten{X})\mid Y = y])$ is invertible. Such a link may not exist in our setting, but for our purpose what we call the ``inverse'' link suffices. The ``inverse'' link $\widetilde{\mat{g}}$ exists as the natural parameters fully describe the distribution. As in the non-minimal formulation \eqref{eq:quadratic-exp-fam}, we define the ``inverse'' link through its tensor valued components
as $\widetilde{\mat{g}}(\mat{\eta}_y)=(\vec\ten{g}_1(\mat{\eta}_y), \vec\ten{g}_2(\mat{\eta}_y))$.
Under the quadratic exponential family model \eqref{eq:quadratic-exp-fam}, a sufficient reduction for the regression of $Y$ on $\ten{X}$ is given in \cref{thm:sdr}.
A sufficient reduction for the regression $Y\mid\ten{X}$ under the quadratic exponential family inverse regression model \eqref{eq:quadratic-exp-fam} with natural parameters modeled \eqref{eq:eta1} and \eqref{eq:eta2} is given by
The reduction \eqref{eq:sdr} is minimal if $\mat{\beta}_j$ are full rank for all $j=1,\ldots,r$.
\end{theorem}
The reduction in vectorized form is $\vec\ten{R}(\ten{X})=\t{\mat{B}}\vec(\ten{X}-\E\ten{X})$, where $\mat{B}=\bigotimes_{k = r}^{1}\mat{\beta}_k$ with $\Span(\mat{B})=\Span(\{\mat{\eta}_{1y}-\E_{Y}\mat{\eta}_{1Y} : y\in\mathcal{S}_Y\})$, using $\mathcal{S}_Y$ to denote the set of values of the random variable $Y$.
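The following NumPy sketch shows, under hypothetical $\mat{\beta}_k$'s, how the reduction can be evaluated either mode-wise or in the vectorized form above; it also checks numerically that both agree.
\begin{verbatim}
import numpy as np

def mode_product(A, B, k):
    # k-mode product: contract mode k of A with the columns of B
    return np.moveaxis(np.tensordot(B, np.moveaxis(A, k, 0), axes=1), 0, k)

def reduction(X, X_mean, betas):
    # R(X) = (X - E X) x_1 beta_1^T x_2 ... x_r beta_r^T
    R = X - X_mean
    for k, beta in enumerate(betas):
        R = mode_product(R, beta.T, k)
    return R

p_dims, q_dims = (4, 3, 5), (2, 2, 2)
betas = [np.random.randn(p, q) for p, q in zip(p_dims, q_dims)]
X, X_mean = np.random.randn(*p_dims), np.zeros(p_dims)

R = reduction(X, X_mean, betas)               # tensor of dimension 2 x 2 x 2

# vectorized form: vec R(X) = B^T vec(X - E X), B = beta_r ox ... ox beta_1
B = betas[-1]
for beta in betas[-2::-1]:
    B = np.kron(B, beta)
vecR = B.T @ (X - X_mean).reshape(-1, order='F')
print(np.allclose(vecR, R.reshape(-1, order='F')))   # True
\end{verbatim}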
\cref{thm:sdr} shows that the \emph{sufficient reduction} $\ten{R}(\ten{X})$ reduces $\ten{X}$ along each mode linearly. The graph in \cref{fig:SDRvisual} is a visual depiction of the sufficient reduction.
\begin{figure}
\centering
\begin{scaletikzpicturetowidth}{0.5\textwidth}
\input{images/reduction.tex}
\end{scaletikzpicturetowidth}
\caption{\label{fig:SDRvisual}Visual depiction of the sufficient reduction in \cref{thm:sdr}.}
For a vector valued predictor $\mat{X}\in\mathbb{R}^p$, the tensor order is $r =1$ and the collection of parameters is $\mat{\theta}=(\overline{\mat{\eta}}, \mat{\beta}, \mat{\Omega})$ with $\overline{\mat{\eta}}\in\mathbb{R}^p$, $\mat{\beta}\in\StiefelNonCompact{p}{q}$ and $\mat{\Omega}\in\SymPosDefMat{p}$, where $\mat{f}_y\in\mathbb{R}^q$ is a known function of the response $Y$. The conditional density of $\mat{X}\mid Y = y$ is given by
using the relation of $\mat{\theta}$ to the natural parameters given by $\mat{\eta}_{1y}(\mat{\theta})=\overline{\mat{\eta}}+\mat{\beta}\mat{f}_y$ and $\mat{\eta}_2(\theta)= c\,\mat{\Omega}$.
% where the number of unknown parameters is $p + \dim(\StiefelNonCompact{p}{q}) + \dim(\SymPosDefMat{p}) = p\frac{p + 2 q + 3}{2}$.
\end{example}
\begin{example}[Matrix valued $\mat{X}$ ($r =2$)]
Assume $\mat{X}$ is matrix valued, that is $r =2$. Then $\mat{\theta}=(\overline{\mat{\eta}}, \mat{\beta}_1, \mat{\beta}_2, \mat{\Omega}_1, \mat{\Omega}_2)$, where the intercept term $\overline{\mat{\eta}}\in\mathbb{R}^{p_1\times p_2}$ is now matrix valued. Similar to \cref{ex:vector-dist}, with $\mat{F}_y\in\mathbb{R}^{q_1\times q_2}$ being matrix valued, the conditional density of $\mat{X}\mid Y = y$ reads
Suppose $(\ten{X}_i, Y_i)$ are independently and identically distributed with joint cdf $F(\ten{X}, Y)$, for $i =1, \ldots, n$. The empirical log-likelihood function of \eqref{eq:quadratic-exp-fam} under \eqref{eq:eta1} and \eqref{eq:eta2}, ignoring terms not depending on the parameters, is
with $\hat{\mat{\theta}}_n =(\vec\widehat{\overline{\ten{\eta}}}, \vec\widehat{\mat{B}}, \vech\widehat{\mat{\Omega}})$ where $\widehat{\mat{B}}=\bigkron_{k = r}^{1}\widehat{\mat{\beta}}_k$ and $\widehat{\mat{\Omega}}=\bigkron_{k = r}^{1}\widehat{\mat{\Omega}}_k$.
A straightforward and general method for parameter estimation is \emph{gradient descent}. To apply gradient based optimization, we provide the gradients of $l_n$ in \cref{thm:grad}.
For $n$ i.i.d. observations $(\ten{X}_i, y_i)$, $i =1, ..., n$, the log-likelihood has the form \eqref{eq:log-likelihood} with $\mat{\theta}$ being the collection of all GMLM parameters $\overline{\ten{\eta}}$, ${\mat{B}}=\bigkron_{k = r}^{1}{\mat{\beta}}_k$ and ${\mat{\Omega}}=\bigkron_{k = r}^{1}{\mat{\Omega}}_k$. Furthermore, let $\ten{G}_2(\mat{\eta}_y)$ be a tensor of dimensions $p_1, \ldots, p_r$ such that
Then, the partial gradients with respect to $\overline{\ten{\eta}}, \mat{\beta}_1, \ldots, \mat{\beta}_r, \mat{\Omega}_1, \ldots, \mat{\Omega}_r$ are given by
Although the general case of any GMLM model can be fitted via gradient descent using \cref{thm:grad}, this may be very inefficient. In the case of the tensor normal distribution, an iterative cyclic updating scheme can be derived by setting the partial gradients of \cref{thm:grad} to zero. Described in more detail in \cref{sec:tensor_normal_estimation}, this method has much faster convergence, is stable and does not require any hyperparameters. On the other hand, the Ising model does not allow such a scheme. There we need to use a gradient based method, which is the subject of \cref{sec:ising_estimation}.
With $\ten{X}\mid Y = y$ following a tensor normal distribution, its density, with mean $\ten{\mu}_y$ and covariance $\mat{\Sigma}=\bigkron_{k = r}^{1}\mat{\Sigma}_k$ is given by
We assume the distribution is non-degenerate, which means that the covariances $\mat{\Sigma}_k$ are symmetric positive definite matrices. For the sake of simplicity, and w.l.o.g., we assume $\ten{X}$ has marginal expectation equal to zero, $\E\ten{X}=0$. Rewriting this in the quadratic exponential family form \eqref{eq:quadratic-exp-fam} determines the scaling constant $c =-1/2$, and gives the relation to the GMLM parameters $\overline{\ten{\eta}}, \mat{\beta}_k$ and $\mat{\Omega}_k$, for $k =1, \ldots, r$, as
where we used that $\overline{\ten{\eta}}=0$ due to $0=\E\ten{X}=\E\E[\ten{X}\mid Y]=\E\ten{\mu}_Y$ in combination with $\E\ten{F}_Y =0$. Additionally, all the $\mat{\Omega}_k$'s are symmetric positive definite, because the $\mat{\Sigma}_k$'s are. This leads to another simplification: the matrix $\mat{T}_2$ from \eqref{eq:t-stat} is equal to the identity. This also means that the gradients of the log-likelihood $l_n$ in \cref{thm:grad} simplify. We get
\begin{displaymath}
\ten{g}_1(\mat{\eta}_y) = \E[\ten{X}\mid Y = y] = \ten{\mu}_y, \qquad
In practice, we start the estimation process by demeaning a given data set $(\ten{X}_i, \ten{F}_{y_i})$, $i =1, \ldots, n$. After demeaning, only the reduction matrices $\mat{\beta}_k$ and the scatter matrices $\mat{\Omega}_k$ need to be estimated. Then, to solve the optimization problem \eqref{eq:mle}, with $\overline{\ten{\eta}}=0$ after demeaning, we need initial values. For the initial estimates $\hat{\mat{\beta}}_k^{(0)}$ we use a simple heuristic approach. First, we compute moment based mode-wise marginal covariance estimates $\widehat{\mat{\Sigma}}_k(\ten{X})$ and $\widehat{\mat{\Sigma}}_k(\ten{F}_Y)$ given by
Then, for every mode $k =1, \ldots, r$, we compute the first $j =1, \ldots, q_k$ eigenvectors $\mat{v}_j(\widehat{\mat{\Sigma}}_k(\ten{X}))$, $\mat{v}_j(\widehat{\mat{\Sigma}}_k(\ten{F}_Y))$ and eigenvalues $\lambda_j(\widehat{\mat{\Sigma}}_k(\ten{X}))$, $\lambda_j(\widehat{\mat{\Sigma}}_k(\ten{F}_Y))$ of the marginal covariance estimates. Let,
The initial values of $\mat{\Omega}_k$ are set to the identity $\mat{\Omega}_k^{(0)}=\mat{I}_{p_k}$.
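The following NumPy sketch outlines this initialization; the mode-wise covariance normalization and the exact combination of the eigenpairs into $\hat{\mat{\beta}}_k^{(0)}$ are assumptions standing in for the expressions displayed above.
\begin{verbatim}
import numpy as np

def unfold(A, k):
    # k-mode unfolding; the column ordering convention is irrelevant for
    # the covariance computed below
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1)

def marginal_cov(tensors, k):
    # moment based mode-wise marginal covariance (1/n normalization assumed)
    return sum(unfold(T, k) @ unfold(T, k).T for T in tensors) / len(tensors)

def leading_eig(S, q):
    lam, V = np.linalg.eigh(S)                  # ascending eigenvalues
    return lam[::-1][:q], V[:, ::-1][:, :q]     # first q eigenpairs

def initial_values(Xs, Fs, q_dims):
    betas, Omegas = [], []
    for k, qk in enumerate(q_dims):
        lam_X, V_X = leading_eig(marginal_cov(Xs, k), qk)
        lam_F, V_F = leading_eig(marginal_cov(Fs, k), qk)
        # placeholder combination of the eigenpairs (hypothetical); the
        # paper's exact formula for beta_k^(0) is the displayed one
        betas.append((V_X * np.sqrt(lam_X)) @ (V_F * np.sqrt(lam_F)).T)
        Omegas.append(np.eye(Xs[0].shape[k]))   # Omega_k^(0) = I_{p_k}
    return betas, Omegas
\end{verbatim}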
Given any estimates $\hat{\mat{\beta}}_1, \ldots, \hat{\mat{\beta}}_r, \hat{\mat{\Omega}}_1, \ldots, \hat{\mat{\Omega}}_r$, we take the gradient $\nabla_{\mat{\beta}_j}$ of the log-likelihood $l_n$ in \eqref{eq:log-likelihood} of the tensor normal distribution from \cref{thm:grad} and keep all parameters except $\mat{\beta}_j$ fixed. Then, $\nabla_{\mat{\beta}_j}l_n =0$ has the closed form solution
For the scatter matrices $\mat{\Omega}_j$, a small detour is needed. Equating the partial gradient of the $j$th scatter matrix $\mat{\Omega}_j$ in \cref{thm:grad} to zero, $\nabla_{\mat{\Omega}_j}l_n =0$, gives a quadratic matrix equation due to the dependence of $\ten{\mu}_y$ on $\mat{\Omega}_j$. In practice though, it is more stable, faster and equally accurate to use mode-wise covariance estimates via the residuals
which lacks only a scaling $\tilde{s}$ such that $\tilde{s}\tilde{\mat{\Sigma}}_j =\hat{\mat{\Omega}}_j^{-1}$. The scaling condition is that the mean squared error has to be equal to the trace of the covariance estimate,
resulting in the estimates $\hat{\mat{\Omega}}_j =(\tilde{s}\tilde{\mat{\Sigma}}_j)^{-1}$.
Estimation is then performed by updating the estimates $\hat{\mat{\beta}}_j$ via \eqref{eq:tensor_normal_beta_solution} for $j =1, \ldots, r$, and then recomputing the $\hat{\mat{\Omega}}_j$ estimates simultaneously while keeping the $\hat{\mat{\beta}}_j$'s fixed. This procedure is repeated until convergence. Convergence is very fast; in experiments it usually occurred in fewer than $10$ iterations.
A technical detail for numerical stability is to ensure that the scaled intermediate values $\tilde{s}\tilde{\mat{\Sigma}}_j$, assumed to be symmetric and positive definite, are well conditioned. Otherwise, computing $\hat{\mat{\Omega}}_j =(\tilde{s}\tilde{\mat{\Sigma}}_j)^{-1}$ can be numerically unstable. Therefore, we estimate the condition number of $\tilde{s}\tilde{\mat{\Sigma}}_j$ before computing the inverse. If it is ill conditioned, we regularize and use $\hat{\mat{\Omega}}_j =(\tilde{s}\tilde{\mat{\Sigma}}_j +0.2\lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)\mat{I}_{p_j})^{-1}$ instead, where $\lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)$ is the first (maximum) eigenvalue. Experiments showed that this regularization is usually required in the first few iterations only.
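A minimal sketch of this guarded inversion in NumPy follows; the condition number threshold is an arbitrary choice for illustration, as the text prescribes only the regularizer $0.2\lambda_1\mat{I}_{p_j}$, not the cut-off.
\begin{verbatim}
import numpy as np

def guarded_inverse(S, max_cond=1e10):
    # S is the scaled mode-wise covariance s * Sigma_j (symmetric pos. def.);
    # the threshold max_cond is an assumption of this sketch
    if np.linalg.cond(S) > max_cond:
        lam1 = np.linalg.eigvalsh(S)[-1]        # largest eigenvalue
        S = S + 0.2 * lam1 * np.eye(S.shape[0])
    return np.linalg.inv(S)                     # estimate of Omega_j
\end{verbatim}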
Furthermore, we may assume that the parameter space follows the more general setting of \cref{thm:param-manifold}. In this case, updating may produce estimates outside of the parameter space. A simple and efficient remedy is to project every updated estimate onto the corresponding manifold.
\cref{thm:sdr} identifies the sufficient reduction for the regression of $Y$ on $\ten{X}$ in the population. Any estimation of the sufficient reduction requires application of some optimality criterion. As we operate within the framework of the exponential family, we opted for maximum likelihood estimation (MLE). For the unconstrained problem, where the parameters are simply $\mat{B}$ and $\mat{\Omega}$ in \eqref{eq:eta1-manifold}, maximizing the likelihood of $\ten{X}\mid Y$ is straightforward and yields well-defined MLEs of both parameters. Our setting, though, requires the constrained optimization of the $\ten{X}\mid Y$ likelihood subject to $\mat{B}=\bigotimes_{j = r}^{1}\mat{\beta}_j$ and $\mat{\Omega}=\bigkron_{j = r}^{1}\mat{\Omega}_j$. \Cref{thm:kron-manifolds,thm:param-manifold} provide the setting for which the MLE of the constrained parameter $\mat{\theta}$ is well-defined, which in turn leads to the derivation of its asymptotic normality.
The main problem to be solved for asymptotic results of the MLE of the constrained parameter $\mat{\theta}=(\overline{\ten{\eta}}, \vec\mat{B}, \vech\mat{\Omega})$ derives from the nature of the constraint. We assumed that $\mat{B}=\bigkron_{k = r}^{1}\mat{\beta}_k$, where the parameter $\mat{B}$ is identifiable. This means that different values of $\mat{B}$ lead to different densities $f_{\mat{\theta}}(\ten{X}\mid Y = y)$, a basic property needed to ensure consistency of parameter estimates, which in turn is needed for asymptotic normality. On the other hand, the components $\mat{\beta}_j$, $j =1, \ldots, r$, are \emph{not} identifiable, a direct consequence of the identity $\mat{\beta}_2\otimes\mat{\beta}_1=(c\mat{\beta}_2)\otimes(c^{-1}\mat{\beta}_1)$ for every $c\neq0$. This is the reason we formulated $\Theta$ as a constrained parameter space instead of parameterizing the densities of $\ten{X}\mid Y$ with respect to the components $\mat{\beta}_1, \ldots, \mat{\beta}_r$. The situation for $\mat{\Omega}=\bigkron_{k = r}^{1}\mat{\Omega}_k$ is the same.
In addition to identifiable parameters, the asymptotic normality in \cref{thm:asymptotic-normality-gmlm} requires a notion of differentiation on the parameter space, which is usually provided by a vector space structure. This is too strong an assumption for our purposes. To weaken the vector space assumption we consider \emph{smooth manifolds}, that is, spaces which locally look like Euclidean space and allow a notion of differentiation. The more general \emph{topological} manifolds are too weak for differentiation. To make matters worse, a smooth manifold only allows first derivatives. Without going into any details, a \emph{Riemannian manifold} is the solution. Similar to an abstract \emph{smooth manifold}, these spaces are detached from our usual intuition and are complicated to handle in an already complicated setting. This is where an \emph{embedded (sub)manifold} comes to the rescue. Simply speaking, an embedded manifold is a manifold which is a subset of a manifold from which it inherits its properties. If a manifold is embedded in Euclidean space, almost all the complications of the abstract manifold theory simplify drastically. Moreover, since Euclidean space is itself a Riemannian manifold, we inherit the means for higher derivatives. Finally, a smooth embedded submanifold structure for the parameter space maintains consistency with existing approaches and results for parameter sets with linear subspace structure. These are the reasons we require the constrained parameter space $\Theta$ to be a \emph{smooth embedded submanifold} of an open subset $\Xi$ of Euclidean space.
Now, we directly define a \emph{smooth manifold} embedded in $\mathbb{R}^p$ without any detour through the more general theory. See, for example, \textcite{introToSmoothMani-Lee2012,introToRiemannianMani-Lee2018,optimMatrixMani-AbsilEtAl2007,aufbauAnalysis-kaltenbaeck2021} among others.
\begin{definition}[Manifolds]\label{def:manifold}
A set $\manifold{A}\subseteq\mathbb{R}^p$ is an \emph{embedded smooth manifold} of dimension $d$ if for every $\mat{x}\in\manifold{A}$ there exists a smooth\footnote{Here \emph{smooth} means infinitely differentiable or $C^{\infty}$.} bi-continuous map $\varphi:U\cap\manifold{A}\to V$, called a \emph{chart}, with $\mat{x}\in U\subseteq\mathbb{R}^p$ open and $V\subseteq\mathbb{R}^d$ open.
\end{definition}
We also need the concept of a \emph{tangent space} to formulate asymptotic normality in a way which is independent of a particular coordinate representation. Intuitively, the tangent space at a point $\mat{x}\in\manifold{A}$ of the manifold $\manifold{A}$ is the set of all velocity vectors $\t{\nabla\gamma(0)}$ of curves $\gamma:(-1, 1)\to\manifold{A}$ passing through $\mat{x}=\gamma(0)$, see \cref{fig:torus}. Locally, at $\mat{x}=\gamma(0)$ with a chart $\varphi$, we can write $\gamma(t)=\varphi^{-1}(\varphi(\gamma(t)))$, which gives $\Span\t{\nabla\gamma(0)}\subseteq\Span\t{\nabla\varphi^{-1}(\varphi(\mat{x}))}$. Taking the union over all smooth curves through $\mat{x}$ gives equality. The following definition leverages the simplified setup of smooth manifolds in Euclidean space.
Let $\manifold{A}\subseteq\mathbb{R}^p$ be an embedded smooth manifold and $\mat{x}\in\manifold{A}$. The \emph{tangent space} at $\mat{x}$ of $\manifold{A}$ is defined as
\caption{\label{fig:torus}Visualization of the tangent space $T_{\mat{x}}\manifold{A}$ at $\mat{x}$ of the torus $\manifold{A}$. The torus $\manifold{A}$ is a 2-dimensional embedded manifold in $\mathbb{R}^3$. The tangent space $T_{\mat{x}}\manifold{A}\subset\mathbb{R}^3$ is the 2-dimensional hyperplane visualized with its origin $\mat{0}$ shifted to $\mat{x}$. Moreover, two curves $\gamma_1, \gamma_2$ on the torus are drawn with $\mat{x}=\gamma_1(0)=\gamma_2(0)$. The curve velocity vectors $\t{\nabla\gamma_1(0)}$ and $\t{\nabla\gamma_2(0)}$ are drawn as tangent vectors with root $\mat{x}$.}
\end{figure}
As a basis to ensure that the constrained parameter space $\Theta$ is a manifold, which is the statement of \cref{thm:param-manifold}, we need \cref{thm:kron-manifolds}. For this, we need the notion of a \emph{spherical} set, which is a set $\manifold{A}$ on which the Frobenius norm is constant, that is, $\|\,.\,\|_F:\manifold{A}\to\mathbb{R}$ is constant. Furthermore, we call a scale invariant set $\manifold{A}$ a \emph{cone}, that is, $\manifold{A}=\{ c \mat{A} : \mat{A}\in\manifold{A}\}$ for all $c > 0$.
Let $\manifold{A}\subseteq\mathbb{R}^{p_1\times q_1}\backslash\{\mat{0}\}, \manifold{B}\subseteq\mathbb{R}^{p_2\times q_2}\backslash\{\mat{0}\}$ be smooth embedded submanifolds. Assume one of the following conditions holds.
\begin{itemize}
\item[-] ``sphere condition'':
At least one of $\manifold{A}$ or $\manifold{B}$ is \emph{spherical} and let $d =\dim\manifold{A}+\dim\manifold{B}$.
\item[-] ``cone condition'':
Both $\manifold{A}$ and $\manifold{B}$ are \emph{cones} and let $d =\dim\manifold{A}+\dim\manifold{B}-1$.
\end{itemize}
Then, $\{\mat{A}\otimes\mat{B} : \mat{A}\in\manifold{A}, \mat{B}\in\manifold{B}\}\subset\mathbb{R}^{p_1 p_2\times q_1 q_2}$ is a smooth embedded $d$-manifold.
where $\manifold{B}_k\subset\mathbb{R}^{p_k\times q_k}\backslash\{\mat{0}\}$ and $\manifold{O}_k\subset\mathbb{R}^{p_k\times p_k}\backslash\{\mat{0}\}$ are smooth embedded manifolds which are either spherical or cones, for $k =1, ..., r$. Furthermore, let
then the constrained parameter space $\Theta=\mathbb{R}^p \times\manifold{K}_{\mat{B}}\times\manifold{CK}_{\mat{\Omega}}\subset\mathbb{R}^{p(p +2 q +3)/2}$ is a smooth embedded manifold.
\item$\mat{\Omega}$ is a Kronecker product: Kronecker products of pos. def. matrices are elements of a manifold
\item$\mat{B}$ is a Kronecker product: any combination of $\mat{\beta}_j$ full rank, rank $r$, or semi-orthogonal matrices would be an element of a manifold
\end{enumerate}
}
\subsection{Asymptotics}
Let $Z$ be a random variable distributed according to a parameterized probability distribution with density $f_{\mat{\theta_0}}\in\{ f_{\mat{\theta}} : \mat{\theta}\in\Theta\}$ where $\Theta$ is a subset of Euclidean space. We want to estimate the parameter ${\mat{\theta}}_0$ using $n$ i.i.d. (independent and identically distributed) copies of $Z$. We assume a known, real-valued and measurable function $z\mapsto m_{\mat{\theta}}(z)$ for every $\mat{\theta}\in\Theta$ while ${\mat{\theta}}_0$ maximizes the map $\mat{\theta}\mapsto M(\mat{\theta})=\E m_{\mat{\theta}}(Z)$ uniquely. For the estimation we maximize the empirical version
An \emph{M-estimator} $\hat{\mat{\theta}}_n =\hat{\mat{\theta}}_n(Z_1, ..., Z_n)$ is a maximizer of the objective function $M_n$ over the parameter space $\Theta$, defined as
It is not necessary to have a perfect maximizer; as long as the objective has a finite supremum, it is sufficient to take an \emph{almost maximizer} $\hat{\mat{\theta}}_n$ as defined in the following:
\begin{definition}[weak and strong M-estimators]
An estimator $\hat{\mat{\theta}}_n$ for the objective function $M_n$ in \eqref{eq:Mn} with $\sup_{\mat{\theta}\in\Theta}M_n(\mat{\theta}) < \infty$ such that
Assume $Z =(\ten{X}, Y)$ satisfies model \eqref{eq:quadratic-exp-fam} subject to \eqref{eq:eta1-manifold} and \eqref{eq:eta2-manifold} with true constrained parameter $\mat{\theta}_0=(\overline{\eta}_0, \mat{B}_0, \mat{\Omega}_0)\in\Theta$, where $\Theta$ is defined in \cref{thm:param-manifold}. Under the regularity \crefrange{cond:differentiable-and-convex}{cond:finite-sup-on-compacta} in the appendix, there exists a strong M-estimator sequence $\hat{\mat{\theta}}_n$ deriving from $l_n$ in \eqref{eq:log-likelihood} over $\Theta$. Furthermore, any strong M-estimator $\hat{\mat{\theta}}_n$ converges in probability to the true parameter $\mat{\theta}_0$ over $\Theta$. That is, $\hat{\mat{\theta}}_n\xrightarrow{p}\mat{\theta}_0$. Moreover, every strong M-estimator $\hat{\mat{\theta}}_n$ is asymptotically normal,
The following is a reformulation of \textcite[Lemma~2.3]{asymptoticMLE-BuraEtAl2018} which assumes Condition~2.2 to hold. But the existence of a mapping in Condition~2.2 is not needed for Lemma~2.3. It suffices that the restricted parameter space $\Theta$ is a subset of the unrestricted parameter space $\Xi$, which is trivially satisfied in our setting. Under this, \cref{thm:exists-strong-M-estimator-on-subsets} follows directly from \textcite[Lemma~2.3]{asymptoticMLE-BuraEtAl2018}.
\begin{theorem}[Existence of strong M-estimators on Subsets]\label{thm:exists-strong-M-estimator-on-subsets}
Assume there exists a (weak/strong) M-estimator $\hat{\mat{\xi}}_n$ for $M_n$ over $\Xi$, then there exists a strong M-estimator $\hat{\mat{\theta}}_n$ for $M_n$ over any non-empty $\Theta\subseteq\Xi$.
\end{theorem}
\begin{theorem}[Existence and Consistency of M-estimators on Subsets]\label{thm:M-estimator-consistency-on-subsets}
Let $\Xi$ be a convex open subset of a Euclidean space and $\Theta\subseteq\Xi$ non-empty. Assume $\mat{\xi}\mapsto m_{\mat{\xi}}(z)$ is a strictly concave function on $\Xi$ for almost all $z$ and $z\mapsto m_{\mat{\xi}}(z)$ is measurable for all $\mat{\xi}\in\Xi$. Furthermore, let $M(\mat{\xi})=\E m_{\mat{\xi}}(Z)$ be a well defined function with a unique maximizer $\mat{\theta}_0\in\Theta\subseteq\Xi$, that is $M(\mat{\theta}_0) > M(\mat{\xi})$ for all $\mat{\xi}\neq\mat{\theta}_0$. Also, let for every non-empty compact $K\subset\Xi$
Then, there exists a strong M-estimator $\hat{\mat{\theta}}_n$ of $M_n(\mat{\theta})=\frac{1}{n}\sum_{i =1}^{n} m_{\mat{\theta}}(Z_i)$ over the subset $\Theta$. Moreover, any strong M-estimator $\hat{\mat{\theta}}_n$ of $M_n$ over $\Theta$ converges in probability to $\mat{\theta}_0$, that is $\hat{\mat{\theta}}_n\xrightarrow{p}\mat{\theta}_0$.
\end{theorem}
\todo{The assumptions of the following can be a bit weakened, is this neccessary? For example the Hessian can be singular but is non-singular constraint to the tangent space. We can also only define $\mat{\theta}\mapsto m_{\mat{\theta}}$ only on the manifold which makes the statement much more technical, but way more general while we need to ensure that every smooth local extention of $\mat{\theta}\mapsto m_{\mat{\theta}}$ yields the same statement, which it does, but well, then it gets more complicated! Maybe add these things as a remark? The even more general formulation for Riemannian Manifolds is definitely over the top!}
\begin{theorem}[Asymptotic Normality for M-estimators on Manifolds]\label{thm:M-estimator-asym-normal-on-manifolds}
Let $\Theta\subseteq\mathbb{R}^p$ be a smooth embedded manifold. For each $\mat{\theta}$ in a neighborhood in $\mathbb{R}^p$ of the true parameter $\mat{\theta}_0\in\Theta$ let $z\mapsto m_{\mat{\theta}}(z)$ be measurable and $\mat{\theta}\mapsto m_{\mat{\theta}}(z)$ be differentiable at $\mat{\theta}_0$ for almost all $z$. Assume also that there exists a measurable function $u$ such that $\E[u(Z)^2] < \infty$, and for almost all $z$ as well as all $\mat{\theta}_1, \mat{\theta}_2$ in a neighborhood of $\mat{\theta}_0$ such that
Moreover, assume that $\mat{\theta}\mapsto\E[m_{\mat{\theta}}(Z)]$ admits a second-order Taylor expansion at $\mat{\theta}_0$ in a neighborhood of $\mat{\theta}_0$ in $\mathbb{R}^p$ with a non-singular Hessian $\mat{H}_{\mat{\theta}_0}=\nabla^2_{\mat{\theta}}\E[m_{\mat{\theta}}(Z)]|_{\mat{\theta}=\mat{\theta}_0}\in\mathbb{R}^{p\times p}$.
If $\hat{\mat{\theta}}_n$ is a strong M-estimator of $\mat{\theta}_0$ in $\Theta$, then $\hat{\mat{\theta}}_n$ is asymptotically normal
where $\mat{\Pi}_{\mat{\theta}_0}=\mat{P}_{\mat{\theta}_0}\pinv{(\t{\mat{P}_{\mat{\theta}_0}}\mat{H}_{\mat{\theta}_0}\mat{P}_{\mat{\theta}_0})}\t{\mat{P}_{\mat{\theta}_0}}$ and $\mat{P}_{\mat{\theta}_0}$ is any matrix whose span is the tangent space $T_{\mat{\theta}_0}\Theta$ of $\Theta$ at $\mat{\theta}_0$.
\end{theorem}
\begin{remark}
\cref{thm:M-estimator-asym-normal-on-manifolds} has as a special case Theorem~5.23 in \textcite{asymStats-van_der_Vaart1998}, when $\Theta$ is an open subset of a Euclidean space as opposed to a smooth embedded manifold.
\todo{I don't like it that much, mention that an open set if an embedded manifold implying that $\mat{\Pi}_{\mat{\theta}_0}=\mat{H}_{\mat{\theta}_0}^{-1}$}
\begin{lemma}\label{thm:kron-perm}
Given $r \geq2$ matrices $\mat{A}_k$ of dimension $p_k\times q_k$ for $k =1, \ldots, r$, there exists a unique permutation matrix $\mat{S}_{\mat{p}, \mat{q}}$ such that
where $\mat{K}_{p, q}$ is the \emph{commutation matrix} from \textcite[Ch.~11]{MatrixAlgebra-AbadirMagnus2005}, that is the permutation such that $\vec{\t{\mat{A}}}=\mat{K}_{p, q}\vec{\mat{A}}$ for every $p\times q$ dimensional matrix $\mat{A}$.
\end{lemma}
\begin{proof}
\textcite[Lemma~7]{SymMatandJacobians-MagnusNeudecker1986} states that
This proves the statement for $r =2$. The general statement for $r > 2$ follows via induction. Assuming \eqref{eq:kron-to-outer-perm} holds for $r -1$, the induction step is then;
\begin{multline*}
\vec{\bigkron_{k = r}^{1}}\mat{A}_k
= \vec\Bigl(\mat{A}_r\otimes\bigkron_{k = r - 1}^{1}\mat{A}_k\Bigr)
Equality $(a)$ uses the relation $\vec(\mat{C}\mat{a}\t{\mat{b}})=(\mat{I}_{\dim(\mat{b})}\otimes\mat{C})\vec(\mat{a}\t{\mat{b}})$ for a matrix $\mat{C}$ and vectors $\mat{a}, \mat{b}$.
\end{proof}
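The two-factor case of \cref{thm:kron-perm} can be checked numerically. The following NumPy sketch builds the commutation matrix $\mat{K}_{p,q}$ explicitly and verifies the relation $\vec(\mat{A}\otimes\mat{B}) =(\mat{I}\otimes\mat{K}_{q, m}\otimes\mat{I})(\vec{\mat{A}}\otimes\vec{\mat{B}})$ for $\mat{A}$ of dimension $m\times n$ and $\mat{B}$ of dimension $p\times q$, which is the form of \textcite[Lemma~7]{SymMatandJacobians-MagnusNeudecker1986} assumed here.
\begin{verbatim}
import numpy as np

def vec(M):
    # column-major vectorization, consistent with vec in the text
    return M.reshape(-1, order='F')

def commutation(p, q):
    # commutation matrix K_{p,q}: vec(A^T) = K_{p,q} vec(A) for A of size p x q
    K = np.zeros((p * q, p * q))
    for i in range(p):
        for j in range(q):
            K[j + q * i, i + p * j] = 1.0
    return K

m, n, p, q = 2, 3, 4, 5
A, B = np.random.randn(m, n), np.random.randn(p, q)

S = np.kron(np.kron(np.eye(n), commutation(q, m)), np.eye(p))
print(np.allclose(vec(np.kron(A, B)), S @ np.kron(vec(A), vec(B))))   # True
\end{verbatim}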
% \begin{remark}
% \todo{simplification of $\mat{S}$ if all dimensions are equal, that is if $p_k = p_j$ and $p_k = q_k$ for all $k, j$?!}
% \end{remark}
\begin{remark}
The permutation matrix $\mat{K}_{p, q}$ represents a perfect outer $p$-shuffle of $p q$ elements.
\begin{proof}[Proof of \cref{thm:sdr}]\label{proof:sdr}
A direct implication of \textcite[Theorem~1]{sdr-BuraDuarteForzani2016} is that, under the exponential family \eqref{eq:quadratic-exp-fam} with natural statistic \eqref{eq:t-stat},
is a sufficient reduction, where $\mat{\alpha}\in\mathbb{R}^{(p + d)\times q}$ with $\Span(\mat{\alpha})=\Span(\{\mat{\eta}_y -\E_{Y}\mat{\eta}_Y : y\in\mathcal{S}_Y\})$. Since $\E_Y\ten{F}_Y =0$, $\E_Y\mat{\eta}_{1 Y}=\E[\vec\overline{\ten{\eta}}-\mat{B}\vec\ten{F}_Y]=\vec\overline{\ten{\eta}}$. Thus,
is also a sufficient reduction, though not necessarily minimal, using $\mat{B}=\bigkron_{k = r}^{1}\mat{\beta}_k$. When the exponential family is full rank, which in our setting amounts to all $\mat{\beta}_j$, $j=1,\ldots,r$, being full rank matrices, \textcite[Thm~1]{sdr-BuraDuarteForzani2016} also obtains the minimality of the reduction.
\end{proof}
\todo{check proof of Thm 2}
\begin{proof}[Proof of \cref{thm:grad}]\label{proof:grad}
We first note that for any exponential family with density \eqref{eq:quad-density} the term $b(\mat{\eta}_{y_i})$ differentiated with respect to the natural parameter $\mat{\eta}_{y_i}$ is the expectation of the statistic $\mat{t}(\ten{X})$ given $Y = y_i$. In our case we get $\nabla_{\mat{\eta}_{y_i}}b =(\nabla_{\mat{\eta}_{1{y_i}}}b, \nabla_{\mat{\eta}_2}b)$ with components
\begin{displaymath}
\nabla_{\mat{\eta}_{1{y_i}}}b
= \E[\mat{t}_1(\ten{X})\mid Y = y_i]
= \vec\E[\ten{X}\mid Y = y_i]
= \vec\ten{g}_1(\mat{\eta}_{y_i})
\end{displaymath}
and
\begin{align*}
\nabla_{\mat{\eta}_{2}}b
&= \E[\mat{t}_2(\ten{X})\mid Y = y_i]
= \E[\mat{T}_2\vech((\vec\ten{X})\t{(\vec\ten{X})})\mid Y = y_i]\\
&= \E[\mat{T}_2\pinv{\mat{D}_p}\vec(\ten{X}\circ\ten{X})\mid Y = y_i]
The gradients are related to their derivatives by transposition, $\nabla_{\mat{\eta}_{1{y_i}}}b =\t{\D b(\mat{\eta}_{1{y_i}})}$ and $\nabla_{\mat{\eta}_2}b =\t{\D b(\mat{\eta}_2)}$.
Next we provide the differentials of the natural parameter components from \eqref{eq:eta1} and \eqref{eq:eta2} in a quite direct form, without any further ``simplifications'', because the downstream computations do not benefit from reexpressing the following
\todo{in terms of $m_{\mat{\theta}}$, meaning without the sum, makes it a bit nicer!}
All other combinations, namely $\d\mat{\eta}_{1{y_i}}(\mat{\Omega}_j)$, $\d\mat{\eta}_2(\overline{\ten{\eta}})$ and $\d\mat{\eta}_2(\mat{\beta}_j)$, are zero.
Continuing with the partial differentials of $l_n$ from \eqref{eq:log-likelihood}
where the last equality holds because $\mat{N}_p =\mat{D}_p\pinv{\mat{D}_p}$ is the symmetrizer matrix from \textcite[Ch.~11]{MatrixAlgebra-AbadirMagnus2005}. The symmetrizer matrix $\mat{N}_p$ satisfies $\mat{N}_p\vec{\mat{A}}=\vec{\mat{A}}$ if $\mat{A}=\t{\mat{A}}$, while
\begin{displaymath}
\vec{\ten{g}_2(\mat{\eta}_y)} = \vec\E[\ten{X}\circ\ten{X}\mid Y = y] = \vec\E[(\vec{\ten{X}})\t{(\vec{\ten{X}})}\mid Y = y]
\end{displaymath}
is the vectorization of a symmetric matrix.
\end{proof}
\begin{proof}[Proof of \cref{thm:kron-manifolds}]\label{proof:kron-manifolds}
We start by considering the first case and assume that $\manifold{B}$ is spherical with radius $1$ w.l.o.g. We equip $\manifold{K}=\{\mat{A}\otimes\mat{B} : \mat{A}\in\manifold{A}, \mat{B}\in\manifold{B}\}\subset\mathbb{R}^{p_1 p_2\times q_1 q_2}$ with the subspace topology \efi{cite{?}}. Define the hemispheres $H_i^{+}=\{\mat{B}\in\manifold{B} : (\vec{\mat{B}})_i > 0\}$ and $H_i^{-}=\{\mat{B}\in\manifold{B} : (\vec{\mat{B}})_i < 0\}$ for $i =1, ..., p_2 q_2$. The hemispheres are an open cover of $\manifold{B}$ with respect to the subspace topology. Define for every $H_i^{\pm}$, where $\pm$ is a placeholder for ether $+$ or $-$, the function
which is smooth. With the spherical property of $\manifold{B}$ the relation $\|\mat{A}\otimes\mat{B}\|_F =\|\mat{A}\|_F$ for all $\mat{A}\otimes\mat{B}\in\manifold{K}$ ensures that the function $f_{H_i^{\pm}}$, constrained to its image, is bijective with inverse function (identifying $\mathbb{R}^{p\times q}$ with $\mathbb{R}^{p q}$) given by
where $\pm$ is $+$ for a ``positive'' hemisphere $H_i^{+}$ and $-$ otherwise, $\mat{e}_i\in\mathbb{R}^{p_2 q_2}$ is the $i$th unit vector and $\mat{R}(\mat{C})$ is a ``reshaping'' permutation \footnote{Relating to $\K$ the operation $\mat{R}$ is basically its inverse as $\K(\mat{A}\circ\mat{B})=\mat{A}\otimes\mat{B}$ with a mismatch in the shapes only.} which acts on Kronecker products as $\mat{R}(\mat{A}\otimes\mat{B})=(\vec{\mat{A}})\t{(\vec{\mat{B}})}$. This makes $f_{H_i^{\pm}}^{-1}$ a combination of smooth functions ($\mat{0}$ is excluded from $\manifold{A}, \manifold{B}$ guarding against division by zero) and as such it is also smooth. This ensures that $f_{H_i^{\pm}} : \manifold{A}\times{H_i^{\pm}}\to f_{H_i^{\pm}}(\manifold{A}\times{H_i^{\pm}})$ is a diffeomorphism.
Next, we construct an atlas\footnote{A collection of charts $\{\varphi_i : i\in I \}$ with index set $I$ of a manifold $\manifold{A}$ is called an \emph{atlas} if the pre-images of the charts $\varphi_i$ cover the entire manifold $\manifold{A}$.} for $\manifold{K}$ which is equipped with the subspace topology. Let $(\varphi_j, U_j)_{j\in J}$ be an atlas of $\manifold{A}\times\manifold{B}$. Such an atlas exists, and the product $\manifold{A}\times\manifold{B}$ admits a unique smooth structure, as both $\manifold{A}$ and $\manifold{B}$ are embedded manifolds. The images of the coordinate domains $f_H(U_j)$ are open in $\manifold{K}$, since $f_H$ is a diffeomorphism, with the corresponding coordinate maps
By construction, the set $\{\phi_{H_i^{\pm},j} : i =1, ..., p_2 q_2, \pm\in\{+, -\}, j\in J \}$ is an atlas if the charts are compatible, meaning that the transition maps are diffeomorphisms. Let $(\phi_{H, j}, V_j), (\phi_{\widetilde{H}, k}, V_k)$ be two arbitrary charts from our atlas; then the transition map $\phi_{\widetilde{H}, k}\circ\phi_{H,j}^{-1}:\phi_{H,j}^{-1}(V_j\cap V_k)\to\phi_{\widetilde{H},k}^{-1}(V_j\cap V_k)$ has the form
where $\pm$ depends on $H, \widetilde{H}$ being of the same ``sign'' and $\mathrm{id}$ is the identity. We conclude that the charts are compatible, which makes the constructed set of charts an atlas. With that we have shown that the topological manifold $\manifold{K}$ with the subspace topology admits a smooth atlas that makes it an embedded smooth manifold with dimension equal to the dimension of the product manifold $\manifold{A}\times\manifold{B}$; that is, $d =\dim\manifold{A}+\dim\manifold{B}$.
It remains to show that the cone condition also yields a smooth manifold. If both $\manifold{A}$ and $\manifold{B}$ are cones, then $\manifold{K}=\{\mat{A}\otimes\mat{B} : \mat{A}\in\manifold{A}, \mat{B}\in\widetilde{\manifold{B}}\}$, where $\widetilde{\manifold{B}}=\{\mat{B}\in\manifold{B} : \|\mat{B}\|_F =1\}$. Since $g:\manifold{B}\to\mathbb{R}:\mat{B}\mapsto\|\mat{B}\|_F$ is smooth on $\manifold{B}$ with full rank $1$ everywhere, $\widetilde{\manifold{B}}= g^{-1}(1)$ is a $\dim{\manifold{B}}-1$ dimensional embedded submanifold of $\manifold{B}$. An application of the spherical case proves the cone case.
\end{proof}
\begin{proof}[Proof of \cref{thm:param-manifold}]
An application of \cref{thm:kron-manifold-tangent-space} ensures that $\manifold{K}_{\mat{B}}$ and $\manifold{K}_{\mat{\Omega}}$ are embedded submanifolds.
Since $\mat{T}_2$ is a $d\times p(p +1)/2$ matrix of full rank and the duplication matrix $\mat{D}_p$ is of full rank with dimension $p^2\times p(p +1)/2$, the matrix $\mat{T}_2\pinv{\mat{D}_p}$ is $d\times p^2$ of full rank. This means that $\mat{P}=\pinv{(\mat{T}_2\pinv{\mat{D}_p})}\mat{T}_2\pinv{\mat{D}_p}$ is a $p^2\times p^2$ projection of rank $d$ and $\mat{I}_{p^2}-\mat{P}$ is then a projection of rank $p^2- d$. This leads to
To show that $\manifold{CK}_{\mat{\Omega}}$ is an embedded submanifold of $\manifold{K}_{\mat{\Omega}}$ we apply the ``Constant-Rank Level Set Theorem'' \textcite[Thm~5.12]{introToSmoothMani-Lee2012} which states (slightly adapted) the following;
Let $\manifold{A}$, $\manifold{B}$ be smooth manifolds and $F:\manifold{A}\to\manifold{B}$ a smooth map such that $\nabla_{\mat{a}} F$ has constant matrix rank for all $\mat{a}\in\manifold{A}$. Then, for every $\mat{b}\in F(\manifold{A})\subseteq\manifold{B}$, the preimage $F^{-1}(\{\mat{b}\})$ is a smooth embedded submanifold of $\manifold{A}$.
In our setting, we have $F:\manifold{K}_{\mat{\Omega}}\to\mathbb{R}^{p^2}$ defined as $F(\mat{\Omega})=(\mat{I}_{p^2}-\mat{P})\vec{\mat{\Omega}}$ with gradient $\nabla_{\mat{\Omega}} F =\mat{I}_{p^2}-\mat{P}$ of constant rank. Therefore, $F^{-1}(\{\mat{0}\})=\manifold{CK}_{\mat{\Omega}}$ is an embedded submanifold of $\manifold{K}_{\mat{\Omega}}$.
Finally, the finite product manifold of embedded submanifolds is embedded in the finite product space of their ambient spaces, that is $\Theta=\mathbb{R}^p \times\manifold{K}_{\mat{B}}\times\manifold{CK}_{\mat{\Omega}}\subset\mathbb{R}^p\times\mathbb{R}^{p\times q}\times\mathbb{R}^{p\times p}$ is embedded.
\end{proof}
\begin{proof}[Proof of \cref{thm:exists-strong-M-estimator-on-subsets}]
Let $\hat{\mat{\xi}}_n$ be a (weak/strong) M-estimator for the unconstrained problem. This gives by definition, in any case, that
Because $\emptyset\neq\Theta\subseteq\Xi$, we have $\sup_{\mat{\theta}\in\Theta} M_n(\mat{\theta})\leq\sup_{\mat{\xi}\in\Xi} M_n(\mat{\xi})$ and, with $M_n(\mat{\xi}) < \infty$ for any $\mat{\xi}\in\Xi$,
If $\sup_{\mat{\theta}\in\Theta} M_n(\mat{\theta}) < \infty$, then for any $\epsilon_n > 0$ there exists $\hat{\mat{\theta}}_n\in\Theta$ such that $\sup_{\mat{\theta}\in\Theta} M_n(\mat{\theta})-\epsilon_n \leq M_n(\hat{\mat{\theta}}_n)$. Therefore, we can choose $\epsilon_n\in o(n^{-1})$, which yields
which is the definition of $\hat{\mat{\theta}}_n$ being a strong M-estimator over $\Theta$.
\end{proof}
\begin{proof}[Proof of \cref{thm:M-estimator-consistency-on-subsets}]
The proof follows that of \textcite[Proposition~2.4]{asymptoticMLE-BuraEtAl2018} under the same assumptions, with the only exception that we merely require $\Theta$ to be a subset of $\Xi$. This is accounted for by replacing Lemma~2.3 in \textcite{asymptoticMLE-BuraEtAl2018} with \cref{thm:exists-strong-M-estimator-on-subsets} to obtain the existence of a strong M-estimator on $\Theta$.
\end{proof}
\begin{proof}[Proof of \cref{thm:M-estimator-asym-normal-on-manifolds}]
Let $\varphi:U\to\varphi(U)$ be a coordinate chart\footnote{By \cref{def:manifold}, the chart $\varphi : U\to\varphi(U)$ is bi-continuous, is infinitely often continuously differentiable, and has a continuously differentiable inverse $\varphi^{-1} : \varphi(U)\to U$. Furthermore, the domain $U$ is open according to the trace topology on $\Theta$, that means there exists an open set $O\subseteq\mathbb{R}^p$ such that $U =\Theta\cap O$.} with $\mat{\theta}_0\in U\subseteq\Theta$. As $\varphi$ is continuous, we get with the continuous mapping theorem on metric spaces \textcite[Thm~18.11]{asymStats-van_der_Vaart1998} that $\varphi(\hat{\mat{\theta}}_n)\xrightarrow{p}\varphi(\mat{\theta}_0)$, which implies $P(\varphi(\hat{\mat{\theta}}_n)\in\varphi(U))\xrightarrow{n\to\infty}1$.
The next step is to apply \textcite[Thm~5.23]{asymStats-van_der_Vaart1998} to $\hat{\mat{s}}_n =\varphi(\hat{\mat{\theta}}_n)$. Therefore, assume that $\hat{\mat{s}}_n\in\varphi(U)$. Denote with $\mat{s}=\varphi(\mat{\theta})\in\varphi(U)\subseteq\mathbb{R}^d$ the coordinates of the parameter $\mat{\theta}\in U\subseteq\Theta$ of the $d =\dim(\Theta)$ dimensional manifold $\Theta\subseteq\mathbb{R}^p$. With $\varphi : U\to\varphi(U)$ being bijective, we can express $m_{\mat{\theta}}$ in terms of $\mat{s}=\varphi(\mat{\theta})$ for every $\mat{\theta}\in U$ as $m_{\mat{\theta}}= m_{\varphi^{-1}(\mat{s})}$. Furthermore, denote
\caption{\label{fig:proof:M-estimator-asym-normal-on-manifolds}Depiction of the notation used in the proof of \cref{thm:M-estimator-asym-normal-on-manifolds}. Example with $p =3$ and $d =\dim(\Theta)=2$.}
\end{figure}
By assumption, the function $M(\mat{\theta})$ is twice continuously differentiable in a neighborhood\footnote{A set $N$ is called a neighborhood of $u$ if there exists an open set $O$ such that $u\in O\subseteq N$.} of $\mat{\theta}_0$. W.l.o.g. we can assume that $U$ is contained in that neighborhood. Then, using the chain rule, we get the gradient of $M_{\varphi}$ at $\mat{s}_0$ to be $\mat{0}$ by
because $\mat{\theta}_0=\varphi^{-1}(\mat{s}_0)$ is a maximizer of $M$. For the second-derivative, evaluated at $\mat{s}_0=\varphi(\mat{\theta}_0)$, we have
We also need to check the local Lipschitz condition of $m_{\varphi^{-1}(\mat{s})}$. Therefore, let $V_{\epsilon}(\mat{s}_0)=\{\mat{s}\in\mathbb{R}^d : \|\mat{s}-\mat{s}_0\| < \epsilon\}$ be the open $\epsilon$-ball with center $\mat{s}_0$. Since $\varphi(U)$ contains $\mat{s}_0$ and is open in $\mathbb{R}^d$, there exists an $\epsilon > 0$ such that $V_{\epsilon}(\mat{s}_0)\subseteq\varphi(U)$. Then, the closed $\epsilon/2$ ball $\overline{V}_{\epsilon/2}(\mat{s}_0)$ is a neighborhood of $\mat{s}_0$ and the supremum $\sup_{\mat{s}\in\overline{V}_{\epsilon/2}(\mat{s}_0)}\|\nabla\varphi^{-1}(\mat{s})\| < \infty$ due to the continuity of $\nabla\varphi^{-1}$ on $\varphi(U)$ with $\overline{V}_{\epsilon/2}(\mat{s}_0)\subset V_{\epsilon}(\mat{s}_0)\subseteq\varphi(U)$. Then, for almost every $z$ and every $\mat{s}_1=\varphi(\mat{\theta}_1), \mat{s}_2=\varphi(\mat{\theta}_2)\in\overline{V}_{\epsilon/2}(\mat{s}_0)$, it holds that
Here, $(a)$ holds by assumption and $(b)$ is a result of the mean value theorem. Now, $v(z)$ is measurable and square integrable as a scaled version of $u(z)$. Finally, with $\varphi$ being one-to-one, we get that $\hat{\mat{s}}_n =\varphi(\hat{\mat{\theta}}_n)$ is a strong M-estimator for $\mat{s}_0=\varphi(\mat{\theta}_0)$ of the objective $M_{\varphi}$. We then apply \textcite[Thm~5.23]{asymStats-van_der_Vaart1998} to obtain the asymptotic normality of $\hat{\mat{s}}_n$.
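A sketch of the resulting limit, in the sandwich form of that theorem and abbreviating $\mat{V}_{\mat{s}_0}=\nabla^2 M_{\varphi}(\mat{s}_0)$ (a shorthand used only here), is
\begin{displaymath}
    \sqrt{n}\,(\hat{\mat{s}}_n -\mat{s}_0)\xrightarrow{d}\mathcal{N}_d\Bigl(\mat{0},\; \mat{V}_{\mat{s}_0}^{-1}\,\E\Bigl[\nabla_{\mat{s}}\, m_{\varphi^{-1}(\mat{s})}(Z)\big|_{\mat{s}_0}\,\t{\bigl(\nabla_{\mat{s}}\, m_{\varphi^{-1}(\mat{s})}(Z)\big|_{\mat{s}_0}\bigr)}\Bigr]\,\mat{V}_{\mat{s}_0}^{-1}\Bigr).
\end{displaymath}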
We continue by re-expressing the $p\times p$ asymptotic variance-covariance matrix of $\hat{\mat{\theta}}_n$ in terms of $\mat{\theta}_0$ instead of $\mat{s}_0=\varphi(\mat{\theta}_0)$. To this end, let $\PP=\t{\nabla\varphi^{-1}(\varphi(\mat{\theta}_0))}=\t{\nabla\varphi^{-1}(\mat{s}_0)}$ and observe that for all $\mat{s}\in\varphi(U)$, the gradient of $\mat{s}\mapsto m_{\varphi^{-1}(\mat{s})}(z)$ evaluated at $\mat{s}_0=\varphi(\mat{\theta}_0)$ has the form
where the last equality holds because $\Span\PP= T_{\mat{\theta}_0}\Theta$ by \cref{def:tangent-space} of the tangent space $T_{\mat{\theta}_0}\Theta$.
It remains to show that $\mat{\Pi}_{\mat{\theta}_0}=\mat{P}_{\mat{\theta}_0}\pinv{(\t{\mat{P}_{\mat{\theta}_0}}\mat{H}_{\mat{\theta}_0}\mat{P}_{\mat{\theta}_0})}\t{\mat{P}_{\mat{\theta}_0}}$ for any $p\times k$ matrix $\mat{P}_{\mat{\theta}_0}$ with $k\geq d$ and $\Span{\mat{P}_{\mat{\theta}_0}}= T_{\mat{\theta}_0}\Theta$. This also ensures that the final result is independent of the chosen chart $\varphi$, since the tangent space does not depend on a specific chart. To this end, let $\PP={\mat{Q}}{\mat{R}}$ and $\mat{P}_{\mat{\theta}_0}=\widetilde{\mat{Q}}\widetilde{\mat{R}}$ be their thin QR decompositions, respectively. Both $\mat{Q}$ and $\widetilde{\mat{Q}}$ are semi-orthogonal of dimension $p\times d$, $\mat{R}$ is an invertible $d\times d$ matrix, and $\widetilde{\mat{R}}$ is a $d\times k$ matrix of full row-rank. Since $\mat{Q}$ is semi-orthogonal, the $p\times p$ matrix $\mat{Q}\t{\mat{Q}}$ is the orthogonal projection onto $\Span\mat{Q}=\Span\mat{P}_{\mat{\theta}_0}= T_{\mat{\theta}_0}\Theta$. This allows us to express $\mat{P}_{\mat{\theta}_0}$ in terms of $\mat{Q}$ as
From $\Span\mat{Q}=\Span\mat{P}_{\mat{\theta}_0}$ it follows that the $d\times k$ matrix $\mat{M}$ also has full row-rank. As $\mat{M}$ has full row-rank, a property of the Moore-Penrose pseudo-inverse gives $\mat{M}\pinv{\mat{M}}=\mat{I}_d =\mat{R}\mat{R}^{-1}$. Another property of the pseudo-inverse is that $\pinv{(\mat{A}\mat{B})}=\pinv{\mat{B}}\pinv{\mat{A}}$ for matrices $\mat{A}$ of full column-rank and $\mat{B}$ of full row-rank. This enables the computation showing that $\mat{P}_{\mat{\theta}_0}\pinv{(\t{\mat{P}_{\mat{\theta}_0}}\mat{H}_{\mat{\theta}_0}\mat{P}_{\mat{\theta}_0})}\t{\mat{P}_{\mat{\theta}_0}}$ coincides with the chart-based expression $\PP\pinv{(\t{\PP}\mat{H}_{\mat{\theta}_0}\PP)}\t{\PP}$.
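A condensed sketch of this computation, abbreviating $\mat{H}=\mat{H}_{\mat{\theta}_0}$ and writing $\mat{M}=\t{\mat{Q}}\mat{P}_{\mat{\theta}_0}$ so that $\mat{P}_{\mat{\theta}_0}=\mat{Q}\mat{M}$ as above, is
\begin{align*}
    \mat{P}_{\mat{\theta}_0}\pinv{(\t{\mat{P}_{\mat{\theta}_0}}\mat{H}\mat{P}_{\mat{\theta}_0})}\t{\mat{P}_{\mat{\theta}_0}}
        &= \mat{Q}\mat{M}\pinv{\bigl(\t{\mat{M}}(\t{\mat{Q}}\mat{H}\mat{Q})\mat{M}\bigr)}\t{\mat{M}}\t{\mat{Q}} \\
        &= \mat{Q}\mat{M}\pinv{\mat{M}}(\t{\mat{Q}}\mat{H}\mat{Q})^{-1}\pinv{(\t{\mat{M}})}\t{\mat{M}}\t{\mat{Q}} \\
        &= \mat{Q}(\t{\mat{Q}}\mat{H}\mat{Q})^{-1}\t{\mat{Q}}
         = \PP\pinv{(\t{\PP}\mat{H}\PP)}\t{\PP},
\end{align*}
where the second equality applies $\pinv{(\mat{A}\mat{B})}=\pinv{\mat{B}}\pinv{\mat{A}}$ twice, the third uses $\mat{M}\pinv{\mat{M}}=\pinv{(\t{\mat{M}})}\t{\mat{M}}=\mat{I}_d$, and the last equality repeats the same steps for $\PP=\mat{Q}\mat{R}$ with the invertible factor $\mat{R}$ in place of $\mat{M}$. Here $\t{\mat{Q}}\mat{H}\mat{Q}$ is invertible because $\t{\PP}\mat{H}\PP=\t{\mat{R}}(\t{\mat{Q}}\mat{H}\mat{Q})\mat{R}$ is the non-singular Hessian of $M_{\varphi}$ at $\mat{s}_0$.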
In the following we rewrite the log-likelihood \eqref{eq:log-likelihood} in a simpler form. This simplifies the proof of \cref{thm:asymptotic-normality-gmlm} and provides the notation to express the regularity conditions of \cref{thm:asymptotic-normality-gmlm} in a compact form.
Rewriting the first natural parameter component $\mat{\eta}_{1y}$ defined in \eqref{eq:eta1-manifold} gives
describing the linear relation between $\mat{\eta}_2$ and $\vech{\mat{\Omega}}$. This yields the relation between $\mat{\eta}_y =(\mat{\eta}_{1y}, \mat{\eta}_2)$ and $\mat{\xi}=(\vec{\overline{\ten{\eta}}}, \vec{\mat{B}}, \vech{\mat{\Omega}})\in\Xi$ as
where $\mat{F}(y)$ is a $(p + d)\times p (p +2 q +3)/2$ dimensional matrix-valued function of $y$. Moreover, for every $y$ the matrix $\mat{F}(y)$ has full rank $p + d$.
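For completeness, the column dimension of $\mat{F}(y)$ is simply the number of elements of $\mat{\xi}$; assuming $\vec{\overline{\ten{\eta}}}\in\mathbb{R}^{p}$, $\mat{B}\in\mathbb{R}^{p\times q}$ and a symmetric $p\times p$ matrix $\mat{\Omega}$ (so that $\vech{\mat{\Omega}}$ has $p(p+1)/2$ entries), we get
\begin{displaymath}
    p + p q + \frac{p(p+1)}{2} = \frac{p^2 + 2 p q + 3 p}{2} = \frac{p(p + 2 q + 3)}{2}.
\end{displaymath}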
The log-likelihood of model \eqref{eq:quad-density} for the unconstrained parameters $\mat{\xi}\in\Xi$ is
\begin{condition}\label{cond:differentiable-and-convex}
The mapping $\mat{\xi}\mapsto m_{\mat{\xi}}(z)$ is twice continuously differentiable for almost every $z$ and $z\mapsto m_{\mat{\xi}}(z)$ is measurable. Moreover, $\mat{\eta}\mapsto b(\mat{\eta})$ is strictly convex.
\end{condition}
\begin{condition}\label{cond:moments}
It holds that $\E\|\t{\mat{t}(\ten{X})}\mat{F}(Y)\| < \infty$ and $\E\|\t{\mat{t}(\ten{X})}\mat{F}(Y)\|^2 < \infty$.
Let $\manifold{A}_k\subseteq\mathbb{R}^{p_k\times q_k}\backslash\{\mat{0}\}$, $k =1, \ldots, r$, be smooth embedded submanifolds, each of which is either a sphere or a cone. Then
and let $\mat{\gamma}_j$ be $p_j q_j\times d_j$ matrices with $d_j \geq\dim\manifold{A}_j$ which span the tangent space $T_{\mat{A}_j}\manifold{A}_j$ of $\manifold{A}_j$ at $\mat{A}_j\in\manifold{A}_j$, that is $\Span\mat{\gamma}_j = T_{\mat{A}_j}\manifold{A}_j$.
Then, with the permutation matrix $\mat{S}_{\mat{p}, \mat{q}}$ defined in \eqref{eq:S_pq}, the $p q \times\sum_{j =1}^{r} d_j$ dimensional matrix
spans the tangent space $T_{\mat{A}}\manifold{K}$ of $\manifold{K}$ at $\mat{A}=\bigkron_{k = r}^{1}\mat{A}_k\in\manifold{K}$, that is, $\Span\mat{P}_{\mat{A}}= T_{\mat{A}}\manifold{K}$.
\end{lemma}
\begin{proof}
The statement that $\manifold{K}$ is an embedded manifold follows via induction using \cref{thm:kron-manifolds}.
We compute the differential of the vectorized Kronecker product using \cref{thm:kron-perm} where $\mat{S}_{\mat{p}, \mat{q}}$ is the permutation \eqref{eq:S_pq} defined therein.
Due to the definition of the manifold, this differential provides the gradient of a surjective map onto the manifold, and the span of this gradient spans the tangent space.
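For illustration, in the case $r = 2$ the product rule gives
\begin{displaymath}
    \d(\mat{A}_2\otimes\mat{A}_1) = (\d\mat{A}_2)\otimes\mat{A}_1 + \mat{A}_2\otimes(\d\mat{A}_1),
\end{displaymath}
and vectorizing the two summands and rearranging them with the permutation $\mat{S}_{\mat{p}, \mat{q}}$ yields one column block of $\mat{P}_{\mat{A}}$ per factor; the general case $r > 2$ is analogous, with one summand, and hence one block, per $\mat{A}_j$.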
Now, we take a closer look at the differentials $\vec{\d\mat{A}_j}$ for $j =1, \ldots, r$. Let $\varphi_j$ be a chart of $\manifold{A}_j$ in a neighborhood of $\mat{A}_j$. Then, $\mat{A}_j =\varphi_j^{-1}(\varphi_j(\mat{A}_j))$, and differentiating this identity expresses $\vec{\d\mat{A}_j}$ through the chart $\varphi_j$.
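As a sketch, with the same convention for $\nabla$ as before,
\begin{displaymath}
    \vec{\d\mat{A}_j}
        = \vec{\d\bigl(\varphi_j^{-1}(\varphi_j(\mat{A}_j))\bigr)}
        = \t{\nabla\varphi_j^{-1}(\varphi_j(\mat{A}_j))}\,\d\varphi_j(\mat{A}_j),
\end{displaymath}
where $\d\varphi_j(\mat{A}_j)$ collects the differentials of the local coordinates of $\mat{A}_j$.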
Therefore, by \cref{def:tangent-space} of the tangent space, every matrix $\mat{\gamma}_j$ with $\Span{\mat{\gamma}_j}= T_{\mat{A}_j}\manifold{A}_j$ satisfies $\Span{\t{\nabla\varphi_j^{-1}(\varphi_j(\mat{A}_j))}}=\Span{\mat{\gamma}_j}$. Substituting this into the differential of the vectorized Kronecker product yields $\Span\mat{P}_{\mat{A}}= T_{\mat{A}}\manifold{K}$.
\begin{proof}[Proof of \cref{thm:asymptotic-normality-gmlm}]
The proof consists of three parts. First, we show the existence of a consistent strong M-estimator by applying \cref{thm:M-estimator-consistency-on-subsets}. Next, we apply \cref{thm:M-estimator-asym-normal-on-manifolds} to obtain its asymptotic normality. We conclude by computing the missing parts of the asymptotic covariance matrix $\mat{\Sigma}_{\mat{\theta}_0}$ provided by \cref{thm:M-estimator-asym-normal-on-manifolds}.
We check whether the conditions of \cref{thm:M-estimator-consistency-on-subsets} are satisfied. On $\Xi$, the mapping $\mat{\xi}\mapsto m_{\mat{\xi}}(z)= m_{\mat{\xi}}(\ten{X},y)=\langle\mat{F}(y)\mat{\xi}, \mat{t}(\ten{X})\rangle- b(\mat{F}(y)\mat{\xi})$ is strictly concave for every $z$ because $\mat{\xi}\mapsto\mat{F}(y)\mat{\xi}$ is linear and $b$ is strictly convex by \cref{cond:differentiable-and-convex}. Since $\ten{X}\mid Y$ is distributed according to \eqref{eq:quadratic-exp-fam}, the function $M(\mat{\xi})=\E m_{\mat{\xi}}(Z)$ is well defined by \cref{cond:moments}. Let $\mat{\xi}_k =(\vec{\overline{\ten{\eta}}_k}, \vec{\mat{B}_k}, \vech{\mat{\Omega}_k})$, and let $f_{\mat{\xi}_k}$ be the pdf of $\ten{X}\mid Y$ indexed by $\mat{\xi}_k$, for $k =1, 2$. If $\mat{\xi}_1\ne\mat{\xi}_2$, then $f_{\mat{\xi}_1}\neq f_{\mat{\xi}_2}$, which, by \textcite[Lemma~5.35]{asymStats-van_der_Vaart1998}, implies that the true parameter $\mat{\theta}_0\in\Theta\subseteq\Xi$ is the unique maximizer of $M$. Finally, under \cref{cond:finite-sup-on-compacta}, all assumptions of \cref{thm:M-estimator-consistency-on-subsets} are fulfilled, yielding the existence of a consistent strong M-estimator over $\Theta\subseteq\Xi$.
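A sketch of the underlying Kullback--Leibler argument, with $f_{\mat{\xi}}$ the conditional density of $\ten{X}\mid Y$ as above and noting that $m_{\mat{\xi}}$ and $\log f_{\mat{\xi}}$ differ only by terms free of $\mat{\xi}$, is
\begin{displaymath}
    M(\mat{\theta}_0) - M(\mat{\xi})
        = \E\Bigl[\log\frac{f_{\mat{\theta}_0}(\ten{X}\mid Y)}{f_{\mat{\xi}}(\ten{X}\mid Y)}\Bigr]
        = \E\biggl[\E\Bigl[\log\frac{f_{\mat{\theta}_0}(\ten{X}\mid Y)}{f_{\mat{\xi}}(\ten{X}\mid Y)}\Bigm| Y\Bigr]\biggr]
        \geq 0,
\end{displaymath}
where the inner conditional expectation is a Kullback--Leibler divergence and hence non-negative, with equality if and only if $f_{\mat{\xi}}= f_{\mat{\theta}_0}$ almost surely.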
Next, let $\hat{\mat{\theta}}_n$ be a strong M-estimator on $\Theta\subseteq\Xi$, whose existence and consistency was shown in the previous step. Since $z\mapsto m_{\mat{\xi}}(z)$ is measurable for all $\mat{\xi}\in\Xi$, it is also measurable in a neighborhood of $\mat{\theta}_0$. The differentiability of $\mat{\theta}\mapsto m_{\mat{\theta}}(z)$ is stated in \cref{cond:differentiable-and-convex}. For the Lipschitz condition, let $K\subseteq\Xi$ be a compact neighborhood of $\mat{\theta}_0$, which exists since $\Xi$ is open. Then,
where the measurability and square integrability of $u(z)$ follow from \cref{cond:finite-sup-on-compacta}. The existence of a second-order Taylor expansion of $\mat{\theta}\mapsto M(\mat{\theta})=\E m_{\mat{\theta}}(Z)$ in a neighborhood of $\mat{\theta}_0$ also holds by \cref{cond:finite-sup-on-compacta}. Moreover, the Hessian $\mat{H}_{\mat{\theta}_0}$ is non-singular by the strict convexity of $b$ stated in \cref{cond:differentiable-and-convex}. Now, we can apply \cref{thm:M-estimator-asym-normal-on-manifolds} to obtain the asymptotic normality of $\sqrt{n}(\hat{\mat{\theta}}_n -\mat{\theta}_0)$ with variance-covariance structure
where $\mat{\Pi}_{\mat{\theta}_0}=\mat{P}_{\mat{\theta}_0}\pinv{(\t{\mat{P}_{\mat{\theta}_0}}\mat{H}_{\mat{\theta}_0}\mat{P}_{\mat{\theta}_0})}\t{\mat{P}_{\mat{\theta}_0}}$ and $\mat{P}_{\mat{\theta}_0}$ is any $p\times k$ matrix, $k\geq\dim(\Theta)$, that spans the tangent space of $\Theta$ at $\mat{\theta}_0$. That is, $\Span\mat{P}_{\mat{\theta}_0}= T_{\mat{\theta}_0}\Theta$.
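Written out (a sketch consistent with \cref{thm:M-estimator-asym-normal-on-manifolds}), this variance-covariance structure takes the sandwich form
\begin{displaymath}
    \mat{\Sigma}_{\mat{\theta}_0} = \mat{\Pi}_{\mat{\theta}_0}\,\E\bigl[\nabla m_{\mat{\theta}_0}(Z)\,\t{(\nabla m_{\mat{\theta}_0}(Z))}\bigr]\,\mat{\Pi}_{\mat{\theta}_0}.
\end{displaymath}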
Finally, we compute a matrix $\mat{P}_{\mat{\theta}_0}$ such that $\Span{\mat{P}_{\mat{\theta}_0}}= T_{\mat{\theta}_0}\Theta$ for $\Theta=\mathbb{R}^p\times\manifold{K}_{\mat{B}}\times\manifold{CK}_{\mat{\Omega}}$ as in \cref{thm:param-manifold}. Since the manifold $\Theta$ is a product manifold we get a block diagonal structure for $\mat{P}_{\mat{\theta}_0}$ as
\begin{displaymath}
\mat{P}_{\mat{\theta}_0} = \begin{pmatrix}
\mat{I}_p & 0 & 0 \\
0 &\mat{P}_{\mat{B}_0}& 0 \\
0 & 0 &\mat{P}_{\mat{\Omega}_0}
\end{pmatrix}
\end{displaymath}
where $\mat{I}_p$ is the identity matrix spanning the tangent space of $\mathbb{R}^p$, which is identified with $\mathbb{R}^p$ itself. The blocks $\mat{P}_{\mat{B}_0}$ and $\mat{P}_{\mat{\Omega}_0}$ need to span the tangent spaces of $\manifold{K}_{\mat{B}}$ and $\manifold{CK}_{\mat{\Omega}}$, respectively. Both $\manifold{K}_{\mat{B}}$ and $\manifold{CK}_{\mat{\Omega}}$ are manifolds according to \cref{thm:kron-manifolds} under the cone condition. The constraint manifold $\manifold{CK}_{\mat{\Omega}}$ is the intersection of $\manifold{K}_{\mat{\Omega}}$ with the span of the projection $\pinv{(\mat{T}_2\pinv{\mat{D}_p})}\mat{T}_2\pinv{\mat{D}_p}$, meaning that the differential $\vec{\d\mat{\Omega}}$ on $\manifold{CK}_{\mat{\Omega}}$ fulfills $\vec{\d\mat{\Omega}}=\pinv{(\mat{T}_2\pinv{\mat{D}_p})}\mat{T}_2\pinv{\mat{D}_p}\vec{\d\mat{\Omega}}$. Now, we can apply \cref{thm:kron-manifold-tangent-space} to $\manifold{K}_{\mat{B}}$ and $\manifold{K}_{\mat{\Omega}}$, which gives
where the matrices $\mat{S}_{\mat{p}, \mat{q}}$, $\mat{\Gamma}_{\mat{\beta}_j}$, $\mat{\gamma}_{\mat{\beta}_j}$, $\mat{\Gamma}_{\mat{\Omega}_j}$ and $\mat{\gamma}_{\mat{\Omega}_j}$ are described in \cref{thm:kron-manifold-tangent-space} for the Kronecker manifolds $\manifold{K}_{\mat{B}}$ and $\manifold{K}_{\mat{\Omega}}$. This leads to