







\newcommand{\DiagZeroMat}[1]{\mathbb{R}_{\diag=0}^{{#1}\times {#1}}}





\newcommand{\SymMat}[1]{\mathrm{Sym}^{{#1}\times {#1}}}





\newcommand{\SkewSymMat}[1]{\mathrm{Skew}^{{#1}\times {#1}}}





% \newcommand{\SymPosDefMat}[1]{\mathrm{Sym}_{++}^{{#1}\times {#1}}}





\newcommand{\SymPosDefMat}[1]{\mathrm{SPD}(#1)}










\newcommand{\SymDiagZeroMat}[1]{\mathrm{Sym}_{\diag=0}^{{#1}\times {#1}}}





\newcommand{\SymBand}[2]{\mathrm{SymBand}^{{#1}\times {#1}}_{#2}}





\newcommand{\UnitaryGrp}[1]{\mathrm{U}(#1)}





\newcommand{\SpecialUnitaryGrp}[1]{\mathrm{SU}(#1)}





% \newcommand{\AR}[1]{\mathrm{AR}(#1)}















\newcommand{\todo}[1]{{\color{red}TODO: #1}}





\newcommand{\efi}[1]{{\color{blue}Effie: #1}}









\end{align}





where $\mat{B} = \bigotimes_{j = r}^{1}\mat{\beta}_j$ and the component matrices $\mat{\beta}_j\in\mathbb{R}^{p_j\times q_j}$ are of known rank for $j = 1, \ldots, r$. Since $p$ is potentially large, we further assume that





\begin{align}





\t{(\mat{T}_2\pinv{\mat{D}_p})}\mat{\eta}_{2y}= \t{(\mat{T}_2\pinv{\mat{D}_p})}\mat{\eta}_{2} &= \vec\biggl(c\bigotimes_{j = r}^{1}\mat{\Omega}_j\biggr). \label{eq:eta2}





\end{align}





where $\mat{\Omega}_j\in\mathbb{R}^{p_j\times p_j}$ are symmetric positive definite matrices for $j = 1, \ldots, r$. That is, we require $\mat{\Omega} = \bigotimes_{j = r}^{1}\mat{\Omega}_j$, which substantially reduces the number of parameters to estimate in $\mat{\Omega}$. The assumption that the $\mat{\Omega}_j$'s are positive definite is made possible by the constant $c$.





Equation \eqref{eq:eta2} is underdetermined since $\t{(\mat{T}_2\pinv{\mat{D}_p})}$ has full column rank $d < p^2$ (with a non-strict inequality if $\ten{X}$ is univariate), but $\mat{\eta}_2$ is uniquely determined given any $\mat{\Omega}$, as $\t{(\pinv{(\mat{T}_2\pinv{\mat{D}_p})})}$ has full row rank. We let $\mat{\xi} = (\vec{\overline{\ten{\eta}}}, \vec{\mat{B}}, \vech{\mat{\Omega}})$ be the $p(p + 2 q + 3) / 2$-dimensional unconstrained parameter vector and $\mat{\theta} = (\vec{\overline{\ten{\eta}}}, \vec{\mat{B}}, \vech{\mat{\Omega}})$ be the constrained parameter vector, where $\mat{B}=\bigotimes_{j = r}^{1}\mat{\beta}_j$ and $\mat{\Omega} = \bigotimes_{j = r}^{1}\mat{\Omega}_j$. We also let $\Xi$ and $\Theta$ denote the unconstrained and constrained parameter spaces, with $\mat{\xi}$ and $\mat{\theta}$ varying in $\Xi$ and $\Theta$, respectively. The parameter space $\Xi$ is an open subset of $\mathbb{R}^{p(p + 2 q + 3) / 2}$ so that \eqref{eq:quadraticexpfam} is a proper density. Later, we relax the assumptions for $\mat{\beta}_k$ and $\mat{\Omega}_k$ as a consequence of \cref{thm:parammanifold} in \cref{sec:kronmanifolds}.










% \todo{Maybe already here introduce the ``constraint'' set of $\Omega$'s allowed as $\{ \Omega\in\mathbb{R}_{++}^{p\times p} : \vec{\Omega} = \t{(\pinv{(\mat{T}_2\pinv{\mat{D}_p})}\mat{T}_2\pinv{\mat{D}_p})}\vec{\Omega} \}$}










In a classical \emph{generalized linear model} (GLM), the link function connecting the natural parameters to the expectation of the sufficient statistic, $\mat{\eta}_y = \mat{g}(\E[\mat{t}(\ten{X}) \mid Y = y])$, is invertible. Such a link may not exist in our setting, but for our purpose what we call the ``inverse'' link suffices. The ``inverse'' link $\widetilde{\mat{g}}$ exists because the natural parameters fully describe the distribution. As in the non-minimal formulation \eqref{eq:quadraticexpfam}, we define the ``inverse'' link through its tensor-valued components





\begin{align}









If $\mat{T}_2$ is the identity matrix $\mat{I}_{p(p + 1) / 2}$, then $\ten{G}_2(\mat{\eta}_y) = \ten{g}_2(\mat{\eta}_y)$.





\end{theorem}










Although the general case of any GMLM model can be fitted via gradient descent using \cref{thm:grad}, this may be very inefficient. In \cref{thm:grad}, $\mat{T}_2$ can be used to introduce flexible second moment structures. For example, it allows modeling effects differently for predictor components, as described in \cref{sec:ising_estimation} after Eqn. \eqref{eq:isingcondprob}. In the remainder, we focus on $\mat{T}_2$'s that are identity matrices. This simplifies the estimation algorithm and speeds up the numerical calculations in the case of tensor normal predictors. For the tensor normal distribution, an iterative cyclic updating scheme is derived in \cref{sec:tensor_normal_estimation}; it converges much faster, is stable, and does not require hyperparameters, as will be discussed later. The Ising model, on the other hand, does not admit such a scheme, and we resort to a gradient-based method, which is the subject of \cref{sec:ising_estimation}.










%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





\subsection{Tensor Normal}\label{sec:tensor_normal_estimation}




Suppose $\ten{X}\mid Y = y$ follows a tensor normal distribution with mean $\ten{\mu}_y$ and covariance matrices $\mat{\Sigma}_1, \ldots, \mat{\Sigma}_r$. Its density is





\begin{displaymath}





f_{\mat{\theta}}(\ten{X}\mid Y = y) = (2\pi)^{-p / 2}\prod_{k = 1}^{r}\det(\mat{\Sigma}_k)^{-p / (2 p_k)}\exp\left( -\frac{1}{2}\left\langle\ten{X} - \ten{\mu}_y, (\ten{X} - \ten{\mu}_y)\mlm_{k = 1}^{r}\mat{\Sigma}_k^{-1} \right\rangle \right).





\end{displaymath}
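As a numerical sanity check (not part of the derivation), the Kronecker-structured density can be compared with the ordinary multivariate normal density of $\vec{\ten{X}}$ for $r = 2$. The following NumPy sketch is ours, not code from the paper; it assumes column-major vectorization, so that the joint covariance is $\mat{\Sigma}_2\otimes\mat{\Sigma}_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2 = 2, 3
p = p1 * p2

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

S1, S2 = random_spd(p1), random_spd(p2)
X = rng.standard_normal((p1, p2))
M = np.zeros((p1, p2))  # conditional mean mu_y (zero for simplicity)

# Kronecker-structured (matrix normal, r = 2) density as in the text:
# f(X) = (2 pi)^(-p/2) det(S1)^(-p/(2 p1)) det(S2)^(-p/(2 p2))
#        * exp(-1/2 <X - M, S1^(-1) (X - M) S2^(-1)>)
quad = np.sum((X - M) * (np.linalg.inv(S1) @ (X - M) @ np.linalg.inv(S2)))
f_tensor = ((2 * np.pi) ** (-p / 2)
            * np.linalg.det(S1) ** (-p / (2 * p1))
            * np.linalg.det(S2) ** (-p / (2 * p2))
            * np.exp(-0.5 * quad))

# Equivalent multivariate normal of vec(X) with Sigma = kron(S2, S1)
Sigma = np.kron(S2, S1)
x = (X - M).ravel(order="F")  # column-major vectorization
f_vec = ((2 * np.pi) ** (-p / 2)
         * np.linalg.det(Sigma) ** (-0.5)
         * np.exp(-0.5 * x @ np.linalg.solve(Sigma, x)))

assert np.isclose(f_tensor, f_vec)
```

The two evaluations agree because $\det(\mat{\Sigma}_2\otimes\mat{\Sigma}_1) = \det(\mat{\Sigma}_1)^{p_2}\det(\mat{\Sigma}_2)^{p_1}$ and the quadratic forms coincide under vectorization.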





For the sake of simplicity and w.l.o.g., we assume $\ten{X}$ has zero marginal expectation, i.e., $\E\ten{X} = 0$. Rewriting this density in the quadratic exponential family form \eqref{eq:quadraticexpfam} determines the scaling constant $c = -1/2$. The relation to the GMLM parameters $\overline{\ten{\eta}}, \mat{\beta}_k$ and $\mat{\Omega}_k$, for $k = 1, \ldots, r$, is
\begin{equation}\label{eq:tnormal_cond_params}
    \ten{\mu}_y = \ten{F}_y\mlm_{k = 1}^{r}\mat{\Omega}_k^{-1}\mat{\beta}_k, \qquad
    \mat{\Omega}_k = \mat{\Sigma}_k^{-1},
\end{equation}
where we used that $\overline{\ten{\eta}} = 0$ due to $0 = \E\ten{X} = \E\E[\ten{X}\mid Y] = \E\ten{\mu}_Y$ in combination with $\E\ten{F}_Y = 0$. Additionally, all the $\mat{\Omega}_k$'s are symmetric positive definite, because the $\mat{\Sigma}_k$'s are. This leads to another simplification, since then $\mat{T}_2$ in \eqref{eq:tstat} equals the identity. It also means that the gradients of the log-likelihood $l_n$ in \cref{thm:grad} are simpler. We obtain





\begin{displaymath}





\ten{g}_1(\mat{\eta}_y) = \E[\ten{X}\mid Y = y] = \ten{\mu}_y, \qquad





\ten{G}_2(\mat{\eta}_y) = \ten{g}_2(\mat{\eta}_y) = \E[\ten{X}\circ\ten{X}\mid Y = y] \equiv \bigkron_{k = r}^1\mat{\Sigma}_k + (\vec{\ten{\mu}}_y)\t{(\vec{\ten{\mu}}_y)}.





\end{displaymath}





In practice, we assume we have a random sample of $n$ observations $(\ten{X}_i, \ten{F}_{y_i})$ from the joint distribution. We start the estimation process by demeaning the data. Then, only the reduction matrices $\mat{\beta}_k$ and the scatter matrices $\mat{\Omega}_k$ need to be estimated. To solve the optimization problem \eqref{eq:mle} with $\overline{\ten{\eta}} = 0$, we initialize the parameters using a simple heuristic. First, we compute moment-based mode-wise marginal covariance estimates $\widehat{\mat{\Sigma}}_k(\ten{X})$ and $\widehat{\mat{\Sigma}}_k(\ten{F}_Y)$ as





\begin{displaymath}





\widehat{\mat{\Sigma}}_k(\ten{X}) = \frac{1}{n}\sum_{i = 1}^{n} (\ten{X}_i)_{(k)}\t{(\ten{X}_i)_{(k)}}, \qquad





\widehat{\mat{\Sigma}}_k(\ten{F}_Y) = \frac{1}{n}\sum_{i = 1}^{n} (\ten{F}_{y_i})_{(k)}\t{(\ten{F}_{y_i})_{(k)}}.





\end{displaymath}





Then, for every mode $k = 1, \ldots, r$, we compute the first $j = 1, \ldots, q_k$ eigenvectors $\mat{v}_j(\widehat{\mat{\Sigma}}_k(\ten{X}))$, $\mat{v}_j(\widehat{\mat{\Sigma}}_k(\ten{F}_Y))$ and eigenvalues $\lambda_j(\widehat{\mat{\Sigma}}_k(\ten{X}))$, $\lambda_j(\widehat{\mat{\Sigma}}_k(\ten{F}_Y))$ of the marginal covariance estimates. We set




\begin{align*}





\mat{U}_k &= (\mat{v}_1(\widehat{\mat{\Sigma}}_k(\ten{X})), \ldots, \mat{v}_{q_k}(\widehat{\mat{\Sigma}}_k(\ten{X}))), \\
\mat{D}_k &= \diag(\lambda_1(\widehat{\mat{\Sigma}}_k(\ten{X}))\lambda_1(\widehat{\mat{\Sigma}}_k(\ten{F}_{Y})), \ldots, \lambda_{q_k}(\widehat{\mat{\Sigma}}_k(\ten{X}))\lambda_{q_k}(\widehat{\mat{\Sigma}}_k(\ten{F}_{Y}))), \\
\mat{V}_k &= (\mat{v}_1(\widehat{\mat{\Sigma}}_k(\ten{F}_Y)), \ldots, \mat{v}_{q_k}(\widehat{\mat{\Sigma}}_k(\ten{F}_Y))).





\end{align*}





The initial value of $\mat{\beta}_k$ is





\begin{displaymath}





\hat{\mat{\beta}}_k^{(0)} = \mat{U}_k\sqrt{\mat{D}_k}\t{\mat{V}_k},





\end{displaymath}





and the initial value of $\mat{\Omega}_k$ is set to the identity $\mat{\Omega}_k^{(0)} = \mat{I}_{p_k}$, for $k=1,\ldots,r$.
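The initialization heuristic can be sketched as follows. This is our own illustrative NumPy code, not the authors' implementation; in particular, reading $\mat{D}_k$ as the diagonal matrix of products of leading eigenvalues is an assumption consistent with the text above, and `mode_unfold` is a hypothetical helper for the mode-$k$ matricization.

```python
import numpy as np

def mode_unfold(T, k):
    # Mode-k matricization: rows index axis k, columns the remaining axes.
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1, order="F")

def init_gmlm(Xs, Fs):
    # Xs: (n, p_1, ..., p_r) demeaned predictors, Fs: (n, q_1, ..., q_r)
    n, r = Xs.shape[0], Xs.ndim - 1
    betas, Omegas = [], []
    for k in range(r):
        # mode-wise marginal covariance estimates of X and F_Y
        Sx = sum(mode_unfold(X, k) @ mode_unfold(X, k).T for X in Xs) / n
        Sf = sum(mode_unfold(F, k) @ mode_unfold(F, k).T for F in Fs) / n
        lx, Ux = np.linalg.eigh(Sx)  # eigh returns ascending eigenvalues
        lf, Vf = np.linalg.eigh(Sf)
        q = Fs.shape[1 + k]
        Ux, lx = Ux[:, ::-1][:, :q], lx[::-1][:q]  # leading q_k eigenpairs
        Vf, lf = Vf[:, ::-1][:, :q], lf[::-1][:q]
        # beta_k^(0) = U_k sqrt(D_k) V_k' with D_k the eigenvalue products
        betas.append((Ux * np.sqrt(lx * lf)) @ Vf.T)
        Omegas.append(np.eye(Xs.shape[1 + k]))  # Omega_k^(0) = I
    return betas, Omegas

# tiny usage example with hypothetical data
rng = np.random.default_rng(0)
Xs = rng.standard_normal((5, 2, 3))  # n = 5 samples of a 2 x 3 predictor
Fs = rng.standard_normal((5, 2, 2))
betas, Omegas = init_gmlm(Xs, Fs)
```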










Given $\hat{\mat{\beta}}_1, \ldots, \hat{\mat{\beta}}_r, \hat{\mat{\Omega}}_1, \ldots, \hat{\mat{\Omega}}_r$, we take the gradient $\nabla_{\mat{\beta}_j}l_n$ of the tensor normal log-likelihood $l_n$ in \eqref{eq:loglikelihood} applying \cref{thm:grad}, keeping all parameters except $\mat{\beta}_j$ fixed. Then, $\nabla_{\mat{\beta}_j}l_n = 0$ has the closed form solution





\begin{equation}\label{eq:tensor_normal_beta_solution}





\t{\mat{\beta}_j} = \biggl(
\sum_{i = 1}^{n}
\Bigl(\ten{F}_{y_i}\mlm_{k \neq j}\hat{\mat{\beta}}_k\Bigr)_{(j)}
\t{\Bigl(\ten{F}_{y_i}\mlm_{k \neq j}\hat{\mat{\Omega}}_k^{-1}\hat{\mat{\beta}}_k\Bigr)_{(j)}}
\biggr)^{-1}
\biggl(
\sum_{i = 1}^{n}
\Bigl(\ten{F}_{y_i}\mlm_{k \neq j}\hat{\mat{\beta}}_k\Bigr)_{(j)}
\t{(\ten{X}_i)_{(j)}}
\biggr)
\hat{\mat{\Omega}}_j.





\end{equation}





%For the scatter matrices $\mat{\Omega}_j$, we need to fudge a bit.
Equating the partial gradient of the $j$th scatter matrix $\mat{\Omega}_j$ in \cref{thm:grad} to zero ($\nabla_{\mat{\Omega}_j}l_n = 0$) gives a quadratic matrix equation, due to the dependence of $\ten{\mu}_y$ on $\mat{\Omega}_j$. In practice, though, it is faster, more stable, and equally accurate to use mode-wise covariance estimates via the residuals





\begin{displaymath}





\hat{\ten{R}}_i = \ten{X}_i - \hat{\ten{\mu}}_{y_i} = \ten{X}_i - \ten{F}_{y_i}\mlm_{k = 1}^{r}\hat{\mat{\Omega}}_k^{-1}\hat{\mat{\beta}}_k.





\end{displaymath}





The estimates are computed via





\begin{displaymath}





\tilde{\mat{\Sigma}}_j = \sum_{i = 1}^n (\hat{\ten{R}}_i)_{(j)} \t{(\hat{\ten{R}}_i)_{(j)}},





\end{displaymath}





where $\tilde{s}\tilde{\mat{\Sigma}}_j = \hat{\mat{\Omega}}_j^{-1}$. For the scaling we use that the mean squared error has to equal the trace of the covariance estimate,





\begin{displaymath}





\frac{1}{n}\sum_{i = 1}^n \langle \hat{\ten{R}}_i, \hat{\ten{R}}_i \rangle = \tr\bigkron_{k = r}^{1}\hat{\mat{\Omega}}_k^{-1} = \prod_{k = 1}^{r}\tr{\hat{\mat{\Omega}}_k^{-1}} = \tilde{s}^r\prod_{k = 1}^{r}\tr{\tilde{\mat{\Sigma}}_k},





\end{displaymath}




so that
\begin{displaymath}





\tilde{s} = \biggl(\Bigl(\prod_{k = 1}^{r}\tr{\tilde{\mat{\Sigma}}_k}\Bigr)^{-1}\frac{1}{n}\sum_{i = 1}^n \langle \hat{\ten{R}}_i, \hat{\ten{R}}_i \rangle\biggr)^{1 / r}





\end{displaymath}





resulting in the estimates $\hat{\mat{\Omega}}_j = (\tilde{s}\tilde{\mat{\Sigma}}_j)^{-1}$.
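The residual-based scatter update, including the scaling factor $\tilde{s}$, can be sketched as follows. This is our own NumPy code under the stated formulas; `mode_unfold` is a hypothetical helper for the mode-$j$ matricization.

```python
import numpy as np

def mode_unfold(T, k):
    # Mode-k matricization: rows index axis k, columns the remaining axes.
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1, order="F")

def update_Omegas(R):
    # R: (n, p_1, ..., p_r) stack of residuals R_i = X_i - mu_{y_i}
    n, r = R.shape[0], R.ndim - 1
    # tilde Sigma_j = sum_i (R_i)_(j) (R_i)_(j)'
    Sigmas = [sum(mode_unfold(Ri, j) @ mode_unfold(Ri, j).T for Ri in R)
              for j in range(r)]
    mse = np.sum(R * R) / n  # (1/n) sum_i <R_i, R_i>
    # tilde s = ((prod_k tr Sigma_k)^(-1) * mse)^(1/r)
    s = (mse / np.prod([np.trace(S) for S in Sigmas])) ** (1.0 / r)
    return [np.linalg.inv(s * S) for S in Sigmas]  # Omega_j = (s Sigma_j)^(-1)

rng = np.random.default_rng(0)
R = rng.standard_normal((10, 2, 3))
Omegas = update_Omegas(R)
```

By construction, $\prod_k \tr\hat{\mat{\Omega}}_k^{-1}$ recovers the mean squared residual norm, matching the trace identity above.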





Estimation then proceeds by updating the estimates $\hat{\mat{\beta}}_j$ via \eqref{eq:tensor_normal_beta_solution} for $j = 1, \ldots, r$, and then recomputing the $\hat{\mat{\Omega}}_j$ estimates simultaneously, keeping the $\hat{\mat{\beta}}_j$'s fixed. This procedure is repeated until convergence. % Convergence is very fast; experiments showed that it usually occurs in fewer than $10$ iterations.










A technical detail for numerical stability is to ensure that the scaled values $\tilde{s}\tilde{\mat{\Sigma}}_j$, assumed to be symmetric and positive definite, are well conditioned. Thus, we estimate the condition number of $\tilde{s}\tilde{\mat{\Sigma}}_j$ prior to computing the inverse. In case of ill-conditioning, we use the regularized $\hat{\mat{\Omega}}_j = (\tilde{s}\tilde{\mat{\Sigma}}_j + 0.2 \lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)\mat{I}_{p_j})^{-1}$ instead, where $\lambda_{1}(\tilde{s}\tilde{\mat{\Sigma}}_j)$ is the first (maximum) eigenvalue. Experiments showed that this regularization is usually only required in the first few iterations.
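The safeguard can be sketched as below (our own code); the condition-number threshold is an illustrative choice of ours, not a value from the text.

```python
import numpy as np

def safe_inverse_spd(A, ridge=0.2, max_cond=1e8):
    # Invert an (assumed SPD) matrix, adding 0.2 * lambda_max * I when
    # ill-conditioned; max_cond is our illustrative threshold.
    if np.linalg.cond(A) > max_cond:
        A = A + ridge * np.linalg.eigvalsh(A)[-1] * np.eye(A.shape[0])
    return np.linalg.inv(A)
```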










Furthermore, if the parameter space follows a more general setting, as in \cref{thm:parammanifold}, updating may produce estimates outside the parameter space. A simple and efficient remedy is to project every updated estimate onto the corresponding manifold.










%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





\subsection{Ising Model}\label{sec:ising_estimation}





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





The general distribution of a binary vector is modeled by the \emph{multivariate Bernoulli distribution} (\textcite{GraphicalModelsWhittaker2009, MVBDai2012, MVBDaiDingWahba2013}). The \emph{Ising model} (\textcite{IsingIsing1924}) is a special case, considering only two-way interactions. Its probability mass function (PMF) for a binary random vector $X\in\{ 0, 1 \}^p$ with natural parameters $\mat{\gamma}\in\mathbb{R}^{p(p + 1) / 2}$ is given by





\begin{displaymath}





P_{\mat{\gamma}}(\mat{x}) = p_0(\mat{\gamma})\exp(\t{\vech(\mat{x}\t{\mat{x}})}\mat{\gamma}).





\end{displaymath}





The scaling factor $p_0(\mat{\gamma})\in\mathbb{R}_{+}$ ensures that $P_{\mat{\gamma}}$ is a PMF and equals the probability of the zero event, $P(X = \mat{0}) = p_0(\mat{\gamma})$. The reciprocal of $p_0$, more commonly known as the \emph{partition function}, is given by





\begin{equation}\label{eq:isingpartitionfunction}





p_0(\mat{\gamma})^{1} = \sum_{\mat{x}\in\{0, 1\}^p}\exp(\t{\vech(\mat{x}\t{\mat{x}})}\mat{\gamma}).





\end{equation}
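For small $p$, \eqref{eq:isingpartitionfunction} can be evaluated by brute force, which also makes the $2^p$ cost discussed later tangible. A minimal NumPy sketch with our own helper names:

```python
import numpy as np
from itertools import product

def vech(M):
    # half-vectorization: stack the lower triangular columns of M
    return np.concatenate([M[j:, j] for j in range(M.shape[0])])

def p0(gamma, p):
    # p_0(gamma)^(-1) = sum over all 2^p binary vectors (the costly part)
    Z = sum(np.exp(vech(np.outer(x, x)) @ gamma)
            for x in map(np.array, product([0, 1], repeat=p)))
    return 1.0 / Z
```

For $\mat{\gamma} = 0$ every summand equals one, so $p_0 = 2^{-p}$, i.e. the uniform distribution on $\{0,1\}^p$.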





By an abuse of notation, let $\mat{\gamma}_{j l}$ denote the element of $\mat{\gamma}$ corresponding to $\mat{x}_j\mat{x}_l$ in $\vech(\mat{x}\t{\mat{x}})$\footnote{Specifically, the element $\mat{\gamma}_{j l}$ of $\mat{\gamma}$ is a shorthand for $\mat{\gamma}_{\iota(j, l)}$ with $\iota(j, l) = (\min(j, l) - 1)(2 p - \min(j, l)) / 2 + \max(j, l)$ mapping the matrix row index $j$ and column index $l$ to the corresponding half-vectorization index $\iota(j, l)$.}. The ``diagonal'' parameter $\mat{\gamma}_{j j}$ expresses the conditional log odds of $X_j = 1\mid X_{-j} = \mat{0}$, where the negative subscript in $X_{-j}$ denotes the $p - 1$ dimensional vector $X$ with the $j$th element removed. The ``off-diagonal'' parameters $\mat{\gamma}_{j l}$, for $j\neq l$, are equal to the conditional log odds of the simultaneous occurrence $X_j = 1, X_l = 1 \mid X_{-(j, l)} = \mat{0}$. More precisely, for $j\neq l$, the conditional probabilities and the natural parameters are related by





\begin{align}





\mat{\gamma}_{j j} &= \log\frac{P_{\mat{\gamma}}(X_j = 1\mid X_{-j} = \mat{0})}{1 - P_{\mat{\gamma}}(X_j = 1\mid X_{-j} = \mat{0})}, \nonumber \\





\mat{\gamma}_{j l} &= \log\frac{1 - P_{\mat{\gamma}}(X_j = 1\mid X_{-j} = \mat{0})P_{\mat{\gamma}}(X_l = 1\mid X_{-l} = \mat{0})}{P_{\mat{\gamma}}(X_j = 1\mid X_{-j} = \mat{0})P_{\mat{\gamma}}(X_l = 1\mid X_{-l} = \mat{0})}\frac{P_{\mat{\gamma}}(X_j = 1, X_l = 1\mid X_{-(j, l)} = \mat{0})}{1 - P_{\mat{\gamma}}(X_j = 1, X_l = 1\mid X_{-(j, l)} = \mat{0})} \label{eq:isingtwowaylogodds}.





\end{align}
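The log-odds relations above can be verified numerically by brute force for small $p$. The sketch below (our own code, all helper names ours) recomputes the conditional probabilities from an unnormalized PMF, in which the partition function cancels, and checks them against $\mat{\gamma}$.

```python
import numpy as np

def vech(M):
    # half-vectorization: stack the lower triangular columns of M
    return np.concatenate([M[j:, j] for j in range(M.shape[0])])

p = 3
rng = np.random.default_rng(1)
gamma = rng.standard_normal(p * (p + 1) // 2)

def P(x):  # unnormalized PMF; the normalization cancels in all odds below
    return np.exp(vech(np.outer(x, x)) @ gamma)

e = np.eye(p, dtype=int)
j, l = 0, 2
# conditional probabilities given all other components are zero
Pj = P(e[j]) / (P(e[j]) + P(np.zeros(p)))
Pl = P(e[l]) / (P(e[l]) + P(np.zeros(p)))
Pjl = P(e[j] + e[l]) / (P(np.zeros(p)) + P(e[j]) + P(e[l]) + P(e[j] + e[l]))

def iota(j, l):  # 0-based version of the footnote's half-vectorization index
    jj, ll = min(j, l) + 1, max(j, l) + 1
    return (jj - 1) * (2 * p - jj) // 2 + ll - 1

gamma_jj = np.log(Pj / (1 - Pj))
gamma_jl = np.log((1 - Pj * Pl) / (Pj * Pl) * Pjl / (1 - Pjl))
assert np.isclose(gamma_jj, gamma[iota(j, j)])
assert np.isclose(gamma_jl, gamma[iota(j, l)])
```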





Conditional Ising models, which incorporate the information of covariates $Y$ into the model, have also been considered by \textcite{sparseIsingChengEtAt2014, sdrmixedPredictorsBuraForzaniEtAl2022}. The direct way is to parameterize $\mat{\gamma} = \mat{\gamma}_y$ by the covariate $Y = y$ to model the conditional distribution $P_{\mat{\gamma}_y}(\mat{x}\mid Y = y)$.










We extend the conditional PMF by allowing the binary variables to be tensor-valued; that is, for $\ten{X}\in\{ 0, 1 \}^{p_1\times\cdots\times p_r}$ we set $\mat{x} = \vec{\ten{X}}$, with dimension $p = \prod_{k = 1}^{r}p_k$. We account for the tensor structure of $\ten{X}$ by imposing Kronecker product constraints on the parameter vector $\mat{\gamma}_y$, in a similar fashion as for the tensor normal model. This means that we compare the PMF $P_{\mat{\gamma}_y}(\vec{\ten{X}} \mid Y = y)$ with the quadratic exponential family \eqref{eq:quadraticexpfam} with the natural parameters modeled by \eqref{eq:eta1} and \eqref{eq:eta2}. A detail to be considered is that the diagonal of $(\vec{\ten{X}})\t{(\vec{\ten{X}})}$ is equal to $\vec{\ten{X}}$. This gives the GMLM model as





\begin{align}





P_{\mat{\gamma}_y}(\ten{X} \mid Y = y)





&= p_0(\mat{\gamma}_y)\exp(\t{\vech((\vec{\ten{X}})\t{(\vec{\ten{X}})})}\mat{\gamma}_y) \nonumber \\





&= p_0(\mat{\gamma}_y)\exp\Bigl(\Bigl\langle \ten{X}, \ten{F}_y\mlm_{k = 1}^{r}\mat{\beta}_k \Bigr\rangle + \Bigl\langle\ten{X}\mlm_{k = 1}^{r}\mat{\Omega}_k, \ten{X}\Bigr\rangle\Bigr)\label{eq:isingcondprob}





\end{align}





where we set $\overline{\ten{\eta}} = 0$ and $\mat{T}_2$ to the identity. This is an additional constraint on the model; the reason is that the diagonal elements of $\mat{\Omega} = \bigkron_{k = r}^{1}\mat{\Omega}_k$ take the role of $\overline{\ten{\eta}}$, although not fully. Having both the diagonal of $\mat{\Omega}$ and $\overline{\ten{\eta}}$ handle the self-interaction effects might lead to interference in the optimization routine. Another approach would be to use the $\mat{T}_2$ matrix to set the corresponding diagonal elements of $\mat{\Omega}$ to zero and let $\overline{\ten{\eta}}$ handle the self-interaction effects. All of these approaches, namely setting $\overline{\ten{\eta}} = 0$, keeping $\overline{\ten{\eta}}$, or using $\mat{T}_2$, are theoretically sound and compatible with \cref{thm:grad,thm:parammanifold,thm:asymptoticnormalitygmlm}, assuming all axis dimensions $p_k$ are non-degenerate, that is, $p_k > 1$ for all $k = 1, \ldots, r$. Regardless, under our modeling choice we obtain the relation between the natural parameters $\mat{\gamma}_y$ of the conditional Ising model and the GMLM parameters $\mat{\beta}_k$ and $\mat{\Omega}_k$ as





\begin{equation}\label{eq:isingnaturalparams}
\t{\pinv{\mat{D}_p}}\mat{\gamma}_y
= \vec(\mat{\Omega} + \diag(\mat{B}\vec{\ten{F}_y})).
\end{equation}










%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





\subsubsection{Slightly Bigger Dimensions}\label{sec:isingbiggerdim}





A big challenge for the Ising model is its high computational complexity, as it involves summing over all binary vectors of length $p = \prod_{k = 1}^{r}p_k$ in the partition function \eqref{eq:isingpartitionfunction}. Computing the partition function exactly requires summing all $2^p$ binary vectors. For small dimensions, say $p\approx 10$, this is easily computed. Increasing the dimension beyond $20$ becomes extremely expensive, and for dimensions bigger than $30$ it is infeasible. Trying to avoid the evaluation of the log-likelihood and only computing its partial gradients via \cref{thm:grad} does not resolve the issue. The gradients require the inverse link, that is, the second moment \eqref{eq:isingm2}, which, even after dropping the scaling factor $p_0$, still involves a sum of $2^p$ summands. Consequently, with our model, optimization of the Ising model using exactly computed gradients is infeasible for moderately sized problems.










For the estimation of dimensions $p$ bigger than $20$, we use a Monte Carlo method to estimate the second moment \eqref{eq:isingm2}, which is required to compute the partial gradients of the log-likelihood. Specifically, we use a Gibbs sampler to draw from the conditional distribution and approximate the second moment in an importance sampling framework. This can be implemented efficiently, and the estimation accuracy of the second moment, evaluated experimentally, appears to be very reliable. Simultaneously, we use the same approach to estimate the partition function. This estimate, however, is comparatively inaccurate and may only be used to get a rough idea of the log-likelihood. Regardless, our method only needs the gradient for optimization, where appropriate stopping criteria, not based on the likelihood, lead to a working procedure for MLE estimation.
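The sampling step can be sketched as follows. This is a plain Monte Carlo average over Gibbs draws under a generic symmetric Ising parameterization, not the paper's importance sampling implementation, and all function names are illustrative only.

```python
import numpy as np

def gibbs_sample_ising(theta, n_samples, burn_in=100, rng=None):
    """Draw approximate samples from an Ising-type model with symmetric
    parameter matrix `theta` using single-site Gibbs updates."""
    rng = np.random.default_rng(rng)
    p = theta.shape[0]
    x = rng.integers(0, 2, size=p).astype(float)
    samples = np.empty((n_samples, p))
    for it in range(burn_in + n_samples):
        for j in range(p):
            # conditional log-odds of x_j = 1 given all other components
            x[j] = 0.0
            eta = theta[j, j] + 2.0 * (theta[j] @ x)
            x[j] = float(rng.random() < 1.0 / (1.0 + np.exp(-eta)))
        if it >= burn_in:
            samples[it - burn_in] = x
    return samples

def second_moment_mc(theta, n_samples=2000, rng=None):
    """Monte Carlo estimate of the second moment E[x x']."""
    s = gibbs_sample_ising(theta, n_samples, rng=rng)
    return s.T @ s / n_samples
```

With `theta = 0` the components are independent Bernoulli$(1/2)$, so the estimated second moment should be close to $0.5$ on the diagonal and $0.25$ off the diagonal.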















\subsection{Kronecker Product Manifolds}\label{sec:kronmanifolds}





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%










\Cref{thm:sdr} identifies the sufficient reduction for the regression of $Y$ on $\ten{X}$ in the population. Any estimation of the sufficient reduction requires application of some optimality criterion. As we operate within the framework of the exponential family, we opted for maximum likelihood estimation (MLE). For the unconstrained problem, where the parameters are simply $\mat{B}$ and $\mat{\Omega}$ in \eqref{eq:eta1manifold}, maximizing the likelihood of $\ten{X} \mid Y$ is straightforward and yields well-defined MLEs of both parameters. Our setting, though, requires the constrained optimization of the $\ten{X} \mid Y$ likelihood subject to $\mat{B} = \bigotimes_{j = r}^{1}\mat{\beta}_j$ and $\mat{\Omega}=\bigkron_{j = r}^{1}\mat{\Omega}_j$. \Cref{thm:kronmanifolds,thm:parammanifold} provide the setting for which the MLE of the constrained parameter $\mat{\theta}$ is well-defined, which in turn leads to the derivation of its asymptotic normality.










The main problem in obtaining asymptotic results for the MLE of the constrained parameter $\mat{\theta} = (\overline{\ten{\eta}}, \vec\mat{B}, \vech\mat{\Omega})$ stems from the nature of the constraint. We assumed that $\mat{B} = \bigkron_{k = r}^{1}\mat{\beta}_k$, where the parameter $\mat{B}$ is identifiable. This means that different values of $\mat{B}$ lead to different densities $f_{\mat{\theta}}(\ten{X}\mid Y = y)$, a basic property needed to ensure consistency of parameter estimates, which in turn is needed for asymptotic normality. On the other hand, the components $\mat{\beta}_j$, $j = 1, \ldots, r$, are \emph{not} identifiable, a direct consequence of the identity $\mat{\beta}_2\otimes\mat{\beta}_1 = (c\mat{\beta}_2)\otimes (c^{-1}\mat{\beta}_1)$ for every $c\neq 0$. This is the reason we formulated $\Theta$ as a constrained parameter space instead of parameterizing the densities of $\ten{X}\mid Y$ with respect to the components $\mat{\beta}_1, \ldots, \mat{\beta}_r$. The same holds for $\mat{\Omega} = \bigkron_{k = r}^{1}\mat{\Omega}_k$.










In addition to identifiable parameters, the asymptotic normality obtained in \cref{thm:asymptoticnormalitygmlm} requires differentiation. Therefore, the parameter space needs to admit a notion of differentiation, usually by being a vector space. This is too strong an assumption for our purposes. To weaken it, we consider \emph{smooth manifolds}, spaces which locally look like Euclidean space and allow the notion of differentiation. The more general \emph{topological} manifolds are too weak for differentiation. To make matters worse, a smooth manifold only allows first derivatives. Without going into details, the solution is a \emph{Riemannian manifold}. Like abstract smooth manifolds, Riemannian manifolds are detached from our usual intuition as well as complicated to handle in an already complicated setting. This is where \emph{embedded (sub)manifolds} come to the rescue. Simply speaking, an embedded manifold is a subset of another manifold from which it inherits its properties. If a manifold is embedded in Euclidean space, almost all the complications of abstract manifold theory simplify drastically. Moreover, since Euclidean space is itself a Riemannian manifold, we inherit the means for higher derivatives. Finally, the smooth embedded submanifold structure of the parameter space maintains consistency with existing approaches and results for parameter sets with linear subspace structure. These reasons justify requiring the constrained parameter space $\Theta$ to be a \emph{smooth embedded submanifold} of an open subset $\Xi$ of Euclidean space.










Now, we directly define a \emph{smooth manifold} embedded in $\mathbb{R}^p$ without any detours into the more general theory. See, for example, \textcite{introToSmoothManiLee2012,introToRiemannianManiLee2018,optimMatrixManiAbsilEtAl2007,aufbauAnalysiskaltenbaeck2021}, among others.





\begin{definition}[Manifolds]\label{def:manifold}





A set $\manifold{A}\subseteq\mathbb{R}^p$ is an \emph{embedded smooth manifold} of dimension $d$ if for every $\mat{x}\in\manifold{A}$ there exists a smooth\footnote{Here \emph{smooth} means infinitely differentiable or $C^{\infty}$.} bicontinuous map $\varphi:U\cap\manifold{A}\to V$, called a \emph{chart}, with $\mat{x}\in U\subseteq\mathbb{R}^p$ open and $V\subseteq\mathbb{R}^d$ open.





\end{definition}










We also need the concept of a \emph{tangent space} to formulate asymptotic normality in a way which is independent of a particular coordinate representation. Intuitively, the tangent space at a point $\mat{x}\in\manifold{A}$ of the manifold $\manifold{A}$ is the span of all velocity vectors $\t{\nabla\gamma(0)}$ of smooth curves $\gamma:(-1, 1)\to\manifold{A}$ passing through $\mat{x} = \gamma(0)$, see \cref{fig:torus}. Locally, at $\mat{x} = \gamma(0)$ with a chart $\varphi$, we can write $\gamma(t) = \varphi^{-1}(\varphi(\gamma(t)))$, which gives $\Span\t{\nabla\gamma(0)} \subseteq \Span\t{\nabla\varphi^{-1}(\varphi(\mat{x}))}$. Taking the union over all smooth curves through $\mat{x}$ gives equality. The following definition leverages the simplified setup of smooth manifolds in Euclidean space.
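As a concrete illustration (not from the text), the unit circle in $\mathbb{R}^2$ is a one-dimensional embedded manifold, and the velocity vector of a curve through a point spans its tangent space, which is orthogonal to the point itself.

```python
import numpy as np

# The unit circle {x : ||x|| = 1} in R^2 is a 1-dimensional embedded
# manifold. A smooth curve through x = (1, 0) is gamma(t) = (cos t, sin t),
# whose velocity at t = 0 spans the tangent space at x.
t = 0.0
point = np.array([np.cos(t), np.sin(t)])
velocity = np.array([-np.sin(t), np.cos(t)])   # d/dt (cos t, sin t)

# At (1, 0) the tangent space is the vertical axis {(0, v)}.
assert np.allclose(velocity, [0.0, 1.0])
# The tangent direction is orthogonal to the point, as expected on a sphere.
assert np.isclose(velocity @ point, 0.0)
```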










\begin{definition}[Tangent Space]\label{def:tangentspace}





Let $\manifold{A}\subseteq\mathbb{R}^p$ be an embedded smooth manifold and $\mat{x}\in\manifold{A}$. The \emph{tangent space} at $\mat{x}$ of $\manifold{A}$ is defined as




\begin{displaymath}
T_{\mat{x}}\manifold{A} = \Span\t{\nabla\varphi^{-1}(\varphi(\mat{x}))}
\end{displaymath}
for any chart $\varphi$ with $\mat{x}$ in its domain.
\end{definition}




















To ensure that the constrained parameter space $\Theta$ is a manifold, which is the statement of \cref{thm:parammanifold}, we need \cref{thm:kronmanifolds}. For this we require the notion of a \emph{spherical} set, that is, a set $\manifold{A}$ on which the Frobenius norm $\|\,.\,\|_F:\manifold{A}\to\mathbb{R}$ is constant. Furthermore, we call a set $\manifold{A}$ a \emph{cone} if it is scale invariant, that is, $\manifold{A} = \{ c \mat{A} : \mat{A}\in\manifold{A} \}$ for all $c > 0$.
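The two properties are easy to check numerically. The sketch below (illustrative, not from the text) verifies that matrices with orthonormal columns form a spherical set, since $\|\mat{U}\|_F = \sqrt{q}$ whenever $\t{\mat{U}}\mat{U} = \mat{I}_q$, and that the symmetric positive definite matrices form a cone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Spherical set: matrices with orthonormal columns (Stiefel-type set)
# all have Frobenius norm sqrt(q), since ||U||_F^2 = trace(U'U) = q.
p, q = 5, 3
u, _ = np.linalg.qr(rng.standard_normal((p, q)))   # U'U = I_q
assert np.isclose(np.linalg.norm(u), np.sqrt(q))

# Cone: symmetric positive definite matrices are scale invariant,
# i.e. c * Omega is again SPD for every c > 0.
b = rng.standard_normal((4, 4))
spd = b @ b.T + 4.0 * np.eye(4)                    # SPD by construction
for c in (0.5, 2.0, 10.0):
    assert (np.linalg.eigvalsh(c * spd) > 0).all()
```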










\begin{theorem}[Kronecker Product Manifolds]\label{thm:kronmanifolds}





Let $\manifold{A}\subseteq\mathbb{R}^{p_1\times q_1}\backslash\{\mat{0}\}, \manifold{B}\subseteq\mathbb{R}^{p_2\times q_2}\backslash\{\mat{0}\}$ be smooth embedded submanifolds. Assume one of the following conditions holds.




\begin{displaymath}
\manifold{K}_{\mat{B}} = \Bigl\{ \bigkron_{k = r}^{1}\mat{\beta}_k : \mat{\beta}_k\in\manifold{B}_k \Bigr\}





\quad\text{and}\quad





\manifold{K}_{\mat{\Omega}} = \Bigl\{ \bigkron_{k = r}^{1}\mat{\Omega}_k : \mat{\Omega}_k\in\manifold{O}_k \Bigr\}





\end{displaymath}










where $\manifold{B}_k\subset\mathbb{R}^{p_k\times q_k}\backslash\{\mat{0}\}$ and $\manifold{O}_k\subset\mathbb{R}^{p_k\times p_k}\backslash\{\mat{0}\}$ are smooth embedded manifolds which are either spheres or cones, for $k = 1, ..., r$. Furthermore, let





\begin{displaymath}





\manifold{CK}_{\mat{\Omega}} = \{ \vech{\mat{\Omega}} : \mat{\Omega}\in\manifold{K}_{\mat{\Omega}} \land \pinv{(\mat{T}_2\pinv{\mat{D}_p})}\mat{T}_2\pinv{\mat{D}_p}\vec{\mat{\Omega}} = \vec{\mat{\Omega}} \}





\end{displaymath}









\end{theorem}










\subsection{Matrix Manifolds}\label{sec:matrixmanifolds}





A powerful side effect of \cref{thm:parammanifold} is the modeling flexibility it provides. For example, we can perform low-rank regression, or we may constrain two-way interactions between direct axis neighbors by using band matrices for the $\mat{\Omega}_k$'s, among others.










This flexibility derives from the many different matrix manifolds that can be used as building blocks $\manifold{B}_k$ and $\manifold{O}_k$ of the parameter space $\Theta$ in \cref{thm:parammanifold}. A list of possible choices, among others, is given in \cref{tab:matrixmanifolds}. As long as the parameters in $\Theta$ are valid parameterizations of a density (or PMF) of \eqref{eq:quadraticexpfam} subject to \eqref{eq:eta1manifold} and \eqref{eq:eta2manifold}, one may choose any of the manifolds listed in \cref{tab:matrixmanifolds} which are either cones or spherical. We also included an example which is neither a sphere nor a cone. Such sets may also be valid building blocks, but require more work as they do not directly lead to a parameter manifold via \cref{thm:parammanifold}. If one can show that the resulting parameter space $\Theta$ is an embedded manifold, the asymptotic theory of \cref{sec:asymtotics} applies.










\begin{table}





\centering




\begin{tabular}{l l c c l}





Symbol & Description & C & S & Dimension\\ \hline





$\mathbb{R}^{p\times q}$ & All matrices of dimension $p\times q$ &





\checkmark & \xmark & $p q$ \\ \hline










$\mathbb{R}_{*}^{p\times q}$ & Full rank $p\times q$ matrices &





\checkmark & \xmark & $p q$ \\ \hline





$\Stiefel{p}{q}$ & \emph{Stiefel Manifold}, $\{ \mat{U}\in\mathbb{R}^{p\times q} : \t{\mat{U}}\mat{U} = \mat{I}_q \}$ for $q\leq p$ &





\xmark & \checkmark & $p q  q (q + 1) / 2$ \\ \hline




\end{tabular}
\caption{\label{tab:matrixmanifolds}Matrix manifolds usable as building blocks $\manifold{B}_k$ and $\manifold{O}_k$ in \cref{thm:parammanifold}. The columns C and S indicate whether a manifold is a cone or spherical, respectively.}
\end{table}










\subsection{Asymptotics}\label{sec:asymtotics}










Let $Z$ be a random variable distributed according to a parameterized probability distribution with density $f_{\mat{\theta_0}}\in\{ f_{\mat{\theta}} : \mat{\theta}\in\Theta \}$ where $\Theta$ is a subset of a Euclidean space. We want to estimate the parameter ${\mat{\theta}}_0$ using $n$ i.i.d. (independent and identically distributed) copies of $Z$. We assume a known, real-valued and measurable function $z\mapsto m_{\mat{\theta}}(z)$ for every $\mat{\theta}\in\Theta$ and that ${\mat{\theta}}_0$ is the unique maximizer of the map $\mat{\theta}\mapsto M(\mat{\theta}) = \E m_{\mat{\theta}}(Z)$. For the estimation we maximize the empirical version





\begin{align}\label{eq:Mn}





M_n(\mat{\theta}) &= \frac{1}{n}\sum_{i = 1}^n m_{\mat{\theta}}(Z_i).





\end{align}
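As a toy illustration (not from the text), taking $m_{\mat{\theta}}(z) = -(z - \theta)^2$ gives $M(\theta) = \E m_{\theta}(Z)$ with unique maximizer $\theta_0 = \E Z$, and maximizing the empirical version $M_n$ over a grid recovers the mean of the sample.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(loc=2.0, scale=1.0, size=10_000)   # theta_0 = E[Z] = 2

# m_theta(z) = -(z - theta)^2, so M(theta) = E m_theta(Z) is uniquely
# maximized at theta_0. Maximize the empirical mean M_n over a grid.
grid = np.linspace(0.0, 4.0, 401)
m_n = np.array([-np.mean((z - t) ** 2) for t in grid])
theta_hat = grid[np.argmax(m_n)]   # M-estimator (here: near the sample mean)
```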




An \emph{M-estimator} $\hat{\mat{\theta}}_n = \hat{\mat{\theta}}_n(Z_1, \ldots, Z_n)$ is a maximizer of $M_n$ over $\Theta$,





\begin{displaymath}





\hat{\mat{\theta}}_n = \argmax_{\mat{\theta}\in\Theta} M_n(\mat{\theta}).





\end{displaymath}















It is not necessary to have a perfect maximizer; as long as the objective has a finite supremum, it is sufficient to take an \emph{almost maximizer} $\hat{\mat{\theta}}_n$ as defined in the following:










\begin{definition}[Weak and Strong M-estimators]




An estimator $\hat{\mat{\theta}}_n$ is a \emph{weak} M-estimator over $\Theta$ if $M_n(\hat{\mat{\theta}}_n) \geq \sup_{\mat{\theta}\in\Theta} M_n(\mat{\theta}) - o_P(1)$, and a \emph{strong} M-estimator if $M_n(\hat{\mat{\theta}}_n) \geq \sup_{\mat{\theta}\in\Theta} M_n(\mat{\theta}) - o_P(n^{-1})$, as $n\to\infty$.
\end{definition}




















%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





\subsection{Asymptotic Normality}





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%










The following is a reformulation of \textcite[Lemma~2.3]{asymptoticMLEBuraEtAl2018} which assumes Condition~2.2 to hold. The existence of a mapping in Condition~2.2 is not needed for Lemma~2.3. It suffices that the restricted parameter space $\Theta$ is a subset of the unrestricted parameter space $\Xi$, which is trivially satisfied in our setting. Under this, \cref{thm:existsstrongMestimatoronsubsets} follows directly from \textcite[Lemma~2.3]{asymptoticMLEBuraEtAl2018}.










\begin{theorem}[Existence of Strong M-estimators on Subsets]\label{thm:existsstrongMestimatoronsubsets}





If there exists a (weak/strong) M-estimator $\hat{\mat{\xi}}_n$ for $M_n$ over $\Xi$, then there exists a strong M-estimator $\hat{\mat{\theta}}_n$ for $M_n$ over any nonempty $\Theta\subseteq\Xi$.




\end{theorem}















\begin{theorem}[Existence and Consistency of M-estimators on Subsets]\label{thm:Mestimatorconsistencyonsubsets}










Let $\Xi$ be a convex open subset of a Euclidean space and $\Theta\subseteq\Xi$ nonempty. Assume $\mat{\xi}\mapsto m_{\mat{\xi}}(z)$ is a strictly concave function on $\Xi$ for almost all $z$ and $z\mapsto m_{\mat{\xi}}(z)$ is measurable for all $\mat{\xi}\in\Xi$. Let $M(\mat{\xi}) = \E m_{\mat{\xi}}(Z)$ be a well defined function with a unique maximizer $\mat{\theta}_0\in\Theta\subseteq\Xi$; that is, $M(\mat{\theta}_0) > M(\mat{\xi})$ for all $\mat{\xi}\neq\mat{\theta}_0$. Also, assume





\begin{displaymath}










\E\sup_{\mat{\xi}\in K}m_{\mat{\xi}}(Z) < \infty,





\end{displaymath}










for every nonempty compact $K\subset\Xi$. Then, there exists a strong M-estimator $\hat{\mat{\theta}}_n$ of $M_n(\mat{\theta}) = \frac{1}{n}\sum_{i = 1}^{n} m_{\mat{\theta}}(Z_i)$ over the subset $\Theta$. Moreover, any strong M-estimator $\hat{\mat{\theta}}_n$ of $M_n$ over $\Theta$ converges in probability to $\mat{\theta}_0$, that is, $\hat{\mat{\theta}}_n\xrightarrow{p}\mat{\theta}_0$.





\end{theorem}

























\todo{The assumptions of the following can be a bit weakened, is this neccessary? For example the Hessian can be singular but is nonsingular constrained to the tangent space. We can also only define $\mat{\theta}\mapsto m_{\mat{\theta}}$ only on the manifold which makes the statement much more technical, but way more general while we need to ensure that every smooth local extension of $\mat{\theta}\mapsto m_{\mat{\theta}}$ yields the same statement, which it does, but well, then it gets more complicated! Maybe add these things as a remark? The even more general formulation for Riemannian Manifolds is definitely over the top!}










\begin{theorem}[Asymptotic Normality for M-estimators on Manifolds]\label{thm:Mestimatorasymnormalonmanifolds}





Let $\Theta\subseteq\mathbb{R}^p$ be a smooth embedded manifold. For each $\mat{\theta}$ in a neighborhood in $\mathbb{R}^p$ of the true parameter $\mat{\theta}_0\in\Theta$, let $z\mapsto m_{\mat{\theta}}(z)$ be measurable and $\mat{\theta}\mapsto m_{\mat{\theta}}(z)$ be differentiable at $\mat{\theta}_0$ for almost all $z$. Assume also that there exists a measurable function $u$ with $\E[u(Z)^2] < \infty$ such that, for almost all $z$ and all $\mat{\theta}_1, \mat{\theta}_2$ in a neighborhood of $\mat{\theta}_0$,




\begin{displaymath}
|m_{\mat{\theta}_1}(z) - m_{\mat{\theta}_2}(z)| \leq u(z) \|\mat{\theta}_1 - \mat{\theta}_2\|_2.
\end{displaymath}
\end{theorem}





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





\section{Simulations}





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





In this section, we provide simulation results for the tensor normal as well as the Ising model, where different aspects of the GMLM model are compared against other methods. The comparison methods are tensor sliced inverse regression (TSIR) \textcite{tsirDingCook2015}, MGCCA \textcite{MGCCAGirkaEtAl2024} and the Tucker decomposition, a higher-order form of principal component analysis (HOPCA) \textcite{KoldaBader2009}, for both continuous and binary data. For the latter, the binary values are simply treated as continuous. As a baseline, we also include classic PCA on vectorized observations. \todo{check, fix, ...}










All experiments are performed with different sample sizes $n = 100, 200, 300, 500$ and $750$. Every experiment is repeated $100$ times.










We are interested in the quality of the estimate of the true sufficient reduction $\ten{R}(\ten{X})$ from \cref{thm:sdr}. Therefore, we compare with the true vectorized reduction matrix $\mat{B} = \bigkron_{k = r}^{1}\mat{\beta}_k$, as it is compatible with any linear reduction method. The distance $d(\mat{B}, \hat{\mat{B}})$ between $\mat{B}\in\mathbb{R}^{p\times q}$ and an estimate $\hat{\mat{B}}\in\mathbb{R}^{p\times \tilde{q}}$ is the \emph{subspace distance}, which is proportional to





\begin{displaymath}





d(\mat{B}, \hat{\mat{B}}) \propto \| \mat{B}\pinv{(\t{\mat{B}}\mat{B})}\t{\mat{B}} - \hat{\mat{B}}\pinv{(\t{\hat{\mat{B}}}\hat{\mat{B}})}\t{\hat{\mat{B}}} \|_F,





\end{displaymath}





that is, the Frobenius norm of the difference between the projections onto the spans of $\mat{B}$ and $\hat{\mat{B}}$. The proportionality constant\footnote{It depends on the row dimension $p$ and the ranks of $\mat{B}$ and $\hat{\mat{B}}$ and is given by $(\min(\rank\mat{B} + \rank\hat{\mat{B}}, 2 p - (\rank\mat{B} + \rank\hat{\mat{B}})))^{-1/2}$.} of $d(\mat{B}, \hat{\mat{B}})$ ensures that the subspace distance is in the interval $[0, 1]$. A distance of zero implies equality of the spans; a distance of one means that the subspaces are orthogonal.
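The subspace distance can be computed directly from its definition. A minimal sketch, with the normalizing constant chosen so that the distance lies in $[0, 1]$:

```python
import numpy as np

def subspace_distance(b, b_hat):
    """Normalized subspace distance between span(b) and span(b_hat):
    the Frobenius norm of the difference of the orthogonal projections
    onto the two spans, scaled to lie in [0, 1]."""
    def proj(m):
        # orthogonal projection onto the column span of m
        return m @ np.linalg.pinv(m.T @ m) @ m.T
    p = b.shape[0]
    r = np.linalg.matrix_rank(b) + np.linalg.matrix_rank(b_hat)
    scale = 1.0 / np.sqrt(min(r, 2 * p - r))
    return scale * np.linalg.norm(proj(b) - proj(b_hat))
```

Identical spans give distance $0$ (the distance is invariant to rescaling and, more generally, to any change of basis of the columns), while orthogonal spans give distance $1$.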










%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





\subsection{Tensor Normal}\label{sec:simtensornormal}









%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





\section{Data Analysis}





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





In this section, we perform two \todo{realy two!} applications of the GMLM model on real data. The first example is the tensor normal model applied to EEG data\todo{do this!}. Next, we perform a proof of concept data analysis for chess. The main purpose of choosing chess is twofold. First, we can illustrate an explicit use case for the (until now ignored) linear constraint matrix $\mat{T}_2$. Second, it is of personal interest to one of the authors.\todo{???}















%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%














%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





\subsection{Chess}\label{sec:chess}





The data set is provided by the \citetitle{lichessdatabase}\footnote{\fullcite{lichessdatabase}}. We downloaded the database for November 2023, consisting of more than $92$ million games. We removed all games without position evaluations. The evaluations, also denoted scores, are from Stockfish,\footnote{\fullcite{stockfish}} a free and strong chess engine. The scores take the role of the response $Y$ and correspond to a winning probability from white's point of view. Positive scores are good for white and negative scores indicate an advantage for black. We ignore all highly unbalanced positions, which we define as positions with an absolute score above $5$. We also remove all positions with a mate score (where one side can force checkmate). Furthermore, we only consider positions after $10$ half-moves to avoid oversampling the beginning of the most common openings, including the start position, which occurs in every game. Finally, we only consider positions with white to move. This leads to a final data set of roughly $64$ million positions, including duplicates.















A chess position is encoded as a set of $12$ binary matrices $\ten{X}_{\mathrm{piece}}$ of dimension $8\times 8$. Every binary matrix encodes the placement of a particular piece by containing a $1$ if that piece occupies the corresponding board square. The $12$ pieces derive from the $6$ piece types, namely pawns (\pawn), knights (\knight), bishops (\bishop), rooks (\rook), queens (\queen) and kings (\king), in the two colors black and white. See \cref{fig:fen2tensor} for a visualization.
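As an illustration, this encoding can be sketched as follows. This is a minimal Python sketch; the plane ordering and the plain-text board input are our own conventions, not fixed by the data format.

```python
import numpy as np

# Assumed plane order (our own convention): white P, N, B, R, Q, K,
# then black p, n, b, r, q, k.
PIECES = "PNBRQKpnbrqk"

def encode_board(rows):
    """Encode a board given as 8 strings of 8 characters ('.' = empty)
    into an 8 x 8 x 12 binary tensor X."""
    X = np.zeros((8, 8, 12), dtype=np.uint8)
    for i, row in enumerate(rows):
        for j, ch in enumerate(row):
            if ch != ".":
                X[i, j, PIECES.index(ch)] = 1
    return X

# The start position, with black's back rank in the first row.
start = [
    "rnbqkbnr",
    "pppppppp",
    "........",
    "........",
    "........",
    "........",
    "PPPPPPPP",
    "RNBQKBNR",
]
X = encode_board(start)
```

Each of the $12$ slices `X[:, :, k]` is one of the binary matrices $\ten{X}_{\mathrm{piece}}$.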










\begin{figure}[hp!]





\centering









\caption{\label{fig:fen2tensor}The chess start position and its 3D binary tensor representation; empty entries are $0$.}





\end{figure}















We assume that $\ten{X}_{\mathrm{piece}}\mid Y = y$ follows an Ising GMLM model as in \cref{sec:ising_estimation}, with the conditional piece predictors being independent. The independence assumption is made for the sake of simplicity, even though it clearly does not hold in the underlying true distribution; see \cref{sec:chessdiscussion} for a short discussion of possible improvements. Under this simplifying assumption we obtain a mixture model with the log-likelihood





\begin{displaymath}





l_n(\mat{\theta}) = \frac{1}{12}\sum_{\mathrm{piece}}l_n(\mat{\theta}_{\mathrm{piece}})





\end{displaymath}





where $l_n(\mat{\theta}_{\mathrm{piece}})$ is the Ising GMLM log-likelihood as in \cref{sec:ising_estimation} for $\ten{X}_{\mathrm{piece}}\mid Y = y$. For every component, the same relation to the scores $y$ is modeled via a $2\times 2$ matrix valued function $\ten{F}_y$ consisting of the monomials $1, y, y^2$, specifically $(\ten{F}_y)_{i j} = y^{i + j - 2}$.
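For concreteness, $\ten{F}_y$ is simply the outer product of the monomial vector $(1, y)$ with itself; a minimal sketch:

```python
import numpy as np

def F(y):
    """2 x 2 matrix with (F_y)_{ij} = y**(i + j - 2) for i, j = 1, 2,
    i.e. the outer product of the monomial vector (1, y) with itself."""
    m = np.array([1.0, y])
    return np.outer(m, m)

Fy = F(0.5)  # entries: 1, 0.5 / 0.5, 0.25
```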










Given the raw scale of the data, millions of observations, it is computationally infeasible to compute the gradients on the entire data set. Simply using a computationally manageable subset is not an option due to the high dimension of the binary data: $12$ planes of $8\times 8$ per observation give a total dimension of $768$. The main issue is that a manageable subset, say one million observations, still leads to a degenerate data set. In our simplified mixture model, the pawns are a specific issue, as there are many millions of different configurations of the $8$ pawns per color on the $6\times 8$ sub-grid where pawns can be placed. This alone rules out taking a reasonably sized subset for estimation. The solution is to switch from classic gradient based optimization to a stochastic version, meaning that every gradient update uses a new random subset of the entire data set. We therefore draw independent random samples from the data set of $64$ million positions. The independence of samples derives from the independence of games; every sample is drawn from a different game.
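The stochastic update scheme can be sketched generically as follows. This is an illustration with a toy objective, not the actual Ising GMLM gradient; the batch size, step size, and `grad_loglik` stand-in are placeholders, not the settings used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the data and the gradient of the log-likelihood;
# in the application each sample would be one position (with its score)
# drawn from a distinct game.
n, d = 10_000, 8
data = rng.normal(size=(n, d))
theta = np.zeros(d)

def grad_loglik(theta, batch):
    # Placeholder gradient: pulls theta towards the batch mean.
    return batch.mean(axis=0) - theta

batch_size, lr = 64, 0.1
for step in range(500):
    # Every gradient update uses a fresh random subset of the data.
    idx = rng.choice(n, size=batch_size, replace=False)
    theta += lr * grad_loglik(theta, data[idx])
```

The point of the sketch is only the loop structure: a fresh random subset per update, so no single pass ever needs the full data set in memory.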










\paragraph{Validation:}





Given the nonlinear nature of the reduction, due to the quadratic matrix valued function $\ten{F}_y$ of the score $y$, we use a \emph{generalized additive model}\footnote{using the function \texttt{gam()} from the \texttt{R} package \texttt{mgcv}.} (GAM) to predict position scores from reduced positions. The reduced positions are $48$ dimensional continuous vectors obtained by combining the $12$ mixture components of the $2\times 2$ matrix valued reductions per piece. The per-piece reduction is





\begin{displaymath}





\ten{R}(\ten{X}_{\mathrm{piece}}) = \mat{\beta}_{1,\mathrm{piece}}(\ten{X}_{\mathrm{piece}} - \E\ten{X}_{\mathrm{piece}})\t{\mat{\beta}_{2, \mathrm{piece}}}





\end{displaymath}





which gives the complete $48$ dimensional vectorized reduction by stacking the piece-wise reductions





\begin{displaymath}





\vec{\ten{R}(\ten{X})}
= (\vec{\ten{R}(\ten{X}_{\text{white pawn}})}, \ldots, \vec{\ten{R}(\ten{X}_{\text{black king}})})
= \t{\mat{B}}\vec(\ten{X} - \E\ten{X}).





\end{displaymath}





The second line encodes all the piece-wise reductions in a block diagonal full reduction matrix $\mat{B}$ of dimension $768\times 48$, which is applied to the vectorized 3D tensor $\ten{X}$ combining all the piece components $\ten{X}_{\mathrm{piece}}$ into a single tensor of dimension $8\times 8\times 12$. This is a reduction to $6.25\%$ of the original dimension. The $R^2$ statistic of the GAM fitted on $10^5$ new reduced samples is $R^2_{gam}\approx 42\%$. A linear model on the reduced data achieves only $R^2_{lm}\approx 25\%$, which clearly shows the nonlinear relation. On the other hand, the static evaluation of the \emph{Schach H\"ornchen}\todo{ref/mention/?!?} engine, given the full position (\emph{not} reduced), achieves an $R^2_{hce}\approx 51\%$. The $42\%$ compares reasonably well to the $51\%$ of the engine's static evaluation, which sees the original position and uses chess specific expert knowledge. Features of the static evaluation that the GMLM mixture model is expected to learn are \emph{material} (piece values) and \emph{piece square tables} (PSQT, preferred piece type positions). In addition, the static evaluation includes chess specific features like \emph{king safety}, \emph{pawn structure} or \emph{rooks on open files}. This lets us conclude that the reduction captures most of the relevant features possible, given the oversimplified modeling we performed.
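The equivalence between the stacked per-piece reductions and the single block diagonal matrix $\mat{B}$ can be checked numerically. The following sketch uses random stand-ins for the estimated factors; only the dimensions ($8\times 8$ boards, $2\times 2$ reductions, $12$ pieces) come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random stand-ins for the per-piece factors (each 2 x 8, so that
# beta1 @ X_piece @ beta2.T is 2 x 2) and a centered position tensor.
betas = [(rng.normal(size=(2, 8)), rng.normal(size=(2, 8))) for _ in range(12)]
X = rng.normal(size=(8, 8, 12))

# Stacked per-piece bilinear reductions: a 48-dimensional vector.
R = np.concatenate(
    [(b1 @ X[:, :, k] @ b2.T).ravel(order="F") for k, (b1, b2) in enumerate(betas)]
)

# Block diagonal full reduction matrix B (768 x 48): by the identity
# vec(b1 X b2^T) = (b2 kron b1) vec(X), each diagonal block of B is
# the transposed Kronecker product of the two factors (64 x 4).
blocks = [np.kron(b2, b1).T for b1, b2 in betas]
B = np.zeros((768, 48))
for k, blk in enumerate(blocks):
    B[64 * k:64 * (k + 1), 4 * k:4 * (k + 1)] = blk
R2 = B.T @ X.ravel(order="F")  # agrees with the stacked reductions R
```

The column-major (`order="F"`) vectorization matters: it is what makes the Kronecker identity line up with the block structure.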










\paragraph{Interpretation:} For a compact interpretation of the estimated reduction we construct PSQTs. To do so, we use the linear model from the validation paragraph and rewrite the combined linear reduction and linear model in terms of PSQTs. Let $\mat{B}$ be the $768\times 48$ full vectorized linear reduction. This is the block diagonal matrix with the $64\times 4$ dimensional per-piece reductions $\mat{B}_{\mathrm{piece}} = \mat{\beta}^{\mathrm{piece}}_2\otimes\mat{\beta}^{\mathrm{piece}}_1$. Then, the linear model with coefficients $\mat{b}$ and intercept $a$ on the reduced data is given by





\begin{equation}\label{eq:chesslm}





y = a + \t{\mat{b}}\t{\mat{B}}\vec(\ten{X} - \E\ten{X}) + \epsilon





\end{equation}





with an unknown mean zero error term $\epsilon$, treating the binary tensor $\ten{X}$ as continuous. Decomposing the linear model coefficients $\mat{b}$ into blocks of $4$ gives per-piece coefficients $\mat{b}_{\mathrm{piece}}$, which combine only with the corresponding diagonal blocks $\mat{B}_{\mathrm{piece}}$ of $\mat{B}$. Rewriting \eqref{eq:chesslm} gives





\begin{align*}





y





&= a + \sum_{\mathrm{piece}}\t{(\mat{B}_{\mathrm{piece}}\mat{b}_{\mathrm{piece}})}\vec(\ten{X}_{\mathrm{piece}} - \E\ten{X}_{\mathrm{piece}}) + \epsilon \\





&= \tilde{a} + \sum_{\mathrm{piece}}\langle





\mat{B}_{\mathrm{piece}}\mat{b}_{\mathrm{piece}},





\vec(\ten{X}_{\mathrm{piece}})





\rangle + \epsilon





\end{align*}





with a new intercept term $\tilde{a}$, which is of no interest to us. Finally, we enforce a color symmetry, using a known mechanism from chess engines. Specifically, mirroring a position changes the sign of its score $y$, where mirroring reverses the rank (row) order; this is the image in a mirror placed behind the chess board. For every piece, let $\mat{C}_{\mathrm{piece}}$ be the $8\times 8$ matrix with elements $(\mat{C}_{\mathrm{piece}})_{i j} = (\mat{B}_{\mathrm{piece}}\mat{b}_{\mathrm{piece}})_{i + 8 (j - 1)}$, and denote by $\mat{M}(\mat{A})$ the matrix mirror operation which reverses the row order of a matrix. This notation allows us to enforce the symmetry, leading to the new approximate linear relation





\begin{align*}





y &= \tilde{a} + \sum_{\mathrm{piece}}\langle





\mat{C}_{\mathrm{piece}},





\ten{X}_{\mathrm{piece}}





\rangle + \epsilon \\





&\approx \tilde{a} + \sum_{\text{piece type}}\frac{1}{2}\langle





\mat{C}_{\text{white piece}} - \mat{M}(\mat{C}_{\text{black piece}}),
\ten{X}_{\text{white piece}} - \mat{M}(\ten{X}_{\text{black piece}})





\rangle + \epsilon





\end{align*}





If $\mat{C}_{\text{white piece}} = -\mat{M}(\mat{C}_{\text{black piece}})$ holds for every piece type ($6$ types, \emph{not} distinguishing between colors), then we have equality. In our case this is valid, as the estimates $\hat{\mat{C}}_{\mathrm{piece}}$ fulfill this property up to a small error. The $6$ matrices $(\mat{C}_{\text{white piece}} - \mat{M}(\mat{C}_{\text{black piece}})) / 2$ are called \emph{piece square tables} (PSQT) and are visualized in \cref{fig:psqt}. The interpretation of those tables is straightforward. A high positive value (blue) means that it is usually good to have a piece of the corresponding type on that square, while a high negative value (red) means the opposite. Note that the PSQTs are for quiet positions only, that is, all pieces are safe in the sense that there are no legal capturing moves, nor is the king in check.
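The mirror operation and the symmetrized tables can be sketched as follows; the matrices here are random stand-ins for $\mat{C}_{\mathrm{piece}}$, with the near-symmetry of the estimates simulated by construction.

```python
import numpy as np

rng = np.random.default_rng(2)

def mirror(A):
    """The mirror operation M(A): reverse the rank (row) order."""
    return A[::-1, :]

# Stand-in for a 64-dim product B_piece @ b_piece, reshaped
# column-major into the 8 x 8 matrix C_piece.
C_white = rng.normal(size=64).reshape((8, 8), order="F")
# Simulated estimate: mirror-antisymmetric up to a small error.
C_black = -mirror(C_white) + 0.01 * rng.normal(size=(8, 8))

# The piece square table for this piece type.
psqt = (C_white - mirror(C_black)) / 2
```

Under exact symmetry the averaging is redundant (`psqt` equals `C_white`); with estimated matrices it halves the asymmetry error.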










\begin{figure}[hp!]





\centering





\includegraphics[width = \textwidth]{plots/psqt.pdf}





\caption{\label{fig:psqt}Extracted PSQTs (piece square tables) from the chess example GMLM reduction.}





\end{figure}










The first visual effect in \cref{fig:psqt} is the dark blue PSQT of the Queen, followed by the somewhat lighter Rook PSQT. This indicates that the Queen, followed by the Rook, is the most valuable piece (after the King, but a King piece value makes no sense). The next two are the Knight and Bishop, which have a higher value than the Pawns, ignoring the $6$th and $7$th rank, where a pawn becomes a potential Queen. This is the classic piece value order known in chess.










Next, going one by one through the PSQTs, a few words about the preferred positions for every piece type. Pawn positions are especially good on the $6$th and particularly on the $7$th rank, as this threatens a promotion to a Queen (or Knight, Bishop, Rook). The Knight PSQT is a bit surprising; the most likely explanation for the Knight being good in enemy territory is that it got there by capturing an enemy piece for free, a common occurrence in low rated games, which make up a big chunk of the training data ranging over all levels. The Bishops seem to have no specific preferred placement, only a slightly higher overall value than Pawns, excluding Pawns close to promotion. Continuing with the Rooks, we see that the Rook is a good attacking piece, indicated by a safe Rook infiltration. The Queen is powerful almost everywhere; only the outer back rank squares (lower left and right) tend to reduce her value, since a Queen there is a sign of being pushed back by enemy pieces, leading to many squares being controlled by the enemy and hindering one's own movement. Finally, the King: given that the goal of the game is to checkmate the King, a safe position for the King is very valuable. This is seen by the back rank (rank $1$) containing the only non-penalized squares. Furthermore, the safest squares are the castling target squares ($g1$ and $c1$) as well as the $b1$ square. Shifting the King over to $b1$ is quite common, protecting the $a2$ pawn and providing a completely protected pawn shield in front of the King.










Due to the rules of chess, given any encoded legal position $\ten{X}$, there are some elements which are never set, namely the elements corresponding to pawns on the 1st and 8th rank. This implies that the matrix $\mat{T}_2$ in \eqref{eq:tstat} under model \eqref{eq:quadraticexpfam} can \emph{not} be the identity as, for simplicity, assumed in \cref{sec:ising_estimation}. The adaptation in this case is simple: we only need to enforce the corresponding diagonal elements in $\mat{\Omega}$ to be zero, which is a linear operation. The only difference in the estimation procedure is to use $\ten{G}_2$ in \cref{thm:grad} for gradient computations. This can be extended to incorporate additional constraints into the model derived from the game of chess. First, all interactions of pieces with pawns on the 1st or 8th rank are impossible. Next, there is always exactly one king per color on the board, and they cannot be on neighboring squares. Finally, only one piece can occupy a square. All those conditions can be written as linear constraints on $\mat{\Omega}$ by setting the corresponding entries in $\mat{\Omega}$ to zero.
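As a sketch, the first constraint (no pawns on the 1st or 8th rank) amounts to zeroing all rows and columns of $\mat{\Omega}$ indexed by those impossible states. The plane ordering and rank orientation below are our own assumptions for illustration.

```python
import numpy as np

def idx(i, j, k):
    """Linear index of square (i, j) in piece plane k, matching the
    column-major vectorization of the 8 x 8 x 12 tensor."""
    return i + 8 * j + 64 * k

# Impossible states: pawns (planes 0 = white, 6 = black, by our own
# convention) on the first (i = 0) and last (i = 7) board row.
impossible = [idx(i, j, k) for k in (0, 6) for j in range(8) for i in (0, 7)]

def constrain(Omega):
    """Zero every row and column of Omega touching an impossible state
    (a linear operation on Omega)."""
    Omega = Omega.copy()
    Omega[impossible, :] = 0.0
    Omega[:, impossible] = 0.0
    return Omega

Omega = constrain(np.ones((768, 768)))
```

The king-count and single-occupancy constraints mentioned above would add further index pairs to the same zeroing step.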




















%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%





\appendix




