effie changes (abstract / intro)

Daniel Kapla 2024-03-28 14:19:24 +01:00
parent 0cb7772132
commit 8fd80522f0
2 changed files with 2119 additions and 802 deletions

File diff suppressed because it is too large

@ -272,10 +272,9 @@
\maketitle
\begin{abstract}
We consider regression or classification problems where the independent variable is matrix- or tensor-valued. We derive a multi-linear sufficient reduction for the regression or classification problem modeling the conditional distribution of the predictors given the response as a member of the quadratic exponential family. Using manifold theory, we prove the consistency and asymptotic normality of the sufficient reduction. We develop estimation procedures of sufficient reductions for both continuous and binary tensor-valued predictors. For continuous predictors, the algorithm is highly computationally efficient and is also applicable to situations where the dimension of the reduction exceeds the sample size. We demonstrate the superior performance of our approach in simulations and real-world data examples for both continuous and binary tensor-valued predictors. The \textit{Chess data} analysis results agree with a human player's understanding of the game and confirm the relevance of our approach.
\end{abstract}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@ -286,6 +285,16 @@ In Statistics, tensors are a mathematical tool to represent data of complex stru
Complex data are collected at different times and/or under several conditions, often involving a large number of multi-indexed variables represented as tensor-valued data \parencite{KoldaBader2009}. They occur in large-scale longitudinal studies \parencite[e.g.][]{Hoff2015}, in agricultural experiments, and in chemometrics and spectroscopy \parencite[e.g.][]{LeurgansRoss1992,Burdick1995}. They also arise in signal and video processing, where sensors produce multi-indexed data, e.g. over spatial, frequency, and temporal dimensions \parencite[e.g.][]{DeLathauwerCastaing2007,KofidisRegalia2005}, and in telecommunications \parencite[e.g.][]{DeAlmeidaEtAl2007}. Other examples of multiway data include 3D images of the brain, where the modes are the 3 spatial dimensions, and spatio-temporal weather imaging data, a set of image sequences represented as 2 spatial modes and 1 temporal mode.
\begin{itemize}
\item Review \cite{ZhouLiZhu2013} and see how you compare with them. They focus on the forward regression model with a scalar response but they claim that "Exploiting the array structure in imaging data, the new method substantially reduces the dimensionality of imaging data, which leads to efficient estimation and prediction."
\item Read \cite{ZhouEtAl2023} to figure out the distribution they use for the tensor-valued predictors and briefly describe what they do.
\item Read \cite{RabusseauKadri2016} to figure out what they do. They seem to draw both the response and the predictors from tensor-normal with iid N(0,1) entries: "In order to leverage the tensor structure of the output data, we formulate the problem as the minimization of a least squares criterion subject to a multilinear rank constraint on the regression tensor. The rank constraint enforces the model to capture low-rank structure in the outputs and to explain dependencies between inputs and outputs in a low-dimensional multilinear subspace."
\end{itemize}
Tensor regression models have been proposed to exploit the special structure of tensor covariates, e.g. \cite{HaoEtAl2021,ZhouLiZhu2013}, or tensor responses \cite{RabusseauKadri2016,LiZhang2017,ZhouEtAl2023}. \textcite{HaoEtAl2021} modeled a scalar response as a flexible nonparametric function of tensor covariates. \textcite{ZhouLiZhu2013} assume the scalar response has a distribution in the exponential family given the tensor-valued predictors and model the link function as a multilinear function of the predictors. \textcite{LiZhang2017} model the tensor-valued response as tensor normal. Rather than using $L_1$-type penalty functions to induce sparsity, they employ the envelope method (Cook, Li, and Chiaromonte, 2010) to estimate the unknown regression coefficient. Moreover, the envelope method essentially identifies and uses the material information jointly. They develop an estimation algorithm and study the asymptotic properties of the estimator. These models try to utilize the sparse and low-rank structures in the tensors, either in the regression coefficient tensor or the response tensor, to boost performance on the regression task by reducing the number of free parameters.
Multilinear tensor normal models have been used in various applications, including medical imaging \parencite{BasserPajevic2007,DrydenEtAl2009}, spatio-temporal data analysis \parencite{GreenewaldHero2014}, regression analysis for longitudinal relational data \parencite{Hoff2015}. One of the most important uses of the multilinear normal (MLN) distribution, and hence tensor analysis, is perhaps in magnetic resonance imaging (MRI) \parencite{OhlsonEtAl2013}. A recent survey \parencite{WangEtAl2022} and references therein contain more information and potential applications of multilinear tensor normal models.
In this paper we present a model-based \emph{Sufficient Dimension Reduction} (SDR) method for tensor-valued data with distribution in the quadratic exponential family assuming a Kronecker product structure of the first and second moment. By generalizing the parameter space to embedded manifolds we obtain consistency and asymptotic normality results while allowing great modeling flexibility in the linear sufficient dimension reduction.
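As a simple illustration of the effect of the Kronecker assumption (a sketch only, not the exact parametrization developed below): if a matrix-valued predictor $\ten{X}\in\mathbb{R}^{p_1\times p_2}$ is conditionally matrix normal given $Y$ with mode covariance matrices $\mat{\Sigma}_1\in\mathbb{R}^{p_1\times p_1}$ and $\mat{\Sigma}_2\in\mathbb{R}^{p_2\times p_2}$, then
\begin{displaymath}
\cov(\vec\ten{X}\mid Y) = \mat{\Sigma}_2\otimes\mat{\Sigma}_1,
\end{displaymath}
so the $p_1 p_2\times p_1 p_2$ conditional covariance is described by $p_1(p_1 + 1)/2 + p_2(p_2 + 1)/2$ parameters (minus one for the scale indeterminacy between $\mat{\Sigma}_1$ and $\mat{\Sigma}_2$) instead of the unrestricted $p_1 p_2(p_1 p_2 + 1)/2$.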
@ -885,7 +894,7 @@ for every non-empty compact $K\subset\Xi$. Then, there exists a strong M-est
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Simulations}\label{sec:simulations}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this section we report simulation results for the tensor normal and the Ising model where different aspects of the GMLM model are compared against other methods. The comparison methods are Tensor Sliced Inverse Regression (TSIR) \parencite{DingCook2015}, Multiway Generalized Canonical Correlation Analysis (MGCCA) \parencite{ChenEtAl2021,GirkaEtAl2024} and the Tucker decomposition that is a higher-order form of principal component analysis (HOPCA) \textcite{KoldaBader2009}, for both continuous and binary data. For the latter, the binary values are treated as continuous. As part of our baseline analysis, we also incorporate traditional Principal Component Analysis (PCA) on vectorized observations. In the case of the Ising model, we also compare with LPCA (Logistic PCA) and CLPCA (Convex Logistic PCA), both introduced in \textcite{LandgrafLee2020}. All experiments are performed at sample sizes $n = 100, 200, 300, 500$ and $750$. Every experiment is repeated $100$ times.
We are interested in the quality of the estimate of the true sufficient reduction $\ten{R}(\ten{X})$ from \cref{thm:sdr}. Therefore, we compare with the true vectorized reduction matrix $\mat{B} = \bigkron_{k = r}^{1}\mat{\beta}_k$, as it is compatible with any linear reduction method. The distance $d(\mat{B}, \hat{\mat{B}})$ between $\mat{B}\in\mathbb{R}^{p\times q}$ and an estimate $\hat{\mat{B}}\in\mathbb{R}^{p\times \tilde{q}}$ is the \emph{subspace distance} which is proportional to
\begin{displaymath}
@ -930,7 +939,7 @@ The final tensor normal experiment 1e) is a misspecified model to explore the ro
\end{figure}
The results are visualized in \cref{fig:sim-normal}. In simulation 1a), given a 1D linear relation between the response $Y$ and $\ten{X}$, TSIR and GMLM are equivalent. This is expected as \textcite{DingCook2015} already established that TSIR gives the MLE estimate under a tensor (matrix) normal distribution. For the other methods, MGCCA is only a bit better than PCA which, unexpectedly, beats HOPCA. But none of them are close to the performance of TSIR or GMLM. Continuing with 1b), where we introduced a cubic relation between $Y$ and $\ten{X}$, we observe a bigger deviation in the performance of GMLM and TSIR. This is mainly because we are now estimating an $8$-dimensional subspace, which amplifies the small performance boost, in terms of the subspace distance, that we gain by avoiding slicing. The GMLM model in 1c) behaves as expected, clearly being the best. The other results are surprising. First, PCA, HOPCA and MGCCA are visually indistinguishable. This is explained by a high signal-to-noise ratio in this particular example. But the biggest surprise is the failure of TSIR, all the more so because the conditional distribution $\ten{X}\mid Y$ is tensor normal, which, in conjunction with $\cov(\vec\ten{X})$ having a Kronecker structure, should give the MLE estimate. The low-rank assumption is also not an issue, as TSIR here estimates a 1D linear reduction, which fulfills all the requirements. Finally, a commonly known issue of slicing, as used in TSIR, is that conditional multi-modal distributions can cause estimation problems because the different distribution modes lead to vanishing slice means. Again, this is not the case in simulation 1c).
An investigation into this behaviour revealed the problem in the estimation of the mode covariance matrices $\mat{O}_k = \E[(\ten{X} - \E\ten{X})_{(k)}\t{(\ten{X} - \E\ten{X})_{(k)}}]$. The mode-wise reductions provided by TSIR are computed as $\hat{\mat{O}}_k^{-1}\hat{\mat{\Gamma}}_k$, where the poor estimation of $\hat{\mat{O}}_k$ causes the failure of TSIR. The poor estimate of $\mat{O}_k$ is rooted in the high signal-to-noise ratio in this particular simulation. GMLM does not have degenerate behaviour for high signal-to-noise ratios but it is less robust in low signal-to-noise ratio settings, where TSIR performs better in this specific example.
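For concreteness, a minimal numerical sketch (not the TSIR implementation) of how the mode covariance matrices $\mat{O}_k$ can be estimated by averaging mode-$k$ matricizations of the centered observations:
\begin{verbatim}
import numpy as np

def mode_k_covariances(X):
    # X stacks n observations of an order-r tensor: shape (n, p_1, ..., p_r).
    # Returns sample estimates of O_k = E[(X - EX)_(k) (X - EX)_(k)^T].
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                  # center over the sample
    covs = []
    for k in range(1, X.ndim):               # tensor modes 1, ..., r
        Xk = np.moveaxis(Xc, k, 1).reshape(n, X.shape[k], -1)  # mode-k matricization
        covs.append(np.einsum('nij,nkj->ik', Xk, Xk) / n)
    return covs

# toy usage: n = 100 matrix-valued observations of size 5 x 7
O1, O2 = mode_k_covariances(np.random.default_rng(0).standard_normal((100, 5, 7)))
print(O1.shape, O2.shape)                    # (5, 5) (7, 7)
\end{verbatim}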
Simulation 1d), incorporating information about the covariance structure, behaves similarly to 1b), except that GMLM gains a statistically significant lead in estimation performance. In the last simulation, 1e), the model is misspecified for GMLM. GMLM, TSIR as well as MGCCA are on par, with GMLM having a slight lead in the small sample size setting and MGCCA overtaking in larger sample size scenarios. The PCA and HOPCA methods are both still outperformed; a wrong assumption about the relation to the response is still better than no relation at all.
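The subspace distance used to score all methods above can be computed directly from the column spaces of $\mat{B}$ and $\hat{\mat{B}}$; a minimal sketch, assuming the common projection-difference form $\|\mat{P}_{\mat{B}} - \mat{P}_{\hat{\mat{B}}}\|_F$ (the exact normalization used in the paper may differ):
\begin{verbatim}
import numpy as np

def projection(M):
    # orthogonal projection onto span(M); assumes M has full column rank
    Q, _ = np.linalg.qr(M)
    return Q @ Q.T

def subspace_distance(B, B_hat):
    # Frobenius norm of the difference of the two projections
    return np.linalg.norm(projection(B) - projection(B_hat), ord='fro')

# toy usage: any two bases of the same subspace have (numerically) zero distance
B = np.random.default_rng(1).standard_normal((12, 3))
print(subspace_distance(B, B @ np.diag([2.0, -1.0, 0.5])))   # ~0
\end{verbatim}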
@ -1056,7 +1065,7 @@ In \cref{tab:eeg} we provide the AUC and its standard deviation. For all applied
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Chess}\label{sec:chess}
The data set is provided by the \citetitle{lichess-database}\footnote{\fullcite{lichess-database}}. We randomly selected the November 2023 data, which consist of more than $92$ million games. We removed all games without position evaluations. The evaluations, also denoted as scores, are from Stockfish\footnote{\fullcite{stockfish}}, a free and strong chess engine. The scores take the role of the response $Y$ and correspond to a winning probability from the white pieces' point of view. Positive scores are good for white and negative scores indicate an advantage for black. We ignore all highly unbalanced positions, which we set to be positions with an absolute score above $5$. We also remove all positions with a mate score (one side can force checkmate). Furthermore, we only consider positions after $10$ half-moves to avoid oversampling the beginning of the most common openings, including the start position which is in every game. Finally, we only consider positions with white to move. This leads to a final data set of roughly $64$ million positions, including duplicates.
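A rough sketch of the position filter just described (the field names and the pawn-unit score scale are illustrative assumptions, not the exact preprocessing code):
\begin{verbatim}
def keep_position(score, is_mate, half_move, white_to_move):
    # mirrors the selection rules described above
    if is_mate:                 # one side can force checkmate
        return False
    if abs(score) > 5.0:        # highly unbalanced position
        return False
    if half_move <= 10:         # avoid oversampling common openings
        return False
    return white_to_move        # keep only positions with white to move

print(keep_position(score=0.8, is_mate=False, half_move=23, white_to_move=True))  # True
print(keep_position(score=7.2, is_mate=False, half_move=23, white_to_move=True))  # False
\end{verbatim}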
A chess position is encoded as a set of $12$ binary matrices $\ten{X}_{\mathrm{piece}}$ of dimensions $8\times 8$. Every binary matrix encodes the positioning of a particular piece by containing a $1$ if the piece is present at the corresponding board position. The $12$ pieces derive from the $6$ types of pieces, namely pawns (\pawn), knights (\knight), bishops (\bishop), rooks (\rook), queens (\queen) and kings (\king) of two colors, black and white. See \cref{fig:fen2tensor} for a visualization.
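For illustration, a minimal sketch of this encoding starting from the piece-placement field of a FEN string (an assumed input format; the actual pipeline may differ):
\begin{verbatim}
import numpy as np

PIECES = "PNBRQKpnbrqk"   # one 8 x 8 channel per piece type and color

def fen_board_to_tensor(placement):
    # encode the FEN piece-placement field as a 12 x 8 x 8 binary tensor
    X = np.zeros((12, 8, 8), dtype=np.uint8)
    for rank, row in enumerate(placement.split("/")):   # ranks 8 down to 1
        file = 0
        for ch in row:
            if ch.isdigit():
                file += int(ch)                          # run of empty squares
            else:
                X[PIECES.index(ch), rank, file] = 1
                file += 1
    return X

# toy usage: the initial position has 8 white pawns on a single rank
X = fen_board_to_tensor("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR")
print(X.shape, int(X[PIECES.index("P")].sum()))          # (12, 8, 8) 8
\end{verbatim}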
@ -1119,10 +1128,13 @@ If for every piece type ($6$ types, \emph{not} distinguishing between color) hol
\caption{\label{fig:psqt}Extracted PSQTs (piece square tables) from the chess example GMLM reduction.}
\end{figure}
The first visual effect in \cref{fig:psqt} is the dark blue PSQT of the Queen followed by a not-so-dark Rook PSQT. This indicates that the Queen, followed by the Rook, are the most valuable pieces (after the king, which is technically the most valuable piece, although assigning a material value to the king makes no sense). The next two are the Knight and Bishop, which have a higher value than the Pawns, ignoring the $6$th and $7$th ranks as pawns there are potential queens. This is the classic piece value order known in chess.
Next, going over the PSQTs one by one, a few words about the preferred positions for every piece type. The pawn positions are specifically good on the $6$th and especially on the $7$th rank as this threatens a promotion to a Queen (or Knight, Bishop, Rook). The Knight PSQT is a bit surprising; the most likely explanation for the knight being valued in enemy territory is that it got there by capturing an enemy piece for free, a common occurrence in low-rated games, which make up a big chunk of the training data ranging over all skill levels. The Bishops seem to have no specific preferred placement, only a slightly higher overall value than pawns, excluding pawns close to promotion. Continuing with the rooks, we see that the rook is a good attacking piece, indicated by safe rook infiltration.\footnote{Rook infiltration is a strategic concept in chess that involves skillfully maneuvering a rook to penetrate deep into the opponent's territory.} The Queen is powerful almost everywhere, only the outer back rank squares (lower left and right) tend to reduce her value. This is rooted in the queen's presence there being a sign that she is being chased by enemy pieces, leading to many squares being controlled by the enemy and hindering one's own movement. Finally, given that the goal of the game is to checkmate the king, a safe position for the king is very valuable. This is seen by the back rank (rank $1$) containing the only non-penalized squares. Furthermore, the safest squares are the castling\footnote{Castling is a maneuver that combines king safety with rook activation.} target squares ($g1$ and $c1$) as well as the $b1$ square. Shifting the king over to $b1$ is quite common to protect the $a2$ pawn so that the entire pawn shield in front of the king is protected.
The results of our analysis in the previous paragraph agree with the configuration of the chess board most associated with observed chess game outcomes. This arrangement also aligns with human chess players' understanding of an average configuration at any moment during the game.
\section{Discussion}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%