The \emph{mixed-product} property states that for matrices $A, B, C, D$ such that the products $A C$ and $B D$ are well defined,
\begin{displaymath}
(A\otimes B)(C \otimes D) = (A C) \otimes (B D).
\end{displaymath}
In combination with the \emph{Hadamard product} (element-wise multiplication), for matrices $A, C$ of the same size and $B, D$ of the same size, it holds that
\begin{displaymath}
(A\otimes B)\circ (C \otimes D) = (A \circ C) \otimes (B \circ D).
\end{displaymath}
The \emph{transpose} of the Kronecker product fulfills
\begin{displaymath}
(A\otimes B)^T = A^T \otimes B^T.
\end{displaymath}
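The following NumPy sketch numerically checks the three identities above on randomly generated matrices; the dimensions are arbitrary and chosen only so that the required products are well defined.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
A, C = rng.normal(size=(3, 4)), rng.normal(size=(4, 5))   # A C is well defined
B, D = rng.normal(size=(2, 6)), rng.normal(size=(6, 3))   # B D is well defined

# mixed-product property: (A (x) B)(C (x) D) = (A C) (x) (B D)
assert np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D))

# compatibility with the Hadamard product (A, C and B, D of equal size)
A2, C2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
B2, D2 = rng.normal(size=(2, 5)), rng.normal(size=(2, 5))
assert np.allclose(np.kron(A2, B2) * np.kron(C2, D2), np.kron(A2 * C2, B2 * D2))

# transpose: (A (x) B)^T = A^T (x) B^T
assert np.allclose(np.kron(A, B).T, np.kron(A.T, B.T))
\end{verbatim}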
\section{Distance Computation}
The pair-wise distances $d_V(X_{i,:}, X_{j,:})$ are arranged in the distance matrix $D\in\mathbb{R}^{n\times n}$ with entries $D_{i,j} = d_V(X_{i,:}, X_{j,:})$.
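The explicit form of $d_V$ is not repeated here. As an illustration, a minimal NumPy sketch assuming the projection-type distance $d_V(x, y) = \lVert x - y\rVert^2 - \lVert V^T(x - y)\rVert^2$ (an assumption, chosen because it is consistent with the gradient $\nabla_V d_V$ used in the next section) computes $D$ as follows.
\begin{verbatim}
import numpy as np

def distance_matrix(X, V):
    """Pairwise distances D[i, j] = d_V(X[i, :], X[j, :]) (sketch).

    Assumes d_V(x, y) = ||x - y||^2 - ||V^T (x - y)||^2, i.e. the squared
    distance of x - y from its projection onto the span of V."""
    diff = X[:, None, :] - X[None, :, :]      # (n, n, p) pairwise differences
    full = np.sum(diff ** 2, axis=-1)         # ||x_i - x_j||^2
    proj = np.sum((diff @ V) ** 2, axis=-1)   # ||V^T (x_i - x_j)||^2
    return full - proj                        # (n, n) with zero main diagonal
\end{verbatim}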
Next, the weighted moments $\bar{y}^{(m)}$ and the ``element-wise'' loss $l_i = L_n(V, X_i)$ are computed as
\begin{displaymath}
\bar{y}^{(m)} = W^T Y^m,\qquad l = \bar{y}^{(2)} - (\bar{y}^{(1)})^2
\end{displaymath}
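In code both moments are a single matrix product each. The construction of the weight matrix $W$ is not restated here; the sketch below forms it from Gaussian kernel weights $w_{ij} \propto \exp(-d_{ij}/(2h^2))$ with columns normalized to sum to one, which is purely an illustrative assumption, and reads $Y^m$ as the element-wise $m$-th power.
\begin{verbatim}
import numpy as np

def moments_and_loss(D, Y, h):
    """Weighted moments ybar^(m) = W^T Y^m and element-wise loss l (sketch)."""
    W = np.exp(-D / (2.0 * h ** 2))       # illustrative kernel weights (assumption)
    W /= W.sum(axis=0, keepdims=True)     # columns sum to one (assumption)
    ybar1 = W.T @ Y                       # ybar^(1) = W^T Y
    ybar2 = W.T @ Y ** 2                  # ybar^(2) = W^T Y^2 (element-wise square)
    l = ybar2 - ybar1 ** 2                # l = ybar^(2) - (ybar^(1))^2
    return ybar1, ybar2, l
\end{verbatim}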
\section{Gradient Computation}
The model under consideration is
\begin{displaymath}
Y = g(B^T X) + \epsilon.
\end{displaymath}
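For concreteness, a small simulation sketch of data from this model; the link $g$, the matrix $B$ and the noise level are illustrative choices, not prescribed by the text.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5                                        # sample size, number of predictors
B = np.array([[1.0], [0.5], [0.0], [0.0], [0.0]])    # illustrative p x 1 reduction
X = rng.normal(size=(n, p))                          # rows X_i are the samples
eps = 0.1 * rng.normal(size=n)                       # additive noise (illustrative scale)
Y = np.cos(X @ B).ravel() + eps                      # Y = g(B^T X) + eps, g = cos (illustrative)
\end{verbatim}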
Assume a data set $(X_i, Y_i)$ for $i = 1, \ldots, n$ with $X$ an $n\times p$ matrix such that each row represents one sample. Now let $l_i = L_n(V, X_i)$ and $\bar{y}^{(1)}_j = (W^T Y)_j$, and let $d_{i j}, w_{i j}$ denote the entries of the distance and weight matrices. Then the gradient of the ``simple'' CVE method is given as
This representation is cumbersome, and a direct implementation has an asymptotic run-time of $\Theta(n^2p^2)$: the double sum over the samples is quadratic in $n$, and each term involves the $p\times p$ outer product appearing in $\nabla_V d_V$.
This can be optimized and written in matrix notation. First, the distance gradient is given as
\begin{displaymath}
\nabla_V d_V(X_{i,:}, X_{j,:}) = -2(X_{i,:} - X_{j,:})^T(X_{i,:} - X_{j,:}) V.
\end{displaymath}
The relation $\vec(A)_k = a_{i,j}$ holds for $k = nj + i$ with $0\leq k < nm$, $0\leq i < n$ and $0\leq j < m$. This operation is a bijection. When going ``backwards'' the dimension of the original space is required; therefore let $\devec_n$ be the operation such that $\devec_n(\vec(A)) = A$ for $A\in\mathbb{R}^{n\times m}$.\footnote{Note that for $B\in\mathbb{R}^{p\times q}$ with $pq = nm$ we have $\devec_n(\vec(B))\in\mathbb{R}^{n\times m}$.}
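A direct NumPy sketch of $\vec$ and $\devec_n$ using zero-based, column-major indexing as above:
\begin{verbatim}
import numpy as np

def vec(A):
    """Column-major vectorization: vec(A)[n*j + i] = A[i, j] for A with n rows."""
    return np.asarray(A).reshape(-1, order="F")

def devec(v, n):
    """Inverse of vec given the row count n, i.e. devec_n(vec(A)) = A."""
    return np.asarray(v).reshape(n, -1, order="F")
\end{verbatim}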
For a symmetric matrix $A = A^T\in\mathbb{R}^{n\times n}$ the information $a_{i,j} = a_{j,i}$ is stored twice. To remove this redundancy the \emph{symmetric vectorization} is defined, mapping a symmetric $n\times n$ matrix to $\mathbb{R}^{n(n+1)/2}$ by saving the main diagonal and the lower triangular part column by column according to the scheme
\begin{displaymath}
\svec(A)_k = a_{i,j}\quad\text{for}\quad k = n j + i - \frac{j(j + 1)}{2},\ 0\leq j \leq i < n.
\end{displaymath}
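A sketch of the symmetric vectorization (and its inverse) under this column-major packing:
\begin{verbatim}
import numpy as np

def svec(A):
    """Pack the lower triangle (including the diagonal) of a symmetric A column
    by column: svec(A)[n*j + i - j*(j+1)//2] = A[i, j] for 0 <= j <= i < n."""
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])

def unsvec(v, n):
    """Rebuild the full symmetric matrix from its packed lower triangle."""
    A = np.zeros((n, n))
    k = 0
    for j in range(n):
        A[j:, j] = v[k:k + n - j]
        k += n - j
    return A + np.tril(A, -1).T           # mirror the strictly lower part upwards
\end{verbatim}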
The matrix $S$ is \underline{not} symmetric, but we can consider the symmetrization $S + S^T$, which has a zero main diagonal because $D$ has a zero main diagonal, meaning $s_{i i} = 0$ since $d_{i i} = 0$ for each $i$. Therefore the following holds, due to the fact that $\nabla_V d_V(X_{i,:}, X_{j,:}) = \nabla_V d_V(X_{j,:}, X_{i,:})$.
Note the summation indices $0\leq j \leq i < n$. Substituting $\nabla_V d_V(X_{i,:}, X_{j,:}) = -2(X_{i,:}- X_{j,:})^T(X_{i,:}- X_{j,:}) V$ yields
Let $X_{-}$ be the matrix containing the pairwise differences $X_{i,:} - X_{j,:}$ as rows, using the same row indexing scheme as the symmetric vectorization,
\begin{displaymath}
(X_{-})_{k,:} = X_{i,:} - X_{j,:}\quad\text{for}\quad k = n j + i - \frac{j(j + 1)}{2},\ 0\leq j \leq i < n.
\end{displaymath}
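A sketch constructing $X_{-}$ with exactly this packing (note that the rows with $i = j$ are zero rows, mirroring the zero main diagonal kept by $\svec$):
\begin{verbatim}
import numpy as np

def pairwise_differences(X):
    """Rows (X_-)[k, :] = X[i, :] - X[j, :] for 0 <= j <= i < n, packed with
    k = n*j + i - j*(j+1)//2, i.e. the same ordering as svec."""
    n = X.shape[0]
    return np.concatenate([X[j:, :] - X[j, :] for j in range(n)], axis=0)
\end{verbatim}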
With the $X_{-}$ matrix the above double sum can be written in matrix notation as follows\footnote{This is only valid because $s_{i i} = 0$.}
\begin{displaymath}
\nabla_V L_n(V) = -\frac{2}{nh^2} X_{-}^T(\svec(\sym(S)) \circ_r X_{-}) V
\end{displaymath}
where $\circ_r$ denotes the ``recycled'' Hadamard product: for a vector $x\in\mathbb{R}^n$ and a matrix $M\in\mathbb{R}^{n\times m}$ this is simply the element-wise multiplication of each column of $M$ with $x$, or equivalently $x\circ_r M = \underbrace{(x, x, \ldots, x)}_{n\times m}\circ M$ where $\circ$ is the element-wise product.
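Putting the pieces together, the matrix form of the gradient needs no explicit double loop. The sketch below reads $\sym(S)$ as $S + S^T$ (as in the symmetrization above), takes $S$ as given since its definition is not restated here, and implements $\circ_r$ as NumPy broadcasting.
\begin{verbatim}
import numpy as np

def cve_gradient(X_minus, S, V, h):
    """Sketch of nabla_V L_n(V) = -2/(n h^2) X_-^T (svec(S + S^T) circ_r X_-) V."""
    n = S.shape[0]
    T = S + S.T                                        # sym(S), zero main diagonal
    s = np.concatenate([T[j:, j] for j in range(n)])   # svec(sym(S)), length n(n+1)/2
    # circ_r: multiply every column of X_- element-wise with s (broadcasting)
    return -2.0 / (n * h ** 2) * X_minus.T @ (s[:, None] * X_minus) @ V
\end{verbatim}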