Quantum Machine Learning Tutorial

A Hands-on Tutorial for Machine Learning Practitioners and Researchers

Chapter 3.3 Theoretical Foundations

The content of this section corresponds to Chapter 3.3 of our paper. Please refer to the original paper for more details.

Theoretical Foundations of Quantum Kernel Machines

In this section, we take a step further and explore the theoretical foundations of quantum kernels. Specifically, we focus on two crucial aspects: the expressivity and the generalization properties of quantum kernel machines. As shown in Figure 3.4, these two aspects are essential for understanding the potential advantages of quantum kernels over classical learning approaches as well as their inherent limitations. For ease of understanding, this section emphasizes the fundamental concepts necessary for evaluating the power and limitations of quantum kernels rather than exhaustively reviewing all theoretical results.

[Figure 3.4: Expressivity and generalization of quantum kernel machines.]

Expressivity of quantum kernel machines

Quantum kernels, as discussed before, are constructed by explicitly defining quantum feature mappings. In this context, the expressivity of quantum kernel machines refers to the types of functions that quantum feature mappings can approximate and the kinds of correlations that quantum kernels can effectively model.

Following the conventions of [@gil2024expressivity], we demonstrate that any kernel function can be approximated using finitely deep quantum circuits by showing that the associated feature mapping can also be approximated using quantum circuits. This conclusion rests on two key theoretical foundations: Mercer’s feature space construction and the universality of quantum circuits. Together, these principles establish the theoretical feasibility of realizing any kernel function as a quantum kernel.

It is important to note that if exact mathematical equality were required, Mercer’s construction would demand an infinite-dimensional Hilbert space, which in turn would require quantum computers with infinitely many qubits, an impractical scenario. However, in practical applications, we are more interested in approximating functions to a certain level of precision rather than achieving exact evaluations. This perspective makes it feasible to implement the corresponding quantum feature mappings using a finite number of qubits. The following theorem confirms that any kernel function can be approximated as a quantum kernel to arbitrary precision with finite computational resources (we defer the proof details to the end of this subsection).

Theorem 3.2. Let $k:\mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a kernel function. Then, for any $\varepsilon > 0$ there exist $N \in \mathbb{N}$ and a quantum feature mapping $\rho_N$ onto the Hilbert space of quantum states of $N$ qubits such that $$ |k(\bm{x},\bm{x}')-2^N \text{Tr}(\rho_N(\bm{x})\rho_N(\bm{x}')) + 1| < \varepsilon$$ for almost all $\bm{x},\bm{x}'\in \mathcal{X}$.

Theorem 3.2, instead of discussing the $\varepsilon$-approximation of quantum kernels in the form $|k(\bm{x},\bm{x}')- \text{Tr}(\rho_N(\bm{x})\rho_N(\bm{x}'))| < \varepsilon$, introduces additional multiplicative and additive factors, expressed as $|k(\bm{x},\bm{x}')-2^N \text{Tr}(\rho_N(\bm{x})\rho_N(\bm{x}')) + 1| < \varepsilon$. These additional factors, explained below, do not impede the universality of the theorem.
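To make the rescaled quantity in Theorem 3.2 concrete, the following minimal NumPy sketch evaluates $2^N \text{Tr}(\rho_N(\bm{x})\rho_N(\bm{x}')) - 1$ for a toy single-qubit angle embedding. The embedding is purely illustrative (it is not the feature mapping constructed in the proof); for this particular choice the rescaled overlap reduces to $\cos(2(x - x'))$.

```python
# Rescaled state overlap appearing in Theorem 3.2: 2^N Tr(rho(x) rho(x')) - 1.
# Toy single-qubit angle embedding, used only to illustrate the rescaling.
import numpy as np

N = 1  # number of qubits in this toy example


def rho(x):
    """Density matrix of the pure state cos(x)|0> + sin(x)|1>."""
    psi = np.array([np.cos(x), np.sin(x)])
    return np.outer(psi, psi.conj())


def rescaled_overlap(x, xp):
    """The quantity 2^N Tr(rho(x) rho(x')) - 1 that Theorem 3.2 matches to k(x, x')."""
    return (2 ** N) * np.trace(rho(x) @ rho(xp)).real - 1


x, xp = 0.3, 1.1
print(rescaled_overlap(x, xp))   # value that Theorem 3.2 compares against k(x, x')
print(np.cos(2 * (x - xp)))      # closed form for this particular embedding
```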

Moreover, the qualifier “for almost all $\bm{x},\bm{x}'\in \mathcal{X}$” signifies that the inequality is valid “except on sets of measure zero,” or equivalently “with probability 1.” In other words, while adversarial instances of $\bm{x},\bm{x}'\in \mathcal{X}$ may exist for which the inequality does not hold, such instances are so sparse that the probability of encountering them when sampling from the relevant probability distribution is zero.

Last, Theorem 3.2 establishes that any kernel function can be approximated as a quantum kernel up to a multiplicative and an additive factor using a finite number of qubits.

We remark that Theorem 3.2 does not aim to demonstrate any quantum advantage but rather establishes the ultimate expressivity of quantum kernels. The theorem guarantees the existence of a quantum kernel using a finite number of qubits but does not address how quickly the number of required qubits grows with increasing computational complexity of the kernel function $k$ or with decreasing approximation error $\varepsilon > 0$. The number of qubits $N$ will depend on certain properties of the kernel $k$ and the approximation error $\varepsilon$. For instance, if the required number of qubits scales exponentially with these parameters, Theorem 3.2 would have limited practical utility. Similarly, the time required to find such a quantum kernel approximation (independent of the memory and runtime requirements for preparing the feature vectors and computing their inner product) must also be considered.

Although Theorem 3.2 establishes that all kernel functions can be realized as quantum kernels, there may still exist kernel functions that cannot be realized efficiently as quantum kernels. This observation requires us to identify quantum kernels that can be computed efficiently on quantum computers, i.e., in polynomial time.

Generalization of quantum kernel machines

Generalization, which quantifies the ability of learning models (both classical and quantum) to predict unseen data, is a critical metric for evaluating the quality of a learning model. Due to its importance, here we analyze the potential advantages of quantum kernels in terms of the generalization ability.

To be comprehensive, in this section we first elucidate the generalization error bounds for general kernel machines, establishing a unified framework for a fair comparison between quantum kernels and classical kernels. Subsequently, we introduce a geometric metric to assess the potential quantum advantage of quantum kernels with respect to the generalization error for a fixed amount of training data.

Generalization error bound for kernel machines

We begin by reviewing the optimal learning models based on the specified kernel machines which could be either classical or quantum. Suppose we have obtained $n$ training examples $\{(\bm{x}^{(i)},y^{(i)})\}_{i=1}^n$ with $\bm{x}^{(i)}\in \mathbb{R}^d$ and $y^{(i)}=f(\bm{x}^{(i)}) \in \mathbb{R}$, where $f$ is the target function.

After training on this data, there exists a machine learning algorithm that outputs $h(\bm{x})=\bm{\omega}^{\dagger} \cdot \phi(\bm{x})$, where $\phi(\bm{x}) \in \mathbb{C}^D$ refers to the hidden feature map corresponding to the classical/quantum kernel function $k(\bm{x}^{(i)},\bm{x}^{(j)})= \bm{K}_{ij}=\phi(\bm{x}^{(i)}) \cdot \phi(\bm{x}^{(j)})$. More precisely, considering the mean square error as the loss function for such a task, we have

$$ \mathcal{L}(\bm{\omega}, \bm{x})= \lambda \bm{\omega}^{\dagger} \cdot \bm{\omega} + \sum_{i=1}^n \left( \bm{\omega}^{\dagger} \cdot \phi(\bm{x}^{(i)}) - y^{(i)}\right)^2,$$

where $\lambda \ge 0$ is the regularization parameter for avoiding over-fitting.
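As a small illustration, the following NumPy sketch evaluates this regularized loss for an explicit feature map. The target function, feature map, and all dimensions below are toy choices introduced only for illustration; they are not taken from the text.

```python
# Toy setup: regularized mean-square-error loss
#   L(w) = lambda * w^dag w + sum_i (w^dag phi(x_i) - y_i)^2
# for an illustrative random-feature map phi and target f.
import numpy as np

rng = np.random.default_rng(0)
n, d, D = 20, 3, 64                          # training size, input dim, feature dim
X = rng.uniform(-1, 1, size=(n, d))          # training inputs x^(i)
f = lambda x: np.sin(np.pi * x[..., 0])      # toy target function f
Y = f(X)                                     # labels y^(i) = f(x^(i))

W_feat = rng.normal(size=(d, D))             # fixes the (illustrative) feature map
phi = lambda x: np.cos(x @ W_feat) / np.sqrt(D)


def loss(w, lam=1e-3):
    """lambda * w^dag w + sum_i (w^dag phi(x_i) - y_i)^2, as in the text."""
    return lam * np.vdot(w, w).real + np.sum(np.abs(phi(X) @ w - Y) ** 2)
```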

Then the optimal parameters obtained by minimizing this loss function are $$ \bm{\omega}^* = \arg\min_{\bm{\omega}\in \Theta} \mathcal{L}(\bm{\omega}, \bm{x}).$$ As discussed in Chapter 1.1.2, the optimal solution $\bm{\omega}^*$ has the explicit form of

$$\begin{aligned} \bm{\omega}^* &= \bm{\Phi}^{\dagger} (\bm{K}+\lambda \mathbb{I})^{-1} \bm{Y} \\ &= \sum_{i=1}^n\sum_{j=1}^n \phi(\bm{x}^{(i)}) \left((\bm{K}+\lambda \mathbb{I})^{-1}\right)_{ij} y^{(j)}, \end{aligned}$$

where $\bm{Y}=[y^{(1)}, \ldots, y^{(n)}]^{\top}$ refers to the vector of labels, $K$ is the kernel matrix, and the second equality follows from $\bm{\Phi} = [\phi(\bm{x}^{(1)}),\cdots, \phi(\bm{x}^{(n)})]^{\dagger}$. Moreover, the norm of the optimal parameters has a simple form in the limit $\lambda \to 0$, i.e., $$\|\bm{\omega}^*\|_2^2= \bm{Y}^{\top} K^{-1} \bm{Y}.$$
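Continuing the toy setup above, the sketch below computes this closed-form solution through the dual coefficients $(\bm{K}+\lambda \mathbb{I})^{-1}\bm{Y}$, together with the kernel-only expression for the squared norm in the $\lambda \to 0$ limit.

```python
# Closed-form optimum w* = Phi^dag (K + lambda*I)^{-1} Y for the toy setup above.
lam = 1e-3
Phi = phi(X)                                      # rows are phi(x^(i))^dag (real here)
K = Phi @ Phi.conj().T                            # kernel matrix K_ij = phi(x_i) . phi(x_j)

alpha = np.linalg.solve(K + lam * np.eye(n), Y)   # dual coefficients (K + lambda*I)^{-1} Y
w_star = Phi.conj().T @ alpha                     # w* = Phi^dag (K + lambda*I)^{-1} Y

# Kernel-only expression for the squared norm in the limit lambda -> 0:
norm_sq = Y @ np.linalg.solve(K, Y)               # Y^T K^{-1} Y
```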

We now turn to the prediction error of these learning models, i.e.,

$$\epsilon_{\bm{\omega}^*}(\bm{x}) = \left| f(\bm{x}) - (\bm{\omega}^*)^{\dagger} \cdot \bm{\phi}(\bm{x}) \right|,$$

which is uniquely determined by the kernel matrix $K$ and the hyper-parameter $\lambda$. In particular, we will focus on the upper bound on the expected prediction error, which is obtained by summing the training error and the generalization error.
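Still within the toy setup above, the prediction $(\bm{\omega}^*)^{\dagger}\cdot\phi(\bm{x})$ on unseen inputs can be evaluated either with the explicit feature map or purely through the kernel trick; both give identical values, and the absolute deviation from the target defines $\epsilon_{\bm{\omega}^*}(\bm{x})$.

```python
# Prediction error eps(x) = |f(x) - w*^dag phi(x)| on fresh inputs, for the toy model above.
X_test = rng.uniform(-1, 1, size=(5, d))

pred_feature = phi(X_test) @ w_star                   # w*^dag . phi(x)
pred_kernel = (phi(X_test) @ Phi.conj().T) @ alpha    # kernel-trick form, identical values

eps_test = np.abs(f(X_test) - pred_feature)           # eps_{w*}(x) on the test points
```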

We now sketch the derivations of the upper bounds on the training error and the generalization error separately, presenting only the steps necessary for a clear exposition; the omitted details can be found in @huang2021power.

$\bullet$ Training error. Employing the convexity of the square function and Jensen’s inequality, the training error satisfies $$\begin{aligned} \frac{1}{n} \sum_{i=1}^n \epsilon_{\bm{\omega}^*}(\bm{x}^{(i)}) \le \sqrt{\frac{1}{n} \sum_{i=1}^n \left( (\bm{\omega}^*)^{\dagger} \cdot \phi(\bm{x}^{(i)}) - y^{(i)} \right)^2 }. \end{aligned}$$ Moreover, combining this with the expression for the optimal $\bm{\omega}^*$, we obtain an upper bound on the training error in terms of the kernel matrix $K$ and the hyper-parameter $\lambda$, i.e., $$\frac{1}{n} \sum_{i=1}^n \epsilon_{\bm{\omega}^*}(\bm{x}^{(i)}) \le \sqrt{\frac{\lambda^2 \bm{Y}^{\top} (K+\lambda \mathbb{I})^{-2}\bm{Y}}{n} }.$$ We can see that when $\lambda=0$ and $K$ is invertible, the training error is zero. However, the hyper-parameter is usually set as $\lambda > 0$ in practice. (A numerical check of this bound, together with the generalization bound below, is sketched after this list.)

$\bullet$ Generalization error. The upper bound of generalization error in terms of the kernel matrix is given as,

[Equation: upper bound on the generalization error expressed in terms of the kernel matrix $K$ and the hyper-parameter $\lambda$; see @huang2021power for the explicit form.]

For the case of $\lambda=0$, the first term in the generalization error bound has a simple form of $5\cdot \bm{Y}^{\top} K^{-1} \bm{Y}/n$.
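As a sanity check under the toy setup introduced earlier, the sketch below verifies the training-error bound numerically and evaluates the $\lambda=0$ leading term of the generalization bound, $5\cdot \bm{Y}^{\top} K^{-1} \bm{Y}/n$ (the quantity $\bm{Y}^{\top} K^{-1} \bm{Y}$ is often referred to as the model complexity in @huang2021power).

```python
# Numerical check of the training-error bound and the lambda = 0 term of the
# generalization bound, using the toy model from the previous sketches.
resid = Phi @ w_star - Y                          # w*^dag phi(x_i) - y_i on the training set
train_err = np.mean(np.abs(resid))                # (1/n) sum_i eps_{w*}(x_i)

M = np.linalg.inv(K + lam * np.eye(n))
train_bound = np.sqrt(lam ** 2 * (Y @ M @ M @ Y) / n)
assert train_err <= train_bound + 1e-12           # guaranteed by Jensen's inequality

s_K = Y @ np.linalg.solve(K, Y)                   # model complexity Y^T K^{-1} Y
gen_first_term = 5 * s_K / n                      # lambda = 0 leading term of the bound
```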

Combining the upper bounds on the training error and the generalization error directly yields an upper bound on the prediction error of the learning model for a given kernel matrix. We summarize these three errors below.

[Summary of the training error, generalization error, and prediction error of kernel machines.]