Generalization in quantum machine learning from few training data | Panda Anku


Let us first outline our theoretical framework. We consider a quantum machine learning model (QMLM) as being a parameterized quantum channel, i.e., a completely positive trace preserving (CPTP) map that is parameterized. We denote a QMLM as ({{{{{{{{mathcal{E}}}}}}}}}_{{{{{{{{boldsymbol{alpha }}}}}}}}}^{{{{{{{{rm{QMLM}}}}}}}}}(cdot )) where α = (θ, k) denotes the set of parameters, including continuous parameters θ inside gates, as well as discrete parameters k that allow the gate structure to vary. We make no further assumptions on the form of the dependence of the CPTP map ({{{{{{{{mathcal{E}}}}}}}}}_{{{{{{{{boldsymbol{alpha }}}}}}}}}^{{{{{{{{rm{QMLM}}}}}}}}}(cdot )) on the parameters α. During the training process, one would optimize the continuous parameters θ and potentially also the structure k of the QMLM.

A QMLM takes input data in the form of quantum states. For classical data x, the input is first encoded in a quantum state via a map xρ(x). This allows the data to be either classical or quantum in nature, since regardless it is eventually encoded in a quantum state. We assume that the data encoding is fixed in advance and not optimized over. We remark here that our results also apply for more general encoding strategies involving data re-uploading55, as we explain in Supplementary Note 3.

For the sake of generality, we allow the QMLM to act on a subsystem of the state ρ(x). Hence, the output state can be written as (({{{{{{{{mathcal{E}}}}}}}}}_{{{{{{{{boldsymbol{alpha }}}}}}}}}^{{{{{{{{rm{QMLM}}}}}}}}}otimes {{{{{{mathrm{id}}}}}}})(rho (x))). For a given data point (xi, yi), we can write the loss function as

$$ell ({{{{{{{boldsymbol{alpha }}}}}}}};{x}_{i},{y}_{i})={{{{{{{rm{Tr}}}}}}}}left[{O}_{{x}_{i},{y}_{i}}^{{{{{{{{rm{loss}}}}}}}}}left({{{{{{{{mathcal{E}}}}}}}}}_{{{{{{{{boldsymbol{alpha }}}}}}}}}^{{{{{{{{rm{QMLM}}}}}}}}}otimes {{{{{{mathrm{id}}}}}}}right)(rho ({x}_{i}))right],$$


for some Hermitian observable ({O}_{{x}_{i},{y}_{i}}^{{{{{{{{rm{loss}}}}}}}}}). As is common in classical learning theory, the prediction error bounds will depend on the largest (absolute) value that the loss function can attain. In our case, we therefore assume ({C}_{{{{{{{{rm{loss}}}}}}}}}:=mathop{sup }nolimits_{x,y}|vert {O}_{x,y}^{{{{{{{{rm{loss}}}}}}}}}|vert < infty < ?A3B2 tal? > ), i.e., the spectral norm can be bounded uniformly over all possible loss observables.

In Eq. (1), we take the measurement to act on a single copy of the output of the QMLM ({{{{{{{{mathcal{E}}}}}}}}}_{{{{{{{{boldsymbol{alpha }}}}}}}}}^{{{{{{{{rm{QMLM}}}}}}}}}(cdot )) upon input of (a subsystem of) the data encoding state ρ(xi). At first this looks like a restriction. However, note that one can choose ({{{{{{{{mathcal{E}}}}}}}}}_{{{{{{{{boldsymbol{alpha }}}}}}}}}^{{{{{{{{rm{QMLM}}}}}}}}}(cdot )) to be a tensor product of multiple copies of a QMLM, each with the same parameter setting, applied to multiple copies of the input state. Hence our framework is general enough to allow for global measurements on multiple copies. In this addition to the aforementioned situation, we further study the case in which trainable gates are more generally reused.

For a training dataset (S={{({x}_{i},{y}_{i})}}_{i=1}^{N}) of size N, the average loss for parameters α on the training data is

$${hat{R}}_{S}({{{{{{{boldsymbol{alpha }}}}}}}})=frac{1}{N}mathop{sum }limits_{i=1}^{N}ell ({{{{{{{boldsymbol{alpha }}}}}}}};{x}_{i},{y}_{i}),$$


which is often referred to as the training error. When we obtain a new input x, the prediction error of a parameter setting α is taken to be the expected loss

$$R({{{{{{{boldsymbol{alpha }}}}}}}})=mathop{{mathbb{E}}}limits_{(x,y) sim P}left[ell ({{{{{{{boldsymbol{alpha }}}}}}}};x,y)right],$$


where the expectation is with respect to the distribution P from which the training examples are generated.

Achieving small prediction error R(α) is the ultimate goal of (quantum) machine learning. As P is generally not known, the training error ({hat{R}}_{S}({{{{{{{boldsymbol{alpha }}}}}}}})) is often taken as a proxy for R(α). This strategy can be justified via bounds on the generalization error

$${{{{{{mathrm{gen}}}}}}},({{{{{{{boldsymbol{alpha }}}}}}}})=R({{{{{{{boldsymbol{alpha }}}}}}}})-{hat{R}}_{S}({{{{{{{boldsymbol{alpha }}}}}}}}),$$


which is the key quantity that we bound in our theorems.

Analytical results

We prove probabilistic bounds on the generalization error of a QMLM. Our bounds guarantee that a good performance on a sufficiently large training data set implies, with high probability, a good performance on previously unseen data points. In particular, we provide a precise meaning of “sufficiently large” in terms of properties of the QMLM and the employed training procedure.

Figure 1 gives an overview of the different scenarios considered in this work. We begin with the basic form of our result. We consider a QMLM that has arbitrarily many non-trainable global quantum gates and T trainable local quantum gates. Here, by local we mean κ-local for some n-independent locality parameter κ, and a local quantum gate can be a unitary or a quantum channel acting on κ qubits. Then we have the following bound on the generalization error for the QMLM with final parameter setting α* after training:

Fig. 1: Various types of Quantum Machine Learning Models (QMLMs).
figure 1

Panel (a) depicts a basic QMLM with T = 6 independently parameterized gates. The gray boxes illustrate some global evolutions that are not trainable. Panel (b) shows a gate-sharing QMLM with T = 2 independently parameterized gates, each gate is repeatedly used for M = 3 times. In panel (c), we depict a multi-copy QMLM. We take measurement data from M rounds of a basic QMLM with T = 6 parameterized gates and post-process the measurement outcomes to produce an output. Running M copies of a basic QMLM with T gates is equivalent to running a gate-sharing QMLM with T = 6 parameterized gates, in which each gate is repeated M times. Panel (d) describes a gate-sharing QMLM under optimization. The parameterized gate to the left undergoes a small change, while the one to the right undergoes a large change. If we sort the changes Δ1, Δ2 from large to small, then Δ1 Δ2 ≈ 0. Finally, panel (e) illustrates gate-sharing QMLM with variable structure. The number T of parameterized gates changes throughout the optimization. The figure begins with T = 1 and ends with T = 2.

Theorem 1

(Basic QMLM). For a QMLM with T parameterized local quantum channels, with high probability over training data of size N, we have that

$${{{{{{mathrm{gen}}}}}}},({{{{{{{{boldsymbol{alpha }}}}}}}}}{*})in {{{{{{{mathcal{O}}}}}}}}left(sqrt{frac{Tlog T}{N}}right).$$


Remark 1

Theorem 1 directly implies sample complexity bounds: For any ε > 0, we can, with high success probability, guarantee that gen(α*) ε, already with training data of size (N sim Tlog T/{varepsilon }^{2}), which scales effectively linearly with T, the number of parameterized gates.

For efficiently implementable QMLMs with (Tin {{{{{{{mathcal{O}}}}}}}}({{{{{{{rm{poly}}}}}}}}n)), a sample size of (Nin {{{{{{{mathcal{O}}}}}}}}left({{{{{{{rm{poly}}}}}}}}n/{varepsilon }^{2}right)) is already sufficient. More concretely, if (Tin {{{{{{{mathcal{O}}}}}}}}left({n}^{D}right)) for some degree D, then the corresponding sufficient sample complexity obtained from Theorem 1 satisfies (Nin tilde{{{{{{{{mathcal{O}}}}}}}}}left({n}^{D}/{varepsilon }^{2}right)), where the (tilde{{{{{{{{mathcal{O}}}}}}}}}) hides factors logarithmic in n. In the NISQ era56, we expect the number T of trainable maps to only grow mildly with the number of qubits, e.g., as in the architectures discussed in refs. 18, 45, 57. In this case, Theorem 1 gives an especially strong guarantee.

In various QMLMs, such as QCNNs, the same parameterized local gates are applied repeatedly. One could also consider running the same QMLM multiple times to gather measurement data and then post-processing that data. In both cases, one should consider the QMLM as using the same parameterized local gates repeatedly. We assume each gate to be repeated at most M times. A direct application of Theorem 1 would suggest that we need a training data size N of roughly MT, the total number of parameterized gates. However, the required number of training data actually is much smaller:

Theorem 2

(Gate-sharing QMLM). Consider a QMLM with T independently parameterized local quantum channels, where each channel is reused at most M times. With high probability over training data of size N, we have

$${{{{{{mathrm{gen}}}}}}},({{{{{{{{boldsymbol{alpha }}}}}}}}}{*})in {{{{{{{mathcal{O}}}}}}}}left(sqrt{frac{Tlog (MT)}{N}}right).$$


Thus, good generalization, as in Remark 1, can already be guaranteed, with high probability, when the data size effectively scales linearly in T (the number of independently parameterized gates) and only logarithmically in M (the number of uses). In particular, applying multiple copies of the QMLM in parallel does not significantly worsen the generalization performance compared to a single copy. Thus, as we discuss in Supplementary Note 3, Theorem 2 ensures that we can increase the number of shots used to estimate expectation values at the QMLM output without substantially harming the generalization behavior.

The optimization process of the QMLM also plays an important role in the generalization performance. Suppose that during the optimization process, the tth local gate changed by a distance Δt. We can bound the generalization error by a function of the changes ({{{{{Delta }}}_{t}}}_{t}).

Theorem 3

(Gate-sharing QMLM under optimization). Consider a QMLM with T independently parameterized local quantum channels, where the tth channel is reused at most M times and is changed by Δt during the optimization. Assume Δ1≥…≥ΔT. With high probability over training data of size N, we have

$${{{{{{mathrm{gen}}}}}}},({{{{{{{{boldsymbol{alpha }}}}}}}}}{*})in {{{{{{{mathcal{O}}}}}}}}left(mathop{min }limits_{K=0,ldots,T}left{sqrt{frac{Klog (MT)}{N}}+mathop{sum }limits_{k=K+1}^{T}M{{{Delta }}}_{k}right}right).$$


When only KT local quantum gates have undergone a significant change, then the generalization error will scale at worst linearly with K and logarithmically in the total number of parameterized gates MT. Given that recent numerical results suggest that the parameters in a deep parameterized quantum circuit only change by a small amount during training58,59, Theorem 3 may find application in studying the generalization behavior of deep QMLMs.

Finally, we consider a more advanced type of variable ansatz optimization strategy that is also adopted in practice60,61,62,63. Instead of fixing the structure of the QMLM, such as the number of parameterized gates and how the parameterized gates are interleaved with the fixed gates, the optimization algorithm could vary the structure, e.g., by adding or deleting parameterized gates. We assume that for each number T of parameterized gates, there are GT different QMLM architectures.

Theorem 4

(Gate-sharing QMLM with variable structure). Consider a QMLM with an arbitrary number of parameterized local quantum channels, where for each T > 0, we have GT different QMLM architectures with T parameterized gates. Suppose that after optimizing on the data, the QMLM has T independently parameterized local quantum channels, each repeated at most M times. Then, with high probability over input training data of size N,

$${{{{{{mathrm{gen}}}}}}},({{{{{{{{boldsymbol{alpha }}}}}}}}}{*})in {{{{{{{mathcal{O}}}}}}}}left(sqrt{frac{Tlog (MT)}{N}}+sqrt{frac{log ({G}_{T})}{N}}right).$$


Thus, even if the QMLM can in principle use exponentially many parameterized gates, we can control the generalization error in terms of the number of parameterized gates used in the QMLM after optimization, and the dependence on the number of different architectures is only logarithmic. This logarithmic dependence is crucial as even in the cases when GT grows exponentially with T, we have (log ({G}_{T})/Nin {{{{{{{mathcal{O}}}}}}}}(T/N)).

Numerical results

In this section we present generalization error results obtained by simulating the following two QML implementations: (1) using a QCNN to classify states belonging to different quantum phases, and (2) training a parameterized quantum circuit to compile a quantum Fourier transform matrix.

We begin with the quantum phase classification application. The QCNN architecture introduced in45 generalizes the model of (classical) convolutional neural networks with the goal of performing pattern recognition on quantum data. It is composed of so-called convolutional and pooling layers, which alternate. In a convolutional layer, a sequence of translationally invariant parameterized unitaries on neighbouring qubits is applied in parallel, which works as a filter between feature maps in different layers of the QCNN. Then, in the pooling layers, a subset of the qubits are measured to reduce the dimensionality of the state while preserving the relevant features of the data. Conditioned on the corresponding measurement outcomes, translationally invariant parameterized 1-qubit unitaries are applied. The QCNN architecture has been employed for supervised QML tasks of classification of phases of matter and to devise quantum error correction schemes45. Moreover, QCNNs have been shown not to exhibit barren plateaus, making them a generically trainable QML architecture27.

The action of a QCNN can be considered as mapping an input state ρin to an output state ρout given as ({rho }_{{{mbox{out}}}}={{{{{{{{mathcal{E}}}}}}}}}_{{{{{{{{boldsymbol{alpha }}}}}}}}}^{{{{{{{{rm{QCNN}}}}}}}}}({rho }_{{{mbox{in}}}})). Then, given ρout, one measures the expectation value of a task-specific Hermitian operator.

In our implementation, we employ a QCNN to classify states belonging to different symmetry protected topological phases. Specifically, we consider the generalized cluster Hamiltonian

$$H=mathop{sum }limits_{j=1}^{n}left({Z}_{j}-{J}_{1}{X}_{j}{X}_{j+1}-{J}_{2}{X}_{j-1}{Z}_{j}{X}_{j+1}right),$$


where Zi (Xi) denote the Pauli z (x) operator acting on qubit i, and where J1 and J2 are tunable coupling coefficients. As proved in64, and as schematically shown in Fig. 2, the ground-state phase diagram of the Hamiltonian of Eq. (9) has four different phases: symmetry-protected topological (I), ferromagnetic (II), anti-ferromagnetic (III), and trivial (IV). In the Methods section, we provide additional details regarding the classical simulation of the ground states of H.

Fig. 2: Generalization performance of quantum phase recognition.
figure 2

We employed the QCNN architecture for quantum phase recognition on ground states of the generalized cluster Hamiltonian H of Eq. (9). We evaluated the phase assigned by the QCNN to a point in the J1J2-plane by sampling 8192 computational basis measurement outcomes and taking the least frequent outcome as the predicted phase. Panel (a) visualizes the performance of the QCNN for 16-qubits, trained with 30 data points, which were labelled according to the analytically determined phase diagram. Blue crosses denote training data points (not all 30 are shown). Blue (red) circles represent correctly (incorrectly) classified points. Panel (b) shows that, as the training data size increases, the training accuracy quickly becomes a good predictor for the testing accuracy on 10,000 randomly sampled points, i.e., the dependence of testing accuracy on training accuracy is approximately linear with slope increasing with N. The different points in the plot correspond to different parameter settings in the QCNN throughout the optimization. The dotted gray line shows the baseline accuracy of 25% achieved by random guessing. Panel (c) shows that the improvement in the slope with growing training data size is similar for different numbers of qubits, reflecting the at-worst polylogarithmic dependence of N on n predicted by our bounds.

By sampling parameters in the (J1, J2) plane, we create a training set ({{(left|{psi }_{i}rightrangle,{y}_{i})}}_{i=1}^{N}) composed of ground states (left|{psi }_{i}rightrangle) of H and their associated labels yi. Here, the labels are in the form of length-two bit strings, i.e., yi {0, 1}2, where each possible bit string corresponds to a phase that (left|{psi }_{i}rightrangle) can belong to. The QCNN maps the n-qubit input state (left|{psi }_{i}rightrangle) to a 2-qubit output state. We think of the information about the phase as being encoded into the output state by which of the 4 computational basis effect operators is assigned the smallest probability. Namely, we define the loss function as (ell ({{{{{{{boldsymbol{alpha }}}}}}}};left|{psi }_{i}rightrangle,{y}_{i}):=leftlangle {y}_{i}right|{{{{{{{{mathcal{E}}}}}}}}}_{{{{{{{{boldsymbol{alpha }}}}}}}}}^{{{{{{{{rm{QCNN}}}}}}}}}(left|{psi }_{i}rightrangle leftlangle {psi }_{i}right|)left|, {y}_{i}rightrangle). This leads to an empirical risk given by

$${hat{R}}_{S}({{{{{{{boldsymbol{alpha }}}}}}}})=frac{1}{N}mathop{sum }limits_{i=1}^{N}langle, {y}_{i}|{{{{{{{{mathcal{E}}}}}}}}}_{{{{{{{{boldsymbol{alpha }}}}}}}}}^{{{{{{{{rm{QCNN}}}}}}}}}(vert {psi }_{i} rangle langle {psi }_{i}|)|{y}_{i} rangle .$$


In Fig. 2, we visualize the phase classification performance achieved by our QCNN, trained according to this loss function, while additionally taking the number of misclassified points into account. Moreover, we show how the true risk, or rather the test accuracy as proxy for it, correlates well with the achieved training accuracy, already for small training data sizes. This is in agreement with our theoretical predictions, discussed in more detail in Supplementary Note 4, which for QCNNs gives a generalization error bound polylogarithmic in the number of qubits. We note that refs. 65, 66 observed similarly favorable training data requirements for a related task of learning phase diagrams.

Next, we turn our attention to the unitary compiling application. Compiling is the task of transforming a high-level algorithm into a low-level code that can be implemented on a device. Unitary compiling is a paradigmatic task in the NISQ era where a target unitary is compiled into a gate sequence that complies with NISQ device limitations, e.g., hardware-imposed connectivity and shallow depth to mitigate errors. Unitary compiling is crucial to the quantum computing industry, as it is essentially always performed prior to running an algorithm on a NISQ device, and various companies have their own commercial compilers67,68. Hence, any ability to accelerate unitary compiling could have industrial impact.

Here we consider the task of compiling the unitary U of the n-qubit Quantum Fourier Transform (QFT)69 into a short-depth parameterized quantum circuit V(α). For V(α) we employ the VAns (Variable Ansatz) algorithm62,70, which uses a machine learning protocol to iteratively grow a parameterized quantum circuit by placing and removing gates in a way that empirically leads to lower loss function values. Unlike traditional approaches that train just continuous parameters in a fixed structure circuit, VAns also trains discrete parameters, e.g., gate placement or type of gate, to explore the architecture hyperspace. In Supplementary Note 5, we apply our theoretical results in this compiling scenario.

The training set for compilation is of the form ({{left|{psi }_{i}rightrangle,Uleft|{psi }_{i}rightrangle }}_{i=1}^{N}), consisting of input states (left|{psi }_{i}rightrangle) and output states obtained through the action of U. The (left|{psi }_{i}rightrangle) are drawn independently from an underlying data-generating distribution. In our numerics, we consider three such distributions: (1) random computational basis states, (2) random (non-orthogonal) low-entangled states, and (3) Haar random n-qubit states. Note that states in the first two distributions are easy to prepare on a quantum computer, whereas states from the last distribution become costly to prepare as n grows. As the goal is to train V(α) to match the action of U on the training set, we define the loss function as the squared trace distance between (Uleft|{psi }_{i}rightrangle) and (V({{{{{{{boldsymbol{alpha }}}}}}}})left|{psi }_{i}rightrangle), i.e., (ell (alpha ;left|{psi }_{i}rightrangle,Uleft|{psi }_{i}rightrangle ):=vert|Uleft|{psi }_{i}rightrangle leftlangle {psi }_{i}right|{U}^{{{{dagger}}} }-V({{{{{{{boldsymbol{alpha }}}}}}}})left|{psi }_{i}rightrangle leftlangle {psi }_{i}right|V{({{{{{{{boldsymbol{alpha }}}}}}}})}^{{{{dagger}}} }|{|}_{1}^{2}). This leads to the empirical risk

$${hat{R}}_{S}({{{{{{{boldsymbol{alpha }}}}}}}})=frac{1}{N}mathop{sum }limits_{i=1}^{N}|vert Uleft|{psi }_{i}rightrangle leftlangle {psi }_{i}right|{U}^{{{{dagger}}} }-V({{{{{{{boldsymbol{alpha }}}}}}}})left|{psi }_{i}rightrangle leftlangle {psi }_{i}right|V{({{{{{{{boldsymbol{alpha }}}}}}}})}^{{{{dagger}}} }|{|}_{1}^{2},$$


where 1 indicates the trace norm.

Figure 3 shows our numerical results. As predicted by our analytical results, we can, with high success probability, accurately compile the QFT when training on a data set of size polynomial in the number of qubits. Our numerical investigation shows a linear scaling of the training requirements when training on random computational basis states. This better than the quadratic scaling implied by a direct application of our theory, which holds for any arbitrary data-generating distribution. Approximate implementations of QFT with a reduced number of gates71, combined with our results, could help to further study this apparent gap theoretically. When training on Haar random states, our numerics suggest that an even smaller number of training data points is sufficient for good generalization: Up to n = 9 qubits, we generalize well from a constant number of training data points, independent of the system size.

Fig. 3: Generalization performance of variational unitary compiling.
figure 3

We employed a variable structure QMLM (as discussed near Theorem 4). Panel (a) shows the dependence of ({N}_{min }), the minimum training data size for accurate compilation, on n, the number of qubits. Accurate compilation is defined as achieving (parallel U-V({{{{{{{boldsymbol{alpha }}}}}}}}){parallel }_{F}^{2} , < , 1{0}^{-5}) on 1 out of 8 (blue) or on 7 out of 8 (red) runs. For training data with random computational basis inputs (solid lines), ({N}_{min }) scales linearly in n. When training on examples with Haar random inputs (dashed lines), ({N}_{min }) is constant up to system size n = 9. In Panel (b), for n = 9 qubits, we plot the prediction error of successfully trained (training cost < 10−8) runs for 8 training data sets with N = 16 to N = 30 random computational basis inputs. Panel (c) shows the dependence of the testing error on the reciprocal of the training error for different training data sizes, in the case of 9 qubits. Here, the data consisted of random computational basis states and the corresponding outputs. As N increases, small training error becomes a more reliable predictor for small testing error. Each subplot shows 8 different training runs, trained on different training data sets.

Even more striking are our results when initializing close to the solution. In this case, as shown in Fig. 4, we find that two training data points suffice to obtain accurate generalization, which holds even up to a problem size of 40 qubits. Our theoretical results in Theorem 3 do predict reduced training requirements when initializing near the solution. Hence, the numerics are in agreement with the theory, although they paint an even more optimistic picture and suggest that further investigation is needed to understand why the training data requirements are so low. While the assumption of initialization near the solution is only viable assuming additional prior knowledge, it could be justified in certain scenarios. For example, if the unitaries to be compiled depend on a parameter, e.g., time, and if we have already compiled the unitary for one parameter setting, we might use this as initialization for unitaries with a similar parameter.

Fig. 4: Performance of variational unitary compiling when initializing near the solution.
figure 4

Each panel shows the results of a single randomly initialized training run, where we used randomly drawn low-entangled states for training. The testing error on 20 test states, which we allow to be more strongly entangled than the states used during training, is plotted versus the reciprocal of the training error for training data sizes N = 1, 2, for different system sizes n. A training data set of size N = 1 is not sufficient to guarantee good generalization: Even with decreasing training error, the testing error remains large. In contrast, assuming favorably initialized training, N = 2 training data points suffice for good generalization, even for up to n = 40 qubits.

Leave a Comment