Quantum Machine Learning Tutorial

A Hands-on Tutorial for Machine Learning Practitioners and Researchers

Chapter 5.4 Recent Advancements

The Transformer architecture has revolutionized AI and has had a broad impact across domains. As classical computing approaches its physical limits, it is natural to ask how quantum computers can be leveraged to build Transformers with better performance and energy efficiency. Beyond fault-tolerant quantum Transformers, multiple works are advancing this frontier from various perspectives.

One direction is to design novel quantum neural network architectures (introduced in Chapter 4) guided by the intuition behind the Transformer, especially the design of the self-attention block. In particular, @li2023quantumselfattentionneuralnetworks proposes quantum self-attention neural networks and verifies their effectiveness on synthetic data sets for quantum natural language processing. Several follow-up works pursue this direction [@Cherrat_2024; @evans2024learningsasquatchnovelvariational; @widdows2024quantumnaturallanguageprocessing].
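To make the idea concrete, the following is a minimal classical simulation sketch, loosely following the Gaussian-projected self-attention idea: per-token query, key, and value scalars are obtained as expectation values of small parameterized circuits, and attention weights come from a Gaussian kernel of the query-key difference. All sizes, parameter shapes, and function names here are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def ry(theta):
    """Single-qubit Y-rotation matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def expectation_z(angles):
    """Encode data into a single-qubit circuit via chained Ry rotations
    and return the <Z> expectation value (simulated classically)."""
    state = np.array([1.0, 0.0])
    for a in angles:
        state = ry(a) @ state
    z = np.array([[1.0, 0.0], [0.0, -1.0]])
    return float(state @ z @ state)

def quantum_self_attention(X, theta_q, theta_k, theta_v):
    """Toy quantum self-attention: query/key/value scalars per token come
    from parameterized circuits; attention weights use a Gaussian kernel
    of the query-key difference, normalized per row."""
    q = np.array([expectation_z(x + theta_q) for x in X])
    k = np.array([expectation_z(x + theta_k) for x in X])
    v = np.array([expectation_z(x + theta_v) for x in X])
    alpha = np.exp(-(q[:, None] - k[None, :]) ** 2)
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ v

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(4, 3))           # 4 tokens, 3 features
theta = [rng.uniform(-np.pi, np.pi, 3) for _ in range(3)]
out = quantum_self_attention(X, *theta)
print(out.shape)  # (4,)
```

On hardware, `expectation_z` would be estimated by repeated measurements of a parameterized circuit rather than computed exactly as here.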

Another research direction explores how quantum processors can accelerate specific components of the Transformer architecture. Specifically, @gao2023fastquantumalgorithmattention considers how to compute the self-attention matrix under a sparsity assumption and shows a quadratic quantum speedup. @liu2024quantumcircuitbasedcompressionperspective harnesses quantum neural networks to generate the weight parameters of a classical model. In addition, @liu2024towards devises a quantum algorithm for training large-scale neural networks, implying an exponential speedup under certain conditions.
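To illustrate the sparsity setting, here is a purely classical sketch in which each query attends only to its top-k keys; a quantum algorithm can locate such dominant entries with Grover-type search over the n keys, which is where a quadratic speedup can arise. The function name, the top-k formulation, and all dimensions are illustrative assumptions rather than the algorithm of the cited work.

```python
import numpy as np

def sparse_attention(Q, K, V, k):
    """Classical sketch of attention under a sparsity assumption: for each
    query, only the k largest query-key scores contribute; the softmax is
    restricted to those entries."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    out = np.zeros_like(V)
    for i in range(n):
        idx = np.argpartition(scores[i], -k)[-k:]   # indices of top-k scores
        w = np.exp(scores[i, idx] - scores[i, idx].max())
        w /= w.sum()
        out[i] = w @ V[idx]
    return out

rng = np.random.default_rng(1)
n, d = 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(sparse_attention(Q, K, V, k=2).shape)  # (8, 4)
```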

Despite this progress, several important questions remain open. One key challenge is devising efficient methods for encoding classical data or parameters onto quantum computers. Currently, most quantum algorithms can implement only one, or at most a constant number of, Transformer layers [@guo2024quantumlinear2024; @liao2024gptquantumcomputer; @khatri2024quixerquantumtransformermodel] without quantum tomography. Are there effective methods that generalize to multiple layers, or is achieving this even necessary? Moreover, given the numerous variants of the classical Transformer architecture, can these variants also benefit from the capabilities of quantum computers? Lastly, if one considers training a model directly on a quantum computer, is it possible to do so in a "quantum-native" manner, avoiding excessive data read-in and read-out operations?
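The read-in bottleneck can be seen already in amplitude encoding, a standard way to load classical data: n values become the amplitudes of a state on roughly log2(n) qubits, yet preparing such a state on hardware generally requires circuit depth that grows with n. The sketch below simulates only the classical preprocessing step (padding and normalization); the function name is illustrative.

```python
import numpy as np

def amplitude_encode(x):
    """Amplitude-encode a classical vector: pad to the next power of two,
    normalize to unit norm, and interpret the result as the amplitudes of
    a state on log2(dim) qubits (classically simulated)."""
    dim = 1 << int(np.ceil(np.log2(len(x))))
    padded = np.zeros(dim)
    padded[: len(x)] = x
    return padded / np.linalg.norm(padded)

x = np.array([3.0, 1.0, 2.0])         # 3 values -> 2 qubits (4 amplitudes)
state = amplitude_encode(x)
print(len(state), np.isclose(state @ state, 1.0))  # 4 True
```

Reading the result back out is equally costly: recovering all amplitudes requires tomography, which is why avoiding repeated read-in and read-out is central to the "quantum-native" training question above.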