
Transformer (deep learning architecture)

From Wikipedia, the free encyclopedia
The first Transformer architecture, as described in the 2017 paper Attention Is All You Need.

A transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need".[1] Text is converted to numerical representations called tokens, and each token is converted into a vector by looking it up in a word embedding table.[1] At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation.[2][3]

Transformers have the advantage of having no recurrent units, and therefore require less training time than earlier recurrent neural network (RNN) architectures such as long short-term memory (LSTM).[4] Later variations have been widely adopted for training large language models (LLM) on large (language) datasets, such as the Wikipedia corpus and Common Crawl.[5]

Transformers are currently used in large-scale natural language processing, computer vision (vision transformers), audio,[6] multi-modal processing, robotics,[7] and even playing chess.[8] They have also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs)[9] and BERT[10] (Bidirectional Encoder Representations from Transformers).

History


Predecessors


Sequence modelling and generation had been done with plain recurrent neural networks for many years. An early well-cited example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

One key component of the attention mechanism is to include neurons that multiply the outputs of other neurons. Such neurons were called multiplicative units,[11] and neural networks using multiplicative units were called sigma-pi networks[12] or higher-order networks,[13] but they faced high computational complexity.[4] A key breakthrough was LSTM (1995),[note 1] which incorporated multiplicative units into a recurrent network, as well as other innovations that prevented the vanishing gradient problem, and allowed efficient learning of long-sequence modelling. It became the standard architecture for long sequence modelling until the 2017 publication of Transformers.

However, LSTM did not solve a general problem that recurrent networks usually[note 2] have, which is that it cannot operate in parallel over all tokens in a sequence. It must operate one at a time from the first token to the last. The fast weight controller (1992)[14] was an early attempt to bypass the difficulty. It used the fast weights architecture (1987),[15] where one neural network outputs the weights of another neural network. It was later shown to be equivalent to the linear Transformer without normalization.[16][17] Schmidhuber used the terminology "learning internal spotlights of attention" in 1993,[18] and now claims it was a precursor to what is now known as the attention mechanism, but Geoffrey Hinton disputes this claim of priority.[19]

Attention with seq2seq


The idea of encoder-decoder sequence transduction had been developed in the early 2010s (see [20][21] for previous papers). The papers most commonly cited as the originators that produced seq2seq are two papers from 2014.[20][21]

(Sutskever et al., 2014)[21] was a 380M-parameter seq2seq model for machine translation using two long short-term memory (LSTM) networks. The architecture consists of two parts. The encoder is an LSTM that takes in a sequence of tokens and turns it into a vector. The decoder is another LSTM that converts the vector into a sequence of tokens. Another 130M-parameter model used gated recurrent units (GRU).[20] Later research showed that GRUs are neither better nor worse than LSTMs for seq2seq.[22][23]

These early seq2seq models had no attention mechanism, and the state vector was accessible only after the last word of the source text had been processed. Although in theory such a vector retains the information about the whole original sentence, in practice the information is poorly preserved, since the input is processed sequentially by one recurrent network into a fixed-size output vector, which is then processed by another recurrent network into an output. If the input is long, then the output vector cannot contain all relevant information, and the output quality degrades. As evidence, reversing the input sentence improved seq2seq translation.[24]

(Bahdanau et al., 2014)[2] introduced an attention mechanism to seq2seq for machine translation to solve the bottleneck problem, allowing the model to process long-distance dependencies more easily. They called their model RNNsearch, as it "emulates searching through a source sentence during decoding a translation".

(Luong et al., 2015)[25] compared the relative performance of global (that of (Bahdanau et al., 2014)) and local (sliding window) attention model architectures for machine translation, and found that a mixed attention architecture had higher quality than global attention, while the use of a local attention architecture reduced translation time.

In 2016, Google Translate was revamped to Google Neural Machine Translation, which replaced the previous model based on statistical machine translation. The new model was a seq2seq model where the encoder and the decoder were both 8 layers of bidirectional LSTM.[26] It took nine months to develop, and it achieved a higher level of performance than the statistical approach, which took ten years to develop.[27]

Attention


Seq2seq models with attention still suffered from the same issue as recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, decomposable attention applied the attention mechanism to a feedforward network, which is easy to parallelize.[28] One of its authors, Jakob Uszkoreit, suspected that attention without recurrence is sufficient for language translation, hence the title "attention is all you need".[29]

In 2017, the original (100M-sized) encoder-decoder transformer model was proposed in the "Attention is all you need" paper. At the time, the focus of the research was on improving seq2seq for machine translation by removing its recurrence to process all tokens in parallel, while preserving its dot-product attention mechanism to keep its text processing performance.[1] Its parallelizability was an important factor in its widespread use in large neural networks.[30]

Transformer boom


Transformers were used in many models that contributed to the AI boom.

In language modelling, ELMo (2018) was a bi-directional LSTM that produces contextualized word embeddings, improving upon the line of research from bag of words and word2vec. It was followed by BERT (2018), an encoder-only Transformer model.[31] In October 2019, Google started using BERT to process search queries.[32] In 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model with a Transformer-encoder–RNN-decoder model.[33]

Starting in 2018, the OpenAI GPT series of decoder-only Transformers became state of the art in natural language generation. In 2022, a chatbot based on GPT-3, ChatGPT, became unexpectedly popular,[34] triggering a boom around large language models.[35][36]

Since 2020, Transformers have been applied to modalities beyond text, including the vision transformer,[37] speech recognition,[38] robotics,[39] and multimodal processing.[40] The vision transformer, in turn, stimulated new developments in convolutional neural networks.[41] Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024),[42] and Sora (2024) are based on the Transformer architecture.

Training


Methods for stabilizing training


The plain transformer architecture had difficulty converging. In the original paper,[1] the authors recommended using learning rate warmup. That is, the learning rate should linearly scale up from 0 to its maximal value for the first part of the training (usually recommended to be 2% of the total number of training steps), before decaying again.
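
The schedule described in the original paper can be written as a small function; the following is a minimal sketch (d_model = 512 and warmup_steps = 4000 are the base-model defaults from the paper, used here purely for illustration):

    # Learning-rate schedule from the original paper: linear warmup followed by
    # inverse-square-root decay.
    def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
        step = max(step, 1)  # avoid division by zero at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    for s in (1, 1000, 4000, 10000, 100000):
        print(s, round(transformer_lr(s), 6))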

A 2020 paper found that using layer normalization before (instead of after) multiheaded attention and feedforward layers stabilizes training, not requiring learning rate warmup.[43]

Pretrain-finetune


Transformers typically undergo self-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. Pretraining is typically done on a larger dataset than fine-tuning, due to the limited availability of labeled training data.

The T5 transformer report[44] documents a large number of pretraining tasks. Some examples are:

  • restoring corrupted text: Thank you <X> me to your party <Y> week. -> <X> for inviting <Y> last <Z> where the <Z> means "end of output".
  • translation: translate English to German: That is good. -> Das ist gut.
  • judging the grammatical acceptability of a sentence (CoLA sentence): The course is jumping well. -> not acceptable.

Tasks


In general, there are three classes of language modelling tasks: "masked", "autoregressive", and "prefixLM".[45][46] These classes are independent of architectural choice. One does not need to use a Transformer for training language models to perform these tasks. However, they are often discussed in the context of Transformers.

In a masked task, one or more of the tokens is masked out, and the model would produce a probability distribution predicting what the masked-out tokens are based on the context. The loss function for the task is typically the sum of log-perplexities for the masked-out tokens, and the model is trained to minimize this loss. BERT is trained for two tasks, one of which was masked token prediction.

In an autoregressive task, the entire sequence is masked at first, and the model would produce a probability distribution predicting what the first token is. The first token is then revealed, and the model would predict the second token, and so on. The loss function for the task is still typically the same.

In a prefixLM task, the sequence is divided into two parts. The first part is presented as context, and the model would predict the first token of the second part. Then that would be revealed, and the model would predict the second token, and so on. The loss function for the task is still typically the same.
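
As an illustration of how the three task types differ only in which positions contribute to the loss, here is a minimal NumPy sketch; the logits and targets are random stand-ins rather than the output of a real model:

    import numpy as np

    def lm_loss(logits: np.ndarray, targets: np.ndarray, predict_mask: np.ndarray) -> float:
        """Sum of negative log-likelihoods over the positions selected by predict_mask."""
        z = logits - logits.max(axis=-1, keepdims=True)                  # stabilized log-softmax
        log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
        nll = -log_probs[np.arange(len(targets)), targets]
        return float(nll[predict_mask].sum())

    # Toy example: 5 tokens, vocabulary of 10. Only the prediction mask changes per task.
    rng = np.random.default_rng(0)
    logits, targets = rng.normal(size=(5, 10)), rng.integers(0, 10, size=5)
    masked_lm = np.array([False, True, False, True, False])   # predict only the masked positions
    causal_lm = np.ones(5, dtype=bool)                        # predict every token from its prefix
    prefix_lm = np.array([False, False, True, True, True])    # predict only the suffix
    for name, m in [("masked", masked_lm), ("autoregressive", causal_lm), ("prefixLM", prefix_lm)]:
        print(name, lm_loss(logits, targets, m))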

Note that "masked" as in "masked language modelling" is not "masked" as in "masked attention", and "prefixLM" (prefix language modeling) is not "prefixLM" (prefix language model).

Applications


The transformer has had great success in natural language processing (NLP), for example the tasks of machine translation and time series prediction. Many large language models such as GPT-2, GPT-3, GPT-4, Claude, BERT, XLNet, RoBERTa and ChatGPT demonstrate the ability of transformers to perform a wide variety of such NLP-related tasks, and have the potential to find real-world applications.

In addition to the NLP applications, the transformer has also been successful in other fields, such as computer vision and protein folding applications (such as AlphaFold). It was also applied to evaluating chess board positions: using static evaluation alone (that is, with no minimax search), it was able to achieve an Elo of 2895, putting it at grandmaster level.[8]

Implementations


The transformer model has been implemented in standard deep learning frameworks such as TensorFlow and PyTorch.

Transformers is a library produced by Hugging Face that supplies transformer-based architectures and pretrained models.[9]

Architecture


All transformers have the same primary components:

  • Tokenizers, which convert text into tokens.
  • Embedding layer, which converts tokens and positions of the tokens into vector representations.
  • Transformer layers, which carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.
  • (optional) Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.

Tokenization


As the Transformer architecture natively processes numerical data, not text, there must be a translation between text and tokens. A token is an integer that represents a character, or a short segment of characters. On the input side, the input text is parsed into a token sequence. Similarly, on the output side, the output tokens are parsed back into text. The module doing the conversion between token sequences and texts is a tokenizer.

The set of all tokens is the vocabulary of the tokenizer. When faced with tokens outside the vocabulary, typically a special token is used, written as "[UNK]" for "unknown".

Some commonly used tokenizers are byte pair encoding, WordPiece, and SentencePiece.
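A toy word-level tokenizer makes the vocabulary lookup and the "[UNK]" token concrete; this is purely illustrative (the vocabulary is made up), and real tokenizers such as byte pair encoding, WordPiece, and SentencePiece split text into subword units instead:

    # Toy word-level tokenizer illustrating the vocabulary and the "[UNK]" token.
    vocab = {"[UNK]": 0, "the": 1, "dog": 2, "bites": 3, "man": 4}
    inv_vocab = {i: t for t, i in vocab.items()}

    def encode(text: str) -> list[int]:
        return [vocab.get(word, vocab["[UNK]"]) for word in text.lower().split()]

    def decode(ids: list[int]) -> str:
        return " ".join(inv_vocab[i] for i in ids)

    print(encode("The dog bites the astronaut"))  # [1, 2, 3, 1, 0] -- unknown word maps to [UNK]
    print(decode([4, 3, 2]))                      # "man bites dog"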

Embedding


Each token is converted into an embedding vector via a lookup table. Then, the positional encoding vectors are added to their respective token embedding vectors, producing the sequence of input vectors.

Positional encoding

A diagram of a sinusoidal positional encoding.

A positional encoding is a fixed-size vector representation of the relative positions of tokens within a sequence: it provides the transformer model with information about where the words are in the input sequence. Without positional encoding, the model would be unable to process the input sequence as more than a bag of words; for example, both "man bites dog" and "dog bites man" would be processed exactly the same way.

The positional encoding is defined as a function of type $f: \mathbb{R} \to \mathbb{R}^d$, where $d$ is a positive even integer. The full positional encoding, as defined in the original paper, is given by the equation
$$(f(t))_{2k} = \sin\!\left(\frac{t}{r^k}\right), \qquad (f(t))_{2k+1} = \cos\!\left(\frac{t}{r^k}\right), \qquad k = 0, 1, \ldots, \tfrac{d}{2} - 1,$$
where $r = N^{2/d}$.

Here, $N$ is a free parameter that should be significantly larger than the biggest $t$ that would be input into the positional encoding function. In the original paper,[1] the authors chose $N = 10000$.

The function is in a simpler form when written as a complex function of type $f: \mathbb{R} \to \mathbb{C}^{d/2}$, namely $f(t) = \left(e^{i t / r^k}\right)_{k = 0, 1, \ldots, d/2 - 1}$, where $r = N^{2/d}$.

The main reason the authors chose this as the positional encoding function is that it allows one to perform shifts as linear transformations:
$$f(t + \Delta t) = \mathrm{diag}\big(f(\Delta t)\big)\, f(t)$$
where $\Delta t$ is the distance one wishes to shift. This allows the transformer to take any encoded position, and find the encoding of the position $n$ steps ahead or $n$ steps behind, by a matrix multiplication.

By taking a linear sum, any convolution can also be implemented as a linear transformation:
$$\sum_j c_j\, f(t + \Delta t_j) = \mathrm{diag}\!\left(\sum_j c_j\, f(\Delta t_j)\right) f(t)$$
for any constants $c_j$. This allows the transformer to take any encoded position and find a linear sum of the encoded locations of its neighbors. This sum of encoded positions, when fed into the attention mechanism, would create attention weights on its neighbors, much like what happens in a convolutional neural network language model. In the authors' words, "we hypothesized it would allow the model to easily learn to attend by relative position."

In typical implementations, all operations are done over the real numbers, not the complex numbers, but since complex multiplication can be implemented as real 2-by-2 matrix multiplication, this is a mere notational difference.
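
The following is a minimal NumPy sketch of the sinusoidal encoding above, together with a numerical check of the shift property; the parameter choices (d = 8, sixteen positions) are illustrative only:

    import numpy as np

    def sinusoidal_encoding(t: np.ndarray, d: int, N: float = 10000.0) -> np.ndarray:
        """f(t)[2k] = sin(t / N^(2k/d)), f(t)[2k+1] = cos(t / N^(2k/d)); d must be even."""
        rates = 1.0 / N ** (2 * np.arange(d // 2) / d)
        angles = np.outer(t, rates)
        enc = np.empty((len(t), d))
        enc[:, 0::2] = np.sin(angles)
        enc[:, 1::2] = np.cos(angles)
        return enc

    d = 8
    pos = sinusoidal_encoding(np.arange(16), d)

    # Shift property: in complex form f(t)[k] = exp(i t / N^(2k/d)), so shifting by dt is an
    # elementwise multiplication by exp(i dt / N^(2k/d)), i.e. a fixed linear map per offset.
    t, dt = 3, 5
    rates = 1.0 / 10000.0 ** (2 * np.arange(d // 2) / d)
    z_t  = pos[t, 1::2] + 1j * pos[t, 0::2]        # cos + i*sin
    z_td = pos[t + dt, 1::2] + 1j * pos[t + dt, 0::2]
    assert np.allclose(z_td, z_t * np.exp(1j * dt * rates))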

Encoder-decoder (overview)

One encoder-decoder block.
A Transformer is composed of stacked encoder layers and decoder layers.

Like earlier seq2seq models, the original transformer model used an encoder-decoder architecture. The encoder consists of encoding layers that process the input tokens iteratively one layer after another, while the decoder consists of decoding layers that iteratively process the encoder's output as well as the decoder's output tokens so far.

The function of each encoder layer is to generate contextualized token representations, where each representation corresponds to a token that "mixes" information from other input tokens via self-attention mechanism. Each decoder layer contains two attention sublayers: (1) cross-attention for incorporating the output of encoder (contextualized input token representations), and (2) self-attention for "mixing" information among the input tokens to the decoder (i.e., the tokens generated so far during inference time).[48][49]

Both the encoder and decoder layers have a feed-forward neural network for additional processing of the outputs and contain residual connections and layer normalization steps.[49] These feed-forward layers contain a majority of parameters in a Transformer model.

Scaled dot-product attention

Scaled dot-product attention, block diagram.

The transformer building blocks are scaled dot-product attention units. For each attention unit, the transformer model learns three weight matrices: the query weights $W^Q$, the key weights $W^K$, and the value weights $W^V$. For each token $i$, the input token representation $x_i$ is multiplied with each of the three weight matrices to produce a query vector $q_i = x_i W^Q$, a key vector $k_i = x_i W^K$, and a value vector $v_i = x_i W^V$. Attention weights are calculated using the query and key vectors: the attention weight $a_{ij}$ from token $i$ to token $j$ is the dot product between $q_i$ and $k_j$. The attention weights are divided by the square root of the dimension of the key vectors, $\sqrt{d_k}$, which stabilizes gradients during training, and passed through a softmax which normalizes the weights. The fact that $W^Q$ and $W^K$ are different matrices allows attention to be non-symmetric: if token $i$ attends to token $j$ (i.e. $q_i \cdot k_j$ is large), this does not necessarily mean that token $j$ will attend to token $i$ (i.e. $q_j \cdot k_i$ could be small). The output of the attention unit for token $i$ is the weighted sum of the value vectors of all tokens, weighted by $a_{ij}$, the attention from token $i$ to each token $j$.

The attention calculation for all tokens can be expressed as one large matrix calculation using the softmax function, which is useful for training because matrix operations are highly optimized on modern hardware. The matrices $Q$, $K$ and $V$ are defined as the matrices whose $i$th rows are the vectors $q_i$, $k_i$, and $v_i$ respectively. Then we can represent the attention as
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\mathsf T}}{\sqrt{d_k}}\right) V$$

where the softmax is applied over each of the rows of the matrix $\frac{Q K^{\mathsf T}}{\sqrt{d_k}}$.
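
A minimal NumPy sketch of scaled dot-product attention as defined above; the dimensions and weight matrices are arbitrary stand-ins, not trained parameters:

    import numpy as np

    def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
        x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
        """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V of shape (seq_len, d)."""
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        return softmax(scores) @ V

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 16))                  # 5 tokens, model width 16
    W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
    print(attention(X @ W_q, X @ W_k, X @ W_v).shape)   # (5, 8)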

Multi-head attention

Multiheaded attention, block diagram.

One set of matrices is called an attention head, and each layer in a transformer model has multiple attention heads. While each attention head attends to the tokens that are relevant to each token, multiple attention heads allow the model to do this for different definitions of "relevance". In addition, the influence field representing relevance can become progressively dilated in successive layers. Many transformer attention heads encode relevance relations that are meaningful to humans. For example, some attention heads can attend mostly to the next word, while others mainly attend from verbs to their direct objects.[50] The computations for each attention head can be performed in parallel, which allows for fast processing. The outputs for the attention layer are concatenated to pass into the feed-forward neural network layers.

Concretely, let the multiple attention heads be indexed by $i$; then we have
$$\text{MultiheadedAttention}(Q, K, V) = \text{Concat}_{i \in [\#\text{heads}]}\!\left(\text{Attention}\big(Q W_i^Q,\; K W_i^K,\; V W_i^V\big)\right) W^O$$
where the matrices $W_i^Q, W_i^K, W_i^V$ are "projection matrices" owned by individual attention head $i$, and $W^O$ is a final projection matrix owned by the whole multi-headed attention head. In self-attention, $Q = K = V = X$, the matrix whose rows are the word embeddings.
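
A minimal NumPy sketch of multi-head attention following the formula above, looping over heads for clarity (real implementations batch the heads into a single tensor operation, and the weights here are random stand-ins):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(X, W_q, W_k, W_v, W_o):
        """Concatenate the outputs of h scaled dot-product heads, then project with W_o."""
        heads = []
        for Wq, Wk, Wv in zip(W_q, W_k, W_v):              # one iteration per head
            Q, K, V = X @ Wq, X @ Wk, X @ Wv
            heads.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
        return np.concatenate(heads, axis=-1) @ W_o        # final projection W^O

    rng = np.random.default_rng(0)
    seq, d_model, h, d_head = 5, 16, 4, 4
    X = rng.normal(size=(seq, d_model))
    W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
    W_o = rng.normal(size=(h * d_head, d_model))
    print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)   # (5, 16)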

Masked attention


It may be necessary to cut out attention links between some word-pairs. For example, the decoder, when decoding for the token position $t$, should not have access to the token at position $t+1$. This may be accomplished before the softmax stage by adding a mask matrix $M$ that is $-\infty$ at entries where the attention link must be cut, and $0$ at other places:
$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\!\left(M + \frac{Q K^{\mathsf T}}{\sqrt{d_k}}\right) V$$
For example, the following mask matrix is used in autoregressive modeling:
$$M_{\text{causal}} = \begin{pmatrix} 0 & -\infty & -\infty & \cdots \\ 0 & 0 & -\infty & \cdots \\ 0 & 0 & 0 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$
In words, it means that each token can pay attention to itself and every token before it, but not to any token after it.
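
A short NumPy sketch of how the causal mask above can be built, for illustration:

    import numpy as np

    def causal_mask(n: int) -> np.ndarray:
        """0 where attention is allowed (j <= i), -inf where the link is cut (j > i)."""
        mask = np.zeros((n, n))
        mask[np.triu_indices(n, k=1)] = -np.inf
        return mask

    # The mask is added to Q K^T / sqrt(d_k) before the softmax, so -inf entries
    # become exactly zero attention weights.
    print(causal_mask(4))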

Encoder

One encoder layer.

An encoder consists of an embedding layer, followed by multiple encoder layers.

Each encoder layer consists of two major components: a self-attention mechanism and a feed-forward layer. It takes as input a sequence of vectors, applies the self-attention mechanism to produce an intermediate sequence of vectors, then applies the feed-forward layer to each vector individually. Writing the input vectors as the rows of a matrix $H$, this can be written succinctly as
$$\text{EncoderLayer}(H) = \text{FFN}\big(\text{MultiheadedAttention}(H, H, H)\big)$$
where $\text{FFN}$ stands for "feed-forward network", with the implicit convention that $\text{FFN}$ is applied to each row of the matrix individually.

The encoder layers are stacked. The first encoder layer takes the sequence of input vectors from the embedding layer, producing a sequence of vectors. This sequence of vectors is processed by the second encoder, and so on. The output from the final encoder layer is then used by the decoder.

As the encoder processes the entire input all at once, every token can attend to every other token (all-to-all attention), so there is no need for causal masking.
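
A minimal NumPy sketch of a single encoder layer in the post-LN ordering; single-head attention is used for brevity, and the weights are random stand-ins rather than trained parameters:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def layer_norm(x, eps=1e-5):
        return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

    def encoder_layer(H, Wq, Wk, Wv, W1, b1, W2, b2):
        """Self-attention, then a position-wise ReLU feed-forward network, each wrapped
        with a residual connection and layer normalization (post-LN ordering)."""
        Q, K, V = H @ Wq, H @ Wk, H @ Wv
        attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
        H = layer_norm(H + attn)                        # residual + LN around attention
        ffn = np.maximum(0, H @ W1 + b1) @ W2 + b2      # applied to each row independently
        return layer_norm(H + ffn)                      # residual + LN around the FFN

    rng = np.random.default_rng(0)
    d, d_ff, seq = 16, 64, 6
    H = rng.normal(size=(seq, d))
    out = encoder_layer(H,
                        rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                        rng.normal(size=(d, d_ff)), np.zeros(d_ff),
                        rng.normal(size=(d_ff, d)), np.zeros(d))
    print(out.shape)   # (6, 16)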

Decoder

One decoder layer.

A decoder consists of an embedding layer, followed by multiple decoder layers, followed by an un-embedding layer.

Each decoder layer consists of three major components: a causally masked self-attention mechanism, a cross-attention mechanism, and a feed-forward neural network. The decoder functions in a similar fashion to the encoder, but an additional attention mechanism is inserted which instead draws relevant information from the encodings generated by the encoders. This mechanism can also be called the encoder-decoder attention.[1][49]

Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. The transformer must not use the current or future output to predict an output, so the output sequence must be partially masked to prevent this reverse information flow.[1] This allows for autoregressive text generation. For decoding, all-to-all attention is inappropriate, because a token cannot attend to tokens not yet generated. Thus, the self-attention module in the decoder is causally masked.

In contrast, the cross-attention mechanism attends to the output vectors of the encoder, which are computed before the decoder starts decoding. Consequently, there is no need for masking in the cross-attention mechanism.

Schematically, we have
$$\text{DecoderLayer}(H) = \text{FFN}\!\Big(\text{MultiheadedAttention}\big(\text{MaskedMultiheadedAttention}(H, H, H),\; H^E,\; H^E\big)\Big)$$
where $H^E$ is the matrix with rows being the output vectors from the encoder.

The last decoder is followed by a final un-embedding layer, that is, a linear transformation followed by a softmax, to produce the output probabilities over the vocabulary. Then, one of the tokens is sampled according to that probability distribution, and the decoder can be run again to produce the next token, and so on, autoregressively generating the output text.
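
A minimal sketch of this autoregressive loop with greedy token selection; next_token_logits is a hypothetical stand-in for the full decoder stack plus the un-embedding layer:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, eos_id = 10, 9

    def next_token_logits(tokens: list[int]) -> np.ndarray:
        """Placeholder for: run the decoder on `tokens` (with cross-attention to the
        encoder output) and un-embed the last position. Random numbers stand in here."""
        return rng.normal(size=vocab_size)

    tokens = [0]                                  # start-of-sequence token
    for _ in range(20):                           # generation budget
        tok = int(np.argmax(next_token_logits(tokens)))   # greedy choice; sampling is also common
        tokens.append(tok)
        if tok == eos_id:                         # stop at the end-of-sequence token
            break
    print(tokens)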

Full transformer architecture

(a) One encoder layer and one decoder layer. (b) Two encoder layers and two decoder layers. The sublayers are labelled as well.
Block diagram for the full Transformer architecture.
Schematic object hierarchy for the full Transformer architecture, in object-oriented programming style.

The encoder and decoder layers are as described. Multiple encoder and decoder layers are composed into an encoder-decoder configuration, as illustrated.

The final points of detail are the residual connections and layer normalization (LayerNorm, or LN), which, while conceptually unnecessary, are necessary for numerical stability and convergence. That is, the output of each sublayer is $\mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sublayer itself.

Each encoder layer contains two sublayers: the self-attention and the feedforward network. Each decoder layer contains three sublayers: the causally masked self-attention, the cross-attention, and the feedforward network.

Terminology


The Transformer architecture, being modular, allows variations. Several common variations are described here.[51]

An "encoder-only" Transformer applies the encoder to map an input text into a sequence of vectors that represent the input text. This is usually used for text embedding and representation learning for downstream applications. BERT is encoder-only. They are found to be not significantly better than training an encoder-decoder Transformer, then taking just the encoder.[52]

A "decoder-only" Transformer is not literally decoder-only, since without an encoder, the cross-attention mechanism has nothing to attend to. Thus, the decoder layers in a decoder-only Transformer is composed of just two sublayers: the causally masked self-attention, the feedforward network. This is usually used for text generation and instruction following. The models in the GPT series are decoder-only.

An "encoder-decoder" Transformer is generally the same as the original Transformer, with 2 sublayers per encoder layer and 4 sublayers per decoder layer, etc. They might have minor architectural improvements, such as alternative activation functions, changing the location of normalization, etc. This is also usually used for text generation and instruction following. The models in the T5 series are encoder-decoder.[51]

A "prefixLM" (prefix language model) is a decoder-only architecture, but with prefix masking, which is different from causal masking. Specifically, it has mask of the form[51]where the first columns correspond to the "prefix", and the subsequent columns correspond to the autoregressively generated text based on the prefix. They resemble encoder-decoder models, but has less "sparsity". Such models are rarely used, though they are cited as theoretical possibilities and benchmarked comparisons.[52]

Subsequent work


Alternative activation functions


The original transformer uses the ReLU activation function. Other activation functions were developed, such as SwiGLU.[53]
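
A minimal sketch of a SwiGLU feed-forward block as described by Shazeer (2020); bias terms are omitted for brevity and the weights are random stand-ins:

    import numpy as np

    def swish(x, beta=1.0):
        return x / (1.0 + np.exp(-beta * x))       # x * sigmoid(beta * x)

    def ffn_swiglu(x, W, V, W2):
        """SwiGLU feed-forward block: (Swish(xW) elementwise-times xV) W2,
        used in place of the original ReLU feed-forward network."""
        return (swish(x @ W) * (x @ V)) @ W2

    rng = np.random.default_rng(0)
    d, d_ff = 16, 32
    x = rng.normal(size=(4, d))
    out = ffn_swiglu(x, rng.normal(size=(d, d_ff)), rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d)))
    print(out.shape)   # (4, 16)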

Alternative positional encodings


Transformers may use other positional encoding methods than sinusoidal.[54]

RoPE


RoPE (rotary positional embedding)[55] is best explained by considering a list of 2-dimensional vectors $\big[(x^{(1)}_1, x^{(2)}_1), (x^{(1)}_2, x^{(2)}_2), \ldots\big]$. Now pick some angle $\theta$. Then RoPE encoding is
$$\text{RoPE}\big(x^{(1)}_m, x^{(2)}_m, m\big) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x^{(1)}_m \\ x^{(2)}_m \end{pmatrix}$$
Equivalently, if we write the 2-dimensional vectors as complex numbers $z_m = x^{(1)}_m + i x^{(2)}_m$, then RoPE encoding is just multiplication by an angle:
$$\text{RoPE}(z_m, m) = e^{i m \theta} z_m$$
For a list of $2n$-dimensional vectors, a RoPE encoder is defined by a sequence of angles $\theta^{(1)}, \ldots, \theta^{(n)}$. Then the RoPE encoding is applied to each pair of coordinates.

The benefit of RoPE is that the dot-product between two vectors depends on their relative location only:
$$\text{RoPE}(x, m)^{\mathsf T}\, \text{RoPE}(y, n) = \text{RoPE}(x, m + k)^{\mathsf T}\, \text{RoPE}(y, n + k)$$
for any integer $k$.
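
A minimal NumPy sketch of RoPE applied to query and key vectors, with a numerical check of the relative-position property above; the dimensions and the shift of 7 are illustrative:

    import numpy as np

    def rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
        """Rotate each coordinate pair (2k, 2k+1) of the token at position m by m * theta_k."""
        d = x.shape[-1]
        theta = base ** (-2 * np.arange(d // 2) / d)        # one angle rate per pair
        ang = np.outer(positions, theta)
        cos, sin = np.cos(ang), np.sin(ang)
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = np.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    rng = np.random.default_rng(0)
    q, k = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
    pos = np.arange(8)
    # Relative-position property: dot products are unchanged when both positions shift by 7.
    a = rope(q, pos)[5] @ rope(k, pos)[2]
    b = rope(q, pos + 7)[5] @ rope(k, pos + 7)[2]
    assert np.allclose(a, b)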

ALiBi


ALiBi (Attention with Linear Biases)[56] is not a replacement for the positional encoder on the original transformer. Instead, it is an additional positional encoder that is directly plugged into the attention mechanism. Specifically, the ALiBi attention mechanism is
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\mathsf T}}{\sqrt{d_k}} + s B\right) V$$
Here, $s$ is a real number ("scalar"), and $B$ is the linear bias matrix defined by
$$B = \begin{pmatrix} 0 & 1 & 2 & 3 & \cdots \\ -1 & 0 & 1 & 2 & \cdots \\ -2 & -1 & 0 & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$
in other words, $B_{i,j} = j - i$.

ALiBi allows pretraining on short context windows, then finetuning on longer context windows. Since it is directly plugged into the attention mechanism, it can be combined with any positional encoder that is plugged into the "bottom" of the entire network (which is where the sinusoidal encoder on the original transformer, as well as RoPE and many others, are located).
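
A short sketch constructing the linear bias matrix above; in the ALiBi paper the slope differs per attention head, so the single slope used here is purely illustrative:

    import numpy as np

    def alibi_bias(n: int, slope: float) -> np.ndarray:
        """Linear bias B with B[i, j] = j - i, scaled by a slope; it is added to
        Q K^T / sqrt(d_k) before the softmax."""
        j = np.arange(n)
        return slope * (j[None, :] - j[:, None])

    print(alibi_bias(4, slope=0.5))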

Relative Position Encodings


Relative Position Encodings[57] is similar to ALiBi, but more generic:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\mathsf T}}{\sqrt{d_k}} + B\right) V$$
where $B$ is a Toeplitz matrix, that is, $B_{i,j} = B_{i',j'}$ whenever $i - j = i' - j'$.

Normalization

Transformer encoder with norm-first and norm-last.
Transformer decoder with norm-first and norm-last.

In 2020, difficulties with converging the original transformer were solved by applying layer normalization before (instead of after) multiheaded attention. This is called pre-LN Transformer.[43]
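
A minimal sketch contrasting the two orderings, post-LN as in the original transformer and pre-LN as in the 2020 paper; the sublayer f here is a random stand-in for attention or the feed-forward network:

    import numpy as np

    def layer_norm(x, eps=1e-5):
        return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

    def sublayer(x, f, norm_first: bool):
        """Residual wrapper around a sublayer f (attention or feed-forward network).
        norm_first=False: post-LN as in the original transformer, LN(x + f(x)).
        norm_first=True:  pre-LN, x + f(LN(x))."""
        return x + f(layer_norm(x)) if norm_first else layer_norm(x + f(x))

    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 16))
    f = lambda h: np.maximum(0, h @ W)             # stand-in sublayer
    x = rng.normal(size=(4, 16))
    print(sublayer(x, f, norm_first=True).shape)   # (4, 16)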

Efficient implementation


FlashAttention


FlashAttention[58] is an algorithm that implements the transformer attention mechanism efficiently on a GPU. It performs matrix multiplications in blocks, such that each block fits within the cache of a GPU, and by careful management of the blocks it minimizes data copying between GPU caches (as data movement is slow).
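
The following is a simplified NumPy sketch of the blocked, "online softmax" computation that underlies this idea; the real FlashAttention kernel also tiles the queries and explicitly manages GPU SRAM, which this sketch does not attempt:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def blocked_attention(Q, K, V, block=64):
        """Attention over key/value blocks with a running ("online") softmax, so the full
        seq_len x seq_len score matrix is never materialized at once."""
        d = K.shape[-1]
        m = np.full(Q.shape[0], -np.inf)           # running row-wise maximum of the scores
        l = np.zeros(Q.shape[0])                   # running softmax denominator
        o = np.zeros((Q.shape[0], V.shape[-1]))    # running unnormalized output
        for start in range(0, K.shape[0], block):
            Kb, Vb = K[start:start + block], V[start:start + block]
            s = Q @ Kb.T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=-1))
            corr = np.exp(m - m_new)               # rescale the previous partial results
            p = np.exp(s - m_new[:, None])
            l = l * corr + p.sum(axis=-1)
            o = o * corr[:, None] + p @ Vb
            m = m_new
        return o / l[:, None]

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(256, 32)) for _ in range(3))
    assert np.allclose(blocked_attention(Q, K, V), softmax(Q @ K.T / np.sqrt(32)) @ V)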

An improved version, FlashAttention-2,[59][60][61] was developed to cater to the rising demand for language models capable of handling longer context lengths. It offers enhancements in work partitioning and parallelism, enabling it to achieve up to 230 TFLOPs/s on A100 GPUs (FP16/BF16), a 2x speed increase over the original FlashAttention.

Key advancements in FlashAttention-2 include the reduction of non-matmul FLOPs, improved parallelism over the sequence length dimension, better work partitioning between GPU warps, and added support for head dimensions up to 256 and multi-query attention (MQA) and grouped-query attention (GQA).

Benchmarks revealed FlashAttention-2 to be up to 2x faster than FlashAttention and up to 9x faster than a standard attention implementation in PyTorch. Future developments include optimization for new hardware like H100 GPUs and new data types like FP8.

Multi-Query Attention


Multi-Query Attention changes the multiheaded attention mechanism.[62] Whereas normally
$$\text{MultiheadedAttention}(Q, K, V) = \text{Concat}_{i \in [\#\text{heads}]}\!\left(\text{Attention}\big(Q W_i^Q,\; K W_i^K,\; V W_i^V\big)\right) W^O,$$
with Multi-Query Attention there is just one $W^K$ and one $W^V$, shared by all heads:
$$\text{MultiQueryAttention}(Q, K, V) = \text{Concat}_{i \in [\#\text{heads}]}\!\left(\text{Attention}\big(Q W_i^Q,\; K W^K,\; V W^V\big)\right) W^O.$$

This has a neutral effect on model quality and training speed, but increases inference speed.
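
A minimal NumPy sketch of multi-query attention, with one query projection per head but a single shared key and value projection; the weights are random stand-ins:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_query_attention(X, W_q_heads, W_k, W_v, W_o):
        """Each head has its own query projection, but all heads share one key
        projection W_k and one value projection W_v."""
        K, V = X @ W_k, X @ W_v                    # computed once, shared by every head
        heads = [softmax((X @ W_q) @ K.T / np.sqrt(K.shape[-1])) @ V for W_q in W_q_heads]
        return np.concatenate(heads, axis=-1) @ W_o

    rng = np.random.default_rng(0)
    seq, d_model, h, d_head = 5, 16, 4, 4
    X = rng.normal(size=(seq, d_model))
    W_q_heads = rng.normal(size=(h, d_model, d_head))
    W_k, W_v = rng.normal(size=(d_model, d_head)), rng.normal(size=(d_model, d_head))
    W_o = rng.normal(size=(h * d_head, d_model))
    print(multi_query_attention(X, W_q_heads, W_k, W_v, W_o).shape)   # (5, 16)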

Caching


When an autoregressive transformer is used for inference, such as generating text, the query vector is different at each step, but the already-computed key and value vectors are always the same. The KV caching method saves the computed key and value vectors at each attention block, so that they are not recomputed at each new token. PagedAttention applies memory paging to KV caching.[63][64][65]

If a transformer is used with a baked-in prompt, such as ["You are a customer support agent..."], then the key and value vectors can be computed for the prompt, and saved on disk. The saving in compute is significant when the model is used for many short interactions, such as in online chatbots.
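
A minimal sketch of a KV cache for one attention head during decoding; the weights and token embeddings are random stand-ins:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 16
    W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
    k_cache, v_cache = [], []

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def decode_step(x_new: np.ndarray) -> np.ndarray:
        """x_new: embedding of the newest token, shape (d,)."""
        k_cache.append(x_new @ W_k)                 # keys and values are computed once per token...
        v_cache.append(x_new @ W_v)
        K, V = np.stack(k_cache), np.stack(v_cache) # ...and reused at every later step
        q = x_new @ W_q                             # only the query is new at each step
        return softmax(q @ K.T / np.sqrt(d)) @ V

    for token_embedding in rng.normal(size=(5, d)): # five decoding steps
        out = decode_step(token_embedding)
    print(out.shape)                                # (16,)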

Speculative decoding


Transformers are used in large language models for autoregressive sequence generation: generating a stream of text, one token at a time. However, in most settings, decoding from language models is memory-bound, meaning that we have spare compute power available. Speculative decoding[66][67] uses this spare compute power by computing several tokens in parallel. Similarly to speculative execution in CPUs, future tokens are computed concurrently, by speculating on the value of previous tokens, and are later discarded if it turns out the speculation was incorrect.

Specifically, consider a transformer model like GPT-3 with a context window size of 512. To generate an entire context window autoregressively with greedy decoding, it must be run 512 times, each time generating a token $x_1, x_2, \ldots, x_{512}$. However, if we had some educated guess for the values of these tokens, we could verify all of them in parallel, in one run of the model, by checking that each $x_t$ is indeed the token with the largest log-likelihood in the $t$-th output.

In speculative decoding, a smaller model or some other simple heuristic is used to generate a few speculative tokens that are subsequently verified by the larger model. For example, suppose a small model generated four speculative tokens: $\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_4$. These tokens are run through the larger model, and only $\tilde{x}_1$ and $\tilde{x}_2$ are accepted. The same run of the large model already generated a new token $x_3$ to replace $\tilde{x}_3$, and $\tilde{x}_4$ is completely discarded. The process then repeats (starting from the 4th token) until all tokens are generated.
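
A minimal sketch of the greedy verification step; both models here are random placeholders, purely to show the accept/replace logic:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = 10

    def large_model_logits(tokens: list[int]) -> np.ndarray:
        """Placeholder: one parallel pass returning next-token logits at every position."""
        return rng.normal(size=(len(tokens), vocab))

    def draft_tokens(tokens: list[int], k: int = 4) -> list[int]:
        """Placeholder for a small, fast model proposing k speculative tokens."""
        return list(rng.integers(0, vocab, size=k))

    tokens = [0]
    draft = draft_tokens(tokens)
    logits = large_model_logits(tokens + draft)     # one verification pass of the large model
    accepted = []
    for i, t in enumerate(draft):
        best = int(np.argmax(logits[len(tokens) - 1 + i]))
        if t != best:                               # first mismatch: keep the correction and stop
            accepted.append(best)
            break
        accepted.append(t)
    else:                                           # every draft matched: keep the bonus token too
        accepted.append(int(np.argmax(logits[-1])))
    tokens += accepted
    print(tokens)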

For non-greedy decoding, similar ideas apply, except the speculative tokens are accepted or rejected stochastically, in a way that guarantees the final output distribution is the same as if speculative decoding was not used.[66][68]

Sub-quadratic transformers


Training transformer-based architectures can be expensive, especially for long inputs.[69] Alternative architectures include the Reformer (which reduces the computational load from $O(N^2)$ to $O(N \ln N)$[69]), or models like ETC/BigBird (which can reduce it to $O(N)$),[70] where $N$ is the length of the sequence. This is done using locality-sensitive hashing and reversible layers.[71][72]

Ordinary transformers require a memory size that is quadratic in the size of the context window. Attention-free transformers[73] reduce this to a linear dependence while still retaining the advantages of a transformer by linking the key to the value.

Long Range Arena (2020)[74] is a standard benchmark for comparing the behavior of transformer architectures over long inputs.

Random Feature Attention (2021)[75] uses Fourier random features:
$$\varphi(x) = \frac{1}{\sqrt{D}}\big[\cos\langle w_1, x\rangle, \sin\langle w_1, x\rangle, \ldots, \cos\langle w_D, x\rangle, \sin\langle w_D, x\rangle\big]$$
where $w_1, \ldots, w_D$ are independent Gaussian random vectors, chosen so that $\mathbb{E}\big[\langle \varphi(x), \varphi(y)\rangle\big] = e^{-\|x-y\|^2 / (2\sigma^2)}$, and consequently
$$e^{\langle x, y\rangle / \sigma^2} \approx \big\langle e^{\|x\|^2/(2\sigma^2)} \varphi(x),\; e^{\|y\|^2/(2\sigma^2)} \varphi(y)\big\rangle.$$
Consequently, the one-headed attention, with one query, can be written as
$$\text{Attention}(q, K, V) = \text{softmax}\!\left(\frac{q K^{\mathsf T}}{\sqrt{d}}\right) V \approx \frac{\varphi(q)^{\mathsf T} \sum_i e^{\|k_i\|^2/(2\sigma^2)}\, \varphi(k_i)\, v_i^{\mathsf T}}{\varphi(q)^{\mathsf T} \sum_i e^{\|k_i\|^2/(2\sigma^2)}\, \varphi(k_i)}$$
where $\sigma = d^{1/4}$.

This approximation can be computed in linear time, as we can compute the matrix $\sum_i e^{\|k_i\|^2/(2\sigma^2)}\, \varphi(k_i)\, v_i^{\mathsf T}$ first, then multiply it with the query. In essence, we have managed to obtain a more precise version of the naive linear-attention approximation in which the exponential rescaling factors are dropped.

Performer (2022)[76] uses the same Random Feature Attention, but the random vectors $w_1, \ldots, w_D$ are first independently sampled from a normal distribution and then Gram–Schmidt orthogonalized.
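
A small Monte-Carlo sketch (with $\sigma = 1$ and unit-covariance Gaussian samples, chosen here for simplicity) of the identity that Fourier random features approximate a Gaussian kernel, which is the basis of Random Feature Attention and Performer:

    import numpy as np

    rng = np.random.default_rng(0)
    d, D = 8, 20000                      # input dimension, number of random features
    w = rng.normal(size=(D, d))          # random feature directions

    def phi(x: np.ndarray) -> np.ndarray:
        proj = w @ x
        return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

    x, y = 0.3 * rng.normal(size=d), 0.3 * rng.normal(size=d)
    approx = phi(x) @ phi(y)
    exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
    print(approx, exact)                 # the two values should agree closely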

Multimodality


Transformers can also be used/adapted for modalities (input or output) beyond just text, usually by finding a way to "tokenize" the modality.

Vision transformers[37] adapt the transformer to computer vision by breaking down input images as a series of patches, turning them into vectors, and treating them like tokens in a standard transformer.
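
A short sketch of the patch-extraction step that turns an image into a token sequence; the patch and image sizes are illustrative, and the learned linear projection and position embeddings that follow are omitted:

    import numpy as np

    def patchify(image: np.ndarray, patch: int) -> np.ndarray:
        """Split an image (H, W, C) into non-overlapping patches and flatten each patch
        into a vector, yielding a (num_patches, patch*patch*C) token sequence."""
        H, W, C = image.shape
        assert H % patch == 0 and W % patch == 0
        x = image.reshape(H // patch, patch, W // patch, patch, C)
        x = x.transpose(0, 2, 1, 3, 4)
        return x.reshape(-1, patch * patch * C)

    img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
    print(patchify(img, patch=16).shape)    # (4, 768): four 16x16x3 patches as tokens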

Conformer[38] and later Whisper[77] follow the same pattern for speech recognition, first turning the speech signal into a spectrogram, which is then treated like an image, i.e. broken down into a series of patches, turned into vectors and treated like tokens in a standard transformer.

Perceivers by Andrew Jaegle et al. (2021)[78][79] can learn from large amounts of heterogeneous data.

Regarding image outputs, Peebles et al. introduced a diffusion transformer (DiT) which facilitates use of the transformer architecture for diffusion-based image production.[80] Also, Google released a transformer-centric image generator called "Muse" based on parallel decoding and masked generative transformer technology.[81] (Transformers played a less-central role with prior image-producing technologies,[82] albeit still a significant one.[83])


Notes

  1. ^ Gated recurrent units (2014) further reduced its complexity.
  2. ^ Some architectures, such as RWKV or state space models, avoid the issue.

References

  1. ^ a b c d e f g h i Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  2. ^ a b Bahdanau; Cho, Kyunghyun; Bengio, Yoshua (September 1, 2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].
  3. ^ Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (August 17, 2015). "Effective Approaches to Attention-based Neural Machine Translation". arXiv:1508.04025 [cs.CL].
  4. ^ a b Hochreiter, Sepp; Schmidhuber, Jürgen (1 November 1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. ISSN 0899-7667. PMID 9377276. S2CID 1915014.
  5. ^ a b "Better Language Models and Their Implications". OpenAI. 2019-02-14. Archived from the original on 2020-12-19. Retrieved 2019-08-25.
  6. ^ Radford, Alec; Jong Wook Kim; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). "Robust Speech Recognition via Large-Scale Weak Supervision". arXiv:2212.04356 [eess.AS].
  7. ^ Monastirsky, Maxim; Azulay, Osher; Sintov, Avishai (February 2023). "Learning to Throw With a Handful of Samples Using Decision Transformers". IEEE Robotics and Automation Letters. 8 (2): 576–583. doi:10.1109/LRA.2022.3229266. ISSN 2377-3766.
  8. ^ a b Ruoss, Anian; Delétang, Grégoire; Medapati, Sourabh; Grau-Moya, Jordi; Wenliang, Li; Catt, Elliot; Reid, John; Genewein, Tim (2024-02-07). "Grandmaster-Level Chess Without Search". arXiv:2402.04494v1 [cs.LG].
  9. ^ a b Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). "Transformers: State-of-the-Art Natural Language Processing". Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6. S2CID 208117506.
  10. ^ a b c "Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing". Google AI Blog. 2 November 2018. Archived from the original on 2021-01-13. Retrieved 2019-08-25.
  11. ^ Feldman, J. A.; Ballard, D. H. (1982-07-01). "Connectionist models and their properties". Cognitive Science. 6 (3): 205–254. doi:10.1016/S0364-0213(82)80001-3. ISSN 0364-0213.
  12. ^ Rumelhart, David E.; Mcclelland, James L.; Group, PDP Research (1987-07-29). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2 (PDF). Cambridge, Mass: Bradford Books. ISBN 978-0-262-68053-0.
  13. ^ Giles, C. Lee; Maxwell, Tom (1987-12-01). "Learning, invariance, and generalization in high-order neural networks". Applied Optics. 26 (23): 4972. doi:10.1364/AO.26.004972. ISSN 0003-6935.
  14. ^ Schmidhuber, Jürgen (1992). "Learning to control fast-weight memories: an alternative to recurrent nets" (PDF). Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. S2CID 16683347.
  15. ^ Hinton, Geoffrey E.; Plaut, David C. (1987). "Using Fast Weights to Deblur Old Memories". Proceedings of the Annual Meeting of the Cognitive Science Society. 9.
  16. ^ Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François (2020). "Transformers are RNNs: Fast autoregressive Transformers with linear attention". ICML 2020. PMLR. pp. 5156–5165.
  17. ^ Schlag, Imanol; Irie, Kazuki; Schmidhuber, Jürgen (2021). "Linear Transformers Are Secretly Fast Weight Programmers". ICML 2021. Springer. pp. 9355–9366.
  18. ^ Schmidhuber, Jürgen (1993). "Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets". ICANN 1993. Springer. pp. 460–463.
  19. ^ Schmidhuber, Jürgen (2022). "Deep Learning: Our Miraculous Year 1990-1991". idsia.ch. Retrieved 2024-07-23.
  20. ^ a b c Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014-06-03). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". doi:10.48550/ARXIV.1406.1078.
  21. ^ a b c Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (14 Dec 2014). "Sequence to sequence learning with neural networks". arXiv:1409.3215 [cs.CL]. [first version posted to arXiv on 10 Sep 2014]
  22. ^ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv:1412.3555 [cs.NE].
  23. ^ Gruber, N.; Jockisch, A. (2020), "Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?", Frontiers in Artificial Intelligence, 3: 40, doi:10.3389/frai.2020.00040, PMC 7861254, PMID 33733157, S2CID 220252321
  24. ^ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V (2014). "Sequence to Sequence Learning with Neural Networks". Advances in Neural Information Processing Systems. 27. Curran Associates, Inc.
  25. ^ Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (2015). "Effective Approaches to Attention-based Neural Machine Translation". arXiv:1508.04025 [cs.CL].
  26. ^ Wu, Yonghui; et al. (2016-09-01). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". arXiv:1609.08144 [cs.CL].
  27. ^ Lewis-Kraus, Gideon (2016-12-14). "The Great A.I. Awakening". The New York Times. ISSN 0362-4331. Archived from the original on 24 May 2023. Retrieved 2023-06-22.
  28. ^ Parikh, Ankur P.; Täckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (2016-09-25). "A Decomposable Attention Model for Natural Language Inference". arXiv:1606.01933 [cs.CL].
  29. ^ Levy, Steven. "8 Google Employees Invented Modern AI. Here's the Inside Story". Wired. ISSN 1059-1028. Archived from the original on 29 Mar 2024. Retrieved 2024-08-06.
  30. ^ Peng, Bo; Alcaide, Eric; Anthony, Quentin; Albalak, Alon; Arcadinho, Samuel; Biderman, Stella; Cao, Huanqi; Cheng, Xin; Chung, Michael (2023-12-10), RWKV: Reinventing RNNs for the Transformer Era, doi:10.48550/arXiv.2305.13048, retrieved 2024-08-06
  31. ^ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
  32. ^ "Google: BERT now used on almost every English query". Search Engine Land. 2020-10-15. Retrieved 2020-11-24.
  33. ^ "Recent Advances in Google Translate". research.google. Retrieved 2024-05-08.
  34. ^ "The inside story of how ChatGPT was built from the people who made it". MIT Technology Review. Retrieved 2024-08-06.
  35. ^ "Improving language understanding with unsupervised learning". openai.com. June 11, 2018. Archived from the original on 2023-03-18. Retrieved 2023-03-18.
  36. ^ finetune-transformer-lm, OpenAI, June 11, 2018, retrieved 2023-05-01
  37. ^ a b Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
  38. ^ a b Gulati, Anmol; Qin, James; Chiu, Chung-Cheng; Parmar, Niki; Zhang, Yu; Yu, Jiahui; Han, Wei; Wang, Shibo; Zhang, Zhengdong; Wu, Yonghui; Pang, Ruoming (2020). "Conformer: Convolution-augmented Transformer for Speech Recognition". arXiv:2005.08100 [eess.AS].
  39. ^ Chen, Lili; Lu, Kevin; Rajeswaran, Aravind; Lee, Kimin; Grover, Aditya; Laskin, Michael; Abbeel, Pieter; Srinivas, Aravind; Mordatch, Igor (2021-06-24), Decision Transformer: Reinforcement Learning via Sequence Modeling, doi:10.48550/arXiv.2106.01345, retrieved 2024-08-06
  40. ^ Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Mohiuddin, Afroz (2022-11-19), Rethinking Attention with Performers, doi:10.48550/arXiv.2009.14794, retrieved 2024-08-06
  41. ^ Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). "A ConvNet for the 2020s". pp. 11976–11986.
  42. ^ Esser, Patrick; Kulal, Sumith; Blattmann, Andreas; Entezari, Rahim; Müller, Jonas; Saini, Harry; Levi, Yam; Lorenz, Dominik; Sauer, Axel (2024-03-05), Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, arXiv:2403.03206
  43. ^ a b Xiong, Ruibin; Yang, Yunchang; He, Di; Zheng, Kai; Zheng, Shuxin; Xing, Chen; Zhang, Huishuai; Lan, Yanyan; Wang, Liwei; Liu, Tie-Yan (2020-06-29). "On Layer Normalization in the Transformer Architecture". arXiv:2002.04745 [cs.LG].
  44. ^ Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020-01-01). "Exploring the limits of transfer learning with a unified text-to-text transformer". The Journal of Machine Learning Research. 21 (1): 140:5485–140:5551. arXiv:1910.10683. ISSN 1532-4435.
  45. ^ "Masked language modeling". huggingface.co. Retrieved 2023-10-05.
  46. ^ "Causal language modeling". huggingface.co. Retrieved 2023-10-05.
  47. ^ Kariampuzha, William; Alyea, Gioconda; Qu, Sue; Sanjak, Jaleal; Mathé, Ewy; Sid, Eric; Chatelaine, Haley; Yadaw, Arjun; Xu, Yanji; Zhu, Qian (2023). "Precision information extraction for rare disease epidemiology at scale". Journal of Translational Medicine. 21 (1): 157. doi:10.1186/s12967-023-04011-y. PMC 9972634. PMID 36855134.
  48. ^ "Sequence Modeling with Neural Networks (Part 2): Attention Models". Indico. 2016-04-18. Archived from the original on 2020-10-21. Retrieved 2019-10-15.
  49. ^ a b c Alammar, Jay. "The Illustrated Transformer". jalammar.github.io. Archived from the original on 2020-10-18. Retrieved 2019-10-15.
  50. ^ Clark, Kevin; Khandelwal, Urvashi; Levy, Omer; Manning, Christopher D. (August 2019). "What Does BERT Look at? An Analysis of BERT's Attention". Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Florence, Italy: Association for Computational Linguistics: 276–286. arXiv:1906.04341. doi:10.18653/v1/W19-4828. Archived from the original on 2020-10-21. Retrieved 2020-05-20.
  51. ^ a b c Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research. 21 (140): 1–67. ISSN 1533-7928.
  52. ^ a b Tay, Yi; Dehghani, Mostafa; Tran, Vinh Q.; Garcia, Xavier; Wei, Jason; Wang, Xuezhi; Chung, Hyung Won; Shakeri, Siamak; Bahri, Dara (2023-02-28), UL2: Unifying Language Learning Paradigms, doi:10.48550/arXiv.2205.05131, retrieved 2024-08-06
  53. ^ Shazeer, Noam (2020-02-01). "GLU Variants Improve Transformer". arXiv:2002.05202 [cs.LG].
  54. ^ Dufter, Philipp; Schmitt, Martin; Schütze, Hinrich (2022-06-06). "Position Information in Transformers: An Overview". Computational Linguistics. 48 (3): 733–763. arXiv:2102.11090. doi:10.1162/coli_a_00445. ISSN 0891-2017. S2CID 231986066.
  55. ^ Su, Jianlin; Lu, Yu; Pan, Shengfeng; Murtadha, Ahmed; Wen, Bo; Liu, Yunfeng (2021-04-01). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv:2104.09864 [cs.CL].
  56. ^ Press, Ofir; Smith, Noah A.; Lewis, Mike (2021-08-01). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation". arXiv:2108.12409 [cs.CL].
  57. ^ Shaw, Peter; Uszkoreit, Jakob; Vaswani, Ashish (2018). "Self-Attention with Relative Position Representations". arXiv:1803.02155 [cs.CL].
  58. ^ Dao, Tri; Fu, Dan; Ermon, Stefano; Rudra, Atri; Ré, Christopher (2022-12-06). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". Advances in Neural Information Processing Systems. 35: 16344–16359. arXiv:2205.14135.
  59. ^ "Stanford CRFM". crfm.stanford.edu. Retrieved 2023-07-18.
  60. ^ "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning". Princeton NLP. 2023-06-17. Retrieved 2023-07-18.
  61. ^ "Introducing Together AI Chief Scientist Tri Dao, as he releases FlashAttention-2 to speed up model training and inference". TOGETHER. Retrieved 2023-07-18.
  62. ^ Chowdhery, Aakanksha; Narang, Sharan; Devlin, Jacob; Bosma, Maarten; Mishra, Gaurav; Roberts, Adam; Barham, Paul; Chung, Hyung Won; Sutton, Charles; Gehrmann, Sebastian; Schuh, Parker; Shi, Kensen; Tsvyashchenko, Sasha; Maynez, Joshua; Rao, Abhishek (2022-04-01). "PaLM: Scaling Language Modeling with Pathways". arXiv:2204.02311 [cs.CL].
  63. ^ Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody Hao; Gonzalez, Joseph; Zhang, Hao; Stoica, Ion (2023-10-23). "Efficient Memory Management for Large Language Model Serving with PagedAttention". Proceedings of the 29th Symposium on Operating Systems Principles. SOSP '23. New York, NY, USA: Association for Computing Machinery. pp. 611–626. arXiv:2309.06180. doi:10.1145/3600006.3613165. ISBN 979-8-4007-0229-7.
  64. ^ vllm-project/vllm, vLLM, 2024-06-20, retrieved 2024-06-20
  65. ^ Kwon, Woosuk; Li, Zhuohan; Zhuang, Siyuan; Sheng, Ying; Zheng, Lianmin; Yu, Cody; Gonzalez, Joey; Zhang, Hao; Stoica, Ion (2023-06-20). "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention". vLLM Blog. Retrieved 2024-06-20.
  66. ^ a b Leviathan, Yaniv; Kalman, Matan; Matias, Yossi (2023-05-18), Fast Inference from Transformers via Speculative Decoding, arXiv:2211.17192
  67. ^ Fu, Yao (2023-12-13). "Towards 100x Speedup: Full Stack Transformer Inference Optimization".
  68. ^ Chen, Charlie; Borgeaud, Sebastian; Irving, Geoffrey; Lespiau, Jean-Baptiste; Sifre, Laurent; Jumper, John (2023-02-02), Accelerating Large Language Model Decoding with Speculative Sampling, arXiv:2302.01318
  69. ^ a b Kitaev, Nikita; Kaiser, Łukasz; Levskaya, Anselm (2020). "Reformer: The Efficient Transformer". arXiv:2001.04451 [cs.LG].
  70. ^ "Constructing Transformers For Longer Sequences with Sparse Attention Methods". Google AI Blog. 25 March 2021. Archived from the original on 2021-09-18. Retrieved 2021-05-28.
  71. ^ "Tasks with Long Sequences – Chatbot". Coursera. Archived from the original on 2020-10-26. Retrieved 2020-10-22.
  72. ^ "Reformer: The Efficient Transformer". Google AI Blog. 16 January 2020. Archived from the original on 2020-10-22. Retrieved 2020-10-22.
  73. ^ Zhai, Shuangfei; Talbott, Walter; Srivastava, Nitish; Huang, Chen; Goh, Hanlin; Zhang, Ruixiang; Susskind, Josh (2021-09-21). "An Attention Free Transformer". arXiv:2105.14103 [cs.LG].
  74. ^ Tay, Yi; Dehghani, Mostafa; Abnar, Samira; Shen, Yikang; Bahri, Dara; Pham, Philip; Rao, Jinfeng; Yang, Liu; Ruder, Sebastian; Metzler, Donald (2020-11-08). "Long Range Arena: A Benchmark for Efficient Transformers". arXiv:2011.04006 [cs.LG].
  75. ^ Peng, Hao; Pappas, Nikolaos; Yogatama, Dani; Schwartz, Roy; Smith, Noah A.; Kong, Lingpeng (2021-03-19). "Random Feature Attention". arXiv:2103.02143 [cs.CL].
  76. ^ Choromanski, Krzysztof; Likhosherstov, Valerii; Dohan, David; Song, Xingyou; Gane, Andreea; Sarlos, Tamas; Hawkins, Peter; Davis, Jared; Belanger, David; Colwell, Lucy; Weller, Adrian (2020-09-30). "Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers". arXiv:2006.03555 [cs.LG].
  77. ^ Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). "Robust Speech Recognition via Large-Scale Weak Supervision". arXiv:2212.04356 [eess.AS].
  78. ^ Jaegle, Andrew; Gimeno, Felix; Brock, Andrew; Zisserman, Andrew; Vinyals, Oriol; Carreira, Joao (2021-06-22). "Perceiver: General Perception with Iterative Attention". arXiv:2103.03206 [cs.CV].
  79. ^ Jaegle, Andrew; Borgeaud, Sebastian; Alayrac, Jean-Baptiste; Doersch, Carl; Ionescu, Catalin; Ding, David; Koppula, Skanda; Zoran, Daniel; Brock, Andrew; Shelhamer, Evan; Hénaff, Olivier (2021-08-02). "Perceiver IO: A General Architecture for Structured Inputs & Outputs". arXiv:2107.14795 [cs.LG].
  80. ^ Peebles, William; Xie, Saining (March 2, 2023). "Scalable Diffusion Models with Transformers". arXiv:2212.09748 [cs.CV].
  81. ^ "Google AI Unveils Muse, a New Text-to-Image Transformer Model". InfoQ.
  82. ^ "Using Diffusion Models to Create Superior NeRF Avatars". January 5, 2023.
  83. ^ Islam, Arham (November 14, 2022). "How Do DALL·E 2, Stable Diffusion, and Midjourney Work?".

Further reading

– Discussion of the effect of a transformer layer as equivalent to a Hopfield update, bringing the input closer to one of the fixed points (representable patterns) of a continuous-valued Hopfield network