Aspects | LSTM | Transformers |
---|---|---|
Processing Method | Sequential (processes one token at a time) | Parallel (processes all tokens simultaneously) |
Handling Dependencies | Can struggle with long-term dependencies | Excels at capturing long-term dependencies |
Attention Mechanism | No inherent attention, can be added separately | Inherent multi-headed self-attention mechanism |
Efficiency and Scalability | Less efficient and scalable for very long sequences | More efficient and scalable for long sequences and large datasets |
The Transformer architecture is made up of:
- Input Embeddings: Converts the input tokens into continuous vectors that the model can work with.
- Encoder: Consists of multiple layers, each containing a self-attention mechanism and a feed-forward neural network, which process the input text.
- Decoder: Also consists of multiple layers, each containing a self-attention mechanism, an encoder-decoder attention mechanism, and a feed-forward neural network, which generate the output text.
- Self-Attention mechanism: Helps the model understand the relationship between words in a sentence, even if they're far apart.
- Positional Encoding: Added to the input embeddings to give the model a sense of word order in the sequence.
- Layer Normalization: A technique used within the encoder and decoder layers to help stabilise the training process.
- Residual Connection: Used in the encoder and decoder layers to help with gradient flow during training and mitigate the vanishing gradient problem.
- Encoder-Decoder attention mechanism: Used in the decoder to help it focus on relevant parts of the input when generating output.
- Output Linear Layer: Converts the decoder's output into logits for each token in the target vocabulary.
- Softmax Layer: Applied to the logits to produce probabilities for each token in the target vocabulary.
- Self-Attention: Enables the model to weigh the importance of each word relative to every other word in the sequence and understand the relationships between them.
- Scaled dot-product attention: Computes attention scores between words in a sequence through dot products, scaling, and softmax.
- Multi-head attention: Employs multiple attention heads in parallel to capture different aspects of the input, such as syntactic and semantic relationships, for a more comprehensive understanding.
- Positional Encoding: It provides information about the position of words. It involves adding unique vectors (fixed sinusoidal patterns in the original Transformer, or learned embeddings in some variants) to the input embeddings, allowing the model to recognize the order of words and understand the structure of the sentence (a sketch of the sinusoidal variant follows this list).
- Feed Forward Neural Network: It learns complex, non-linear relationships between input embeddings and their context.
- Layer Normalization: It results in improved training stability and faster convergence.
- Encoder: It processes the input text into a continuous representation that captures word relationships and context (a simplified encoder layer is sketched after this list).
- Decoder: It generates the output sequence one token at a time, conditioned on the encoder's representation and the tokens generated so far.
- Score Calculation: score(Q, K) = QKᵀ / √d_k, where d_k is the dimension of the key vectors
- Softmax Normalization: softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
- Weighted Sum: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V (a minimal code sketch of this computation follows the list)
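To make the attention formulas concrete, here is a minimal NumPy sketch of scaled dot-product attention. The toy dimensions (`seq_len`, `d_k`) and the random inputs are illustrative assumptions, not part of any particular library's API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Score calculation: dot products of queries and keys, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    # Softmax normalization over the key dimension (numerically stabilized)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Weighted sum of the value vectors
    return weights @ V                                 # (seq_len, d_k)

# Toy example: a sequence of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```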
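The positional-encoding sketch below implements the fixed sinusoidal scheme from the original Transformer paper; a learned embedding table is the common alternative mentioned above. The sizes (`max_len=50`, `d_model=512`) are illustrative, and the code assumes an even embedding dimension.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sinusoidal positional encodings: one d_model-dimensional vector per position."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even indices
    pe[:, 1::2] = np.cos(angles)   # cosine on odd indices
    return pe

# Added element-wise to the input embeddings before the first encoder layer
pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```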
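Finally, a simplified PyTorch sketch of one encoder layer shows how multi-head self-attention, the feed-forward network, residual connections, and layer normalization fit together. It assumes PyTorch's built-in `nn.MultiheadAttention`; the sizes (d_model=512, d_ff=2048, 8 heads) follow the original paper's base configuration but are otherwise illustrative, and details such as dropout and masking are omitted.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention and a position-wise
    feed-forward network, each wrapped in a residual connection + layer norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sub-layer with residual connection and layer normalization
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward sub-layer with residual connection and layer normalization
        x = self.norm2(x + self.ffn(x))
        return x

# One batch of 2 sequences, 10 tokens each, with 512-dimensional embeddings
x = torch.randn(2, 10, 512)
print(EncoderLayer()(x).shape)  # torch.Size([2, 10, 512])
```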