Aspects | LSTM | Transformers |
---|---|---|
Processing Method | Sequential (processes one token at a time) | Parallel (processes all tokens simultaneously) |
Handling Dependencies | Can struggle with long-term dependencies | Excels at capturing long-term dependencies |
Attention Mechanism | No inherent attention, can be added separately | Inherent multi-headed self-attention mechanism |
Efficiency and Scalability | Less efficient and scalable for very long sequences | More efficient and scalable for long sequences and large datasets |
The Transformer architecture is made up of:
- Input Embeddings: Converts the input tokens into continuous vectors that the model can work with.
- Encoder: Consists of multiple layers, each containing a self-attention mechanism and a feed-forward neural network, which process the input text.
- Decoder: Also consists of multiple layers, each containing a self-attention mechanism, an encoder-decoder attention mechanism, and a feed-forward neural network, which generate the output text.
- Self-Attention mechanism: Helps the model understand the relationship between words in a sentence, even if they're far apart.
- Positional Encoding: Added to the input embeddings to give the model a sense of word order in the sequence.
- Layer Normalization: A technique used within the encoder and decoder layers to help stabilise the training process.
- Residual Connection: Used in the encoder and decoder layers to help with gradient flow during training and mitigate the vanishing gradient problem.
- Encoder-Decoder attention mechanism: Used in the decoder to help it focus on relevant parts of the input when generating output.
- Output Linear Layer: Converts the decoder's output into logits for each token in the target vocabulary.
- Softmax Layer: Applied to the logits to produce probabilities for each token in the target vocabulary.
- Self-Attention: Enables the model to weigh the importance of each word relative to every other word in the sequence and understand the relationships between them.
- Scaled dot-product attention: Computes attention scores between words in a sequence through dot products, scaling, and softmax.
- Multi-head attention: Employs multiple attention heads in parallel to capture different aspects of the input, such as syntactic and semantic relationships, for a more comprehensive understanding.
- Positional Encoding: It provides information about the position of words. It involves adding unique vectors (fixed sinusoidal patterns in the original Transformer, or learned embeddings in some variants) to the input embeddings, allowing the model to recognize the order of words and understand the structure of the sentence (a sketch of the sinusoidal variant follows this list).
- Feed Forward Neural Network: It learns complex, non-linear relationships between input embeddings and their context.
- Layer Normalization: It results in improved training stability and faster convergence.
- Encoder: It processes the input text into a continuous representation that captures word relationships and context (a simplified encoder layer is sketched after this list).
- Decoder: It generates the output sequence one token at a time, conditioned on the encoder's representation and the tokens generated so far.
- Score Calculation: score(Q, K) = QKᵀ / √d_k, where d_k is the dimension of the key vectors
- Softmax Normalization: softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
- Weighted Sum: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V (a minimal code sketch of this computation follows the list)
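To make the attention formulas concrete, here is a minimal NumPy sketch of scaled dot-product attention. The toy dimensions (`seq_len`, `d_k`) and the random inputs are illustrative assumptions, not part of any particular library's API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Score calculation: dot products of queries and keys, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    # Softmax normalization over the key dimension (numerically stabilized)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Weighted sum of the value vectors
    return weights @ V                                 # (seq_len, d_k)

# Toy example: a sequence of 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```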
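The positional-encoding sketch below implements the fixed sinusoidal scheme from the original Transformer paper; a learned embedding table is the common alternative mentioned above. The sizes (`max_len=50`, `d_model=512`) are illustrative, and the code assumes an even embedding dimension.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed sinusoidal positional encodings: one d_model-dimensional vector per position."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even indices
    pe[:, 1::2] = np.cos(angles)   # cosine on odd indices
    return pe

# Added element-wise to the input embeddings before the first encoder layer
pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```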
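Finally, a simplified PyTorch sketch of one encoder layer shows how multi-head self-attention, the feed-forward network, residual connections, and layer normalization fit together. It assumes PyTorch's built-in `nn.MultiheadAttention`; the sizes (d_model=512, d_ff=2048, 8 heads) follow the original paper's base configuration but are otherwise illustrative, and details such as dropout and masking are omitted.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention and a position-wise
    feed-forward network, each wrapped in a residual connection + layer norm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sub-layer with residual connection and layer normalization
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward sub-layer with residual connection and layer normalization
        x = self.norm2(x + self.ffn(x))
        return x

# One batch of 2 sequences, 10 tokens each, with 512-dimensional embeddings
x = torch.randn(2, 10, 512)
print(EncoderLayer()(x).shape)  # torch.Size([2, 10, 512])
```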