Understanding The Transformer
February 14, 2026
NLP researchers have been trying to model language for quite some time now. This matters for downstream tasks like translation, text generation, and question answering. Early efforts focused on sequential neural networks like RNNs and LSTMs, but these architectures struggled with long-range dependencies and couldn't be parallelized, making them hard to scale effectively. That changed in 2017, when a new architecture called the Transformer was introduced in the now-seminal paper, Attention Is All You Need.
I spent the last couple days implementing the core architecture from scratch and thought it would be fun to share my understanding here. Please reach out with any thoughts or feedback you have!
The best way to understand how transformers work is to step through the architecture one layer at a time, so let's start with the input. Imagine you have a sentence - The cat is sleeping on the carpet. Our first step is tokenization. This is where you take the input sequence and break it down into tokens — subword units that could be whole words, parts of words, or even single characters. The most popular tokenization method is Byte Pair Encoding.
For ease of understanding, let's assume that the tokens here are the individual words in the sentence. All such unique tokens combine to form the vocabulary of our model, denoted by V. For the most widely used tokenizers like BPE, this vocabulary size can go up to ~50K tokens.
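To make this concrete, here is a minimal word-level tokenization sketch in Python. This is illustrative only (real models use subword tokenizers like BPE), and the variable names are my own:

```python
# Word-level tokenization sketch (illustrative; real models use BPE).
sentence = "The cat is sleeping on the carpet"
tokens = sentence.lower().split()

# Build the vocabulary V: map each unique token to an index.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Assign each token its unique index in the vocabulary.
token_ids = [vocab[tok] for tok in tokens]
```

Note that "the" appears twice in the sentence, so the vocabulary has one fewer entry than the token list.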
Once we have tokenized the input, we can assign each token a unique index based on its position in our vocabulary. The next step is putting these tokens in a form that allows our model to understand their meaning. This representation problem was solved well before the introduction of the transformer with approaches like Word2Vec. The idea is to represent each token as a dense vector in a d-dimensional space that encodes semantic meaning and relationships between words. These representations are learned by training on a large corpus of text. For our transformer, this means taking the input token indices and looking them up in a learned embedding matrix, which gives us the vector representation for all the tokens.
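The embedding lookup can be sketched as below. The sizes are toy assumptions (a vocabulary of 6 and d = 8; real models use something like 50K × 768), and the matrix is randomly initialized here purely for illustration — in a real model it is learned:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 6, 8  # toy sizes for illustration

# Learned embedding matrix: one d_model-dimensional row per vocabulary entry.
embedding = rng.normal(size=(vocab_size, d_model))

# Token indices from the tokenization step (hardcoded here).
token_ids = [5, 1, 2, 4, 3, 5, 0]

# Lookup: fancy indexing pulls out one row per token.
X = embedding[token_ids]  # shape (seq_len, d_model)
```

Both occurrences of the same token index map to the same row, so repeated words get identical embeddings before any position or context information is added.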
While embeddings help the model capture the semantic meaning of words, it is still important to know the positions of those words in a sentence. Understanding positions is important because the closer the words are, the more they influence each other's relative meaning in the sentence. The original paper introduced the idea of absolute positional encoding. This is when we create additional position vectors with the same dimension as our embedding vectors and add both of them up. This gives us a final representation that captures both meaning and positions of all tokens. These position vectors are created using sinusoidal functions. You can think of this as each dimension of the vector having a distinct frequency of the sinusoidal function that gives a unique, periodic value that the model can interpret to discern positions.
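The sinusoidal scheme can be sketched as follows; the helper name is my own, and the formula follows the original paper's PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # Each pair of dimensions gets its own frequency: low indices
    # oscillate fast, high indices slowly.
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_encoding(7, 8)
```

These vectors are simply added to the embedding matrix from the previous step, since both share the same dimension d.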
While this approach is effective, most LLMs today use RoPE, another positional encoding method that applies a rotation (a matrix transformation) to the embedding vectors in the d-dimensional space. This blog from Nikhil Paleti is the perfect deep dive into both of these techniques.
While all of the above steps are effective, what matters the most is the model's ability to understand the context of the input sequence. Imagine you have the word Apple in your input. It could mean the fruit in one context and the tech company in another. Or consider the sentence "The cat sat on the carpet because it was tired" — does "it" refer to the cat or the carpet? Resolving this kind of ambiguity was one of the key challenges for previous approaches to language modeling. The solution came with the introduction of the attention mechanism. This is the crux of the transformer architecture.
Attention
The attention mechanism allows the model to capture relationships between tokens in the input sequence. Let's start with the math behind it. We take the embeddings from the previous steps and multiply them with learned weight matrices. For each input vector, this transformation gives us a query vector, a key vector, and a value vector.
Now we use these vectors to calculate the "attention score" within an attention head. You first split the output to give you distinct Q, K, and V matrices. You then take the query vector for each token and multiply it with the key vector of every other token in the input sequence. Mathematically, this is equivalent to QK^T. This gives you an n × n grid-like score matrix, where n is the sequence length. Since dot products represent the similarity between two vectors, think of this process as the token with the query vector going around asking all the tokens with the key vectors, "how similar am I to you?"
Once we have all the scores, we normalize them by dividing by √d_k (the square root of the key dimension) and then applying a softmax across each row. We divide by √d_k specifically because the dot product of two random d_k-dimensional vectors has a variance that grows with d_k, so scaling by √d_k keeps the variance roughly constant. We then take the normalized attention score matrix and multiply it with our value matrix. This gives us a new output matrix containing a new set of vector representations for all our tokens. This last step re-represents every token as a weighted sum of value vectors, where the weights are the attention scores with all tokens in the sequence, including itself. In a sense, we are allowing all tokens to influence how the model understands each token.
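Putting the score, scaling, softmax, and value-weighting steps together, here is a minimal single-head sketch in NumPy (function names are my own):

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (n, n) similarity grid, scaled
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted sum of value vectors
```

Each output row is a blend of all value vectors, weighted by how strongly that token attends to every token in the sequence.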
In some cases, we use causal attention where we mask the upper triangular half of the attention score matrix. This means that every token will only be able to "attend to" itself and the tokens before it. This is important when we are training these models for next token prediction.
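The causal mask can be sketched as below (the helper name is my own): positions above the diagonal get -inf, so that after the softmax, future tokens receive exactly zero weight.

```python
import numpy as np

def causal_mask(n):
    # Mark the upper triangle (future positions) for masking.
    future = np.triu(np.ones((n, n)), k=1).astype(bool)
    # Adding this to the score matrix before softmax zeroes those weights.
    return np.where(future, -np.inf, 0.0)
```

In practice you simply add `causal_mask(n)` to the raw score matrix before applying the softmax.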
A better intuition to understand this is to think of the word Apple. Let's imagine that before applying attention, our embedding space put it somewhere in between fruits and tech companies due to the mix of representations in the training data. Self-attention computes a context-dependent weighted sum of other tokens' value vectors, producing a new "Apple" vector that, in fruit contexts, lies closer to the fruit region of the space and, in tech contexts, lies closer to the tech-company region. This video offers good intuition on what I'm talking about.
Most implementations prefer multi-headed attention where, instead of one head doing the QKV calculations, we have multiple heads that each focus on a different aspect of relationships between tokens. Each head has its own projections into a lower-dimensional space (d_k = d_model / h, where h is the number of heads), and the outputs from all heads are concatenated back together.
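A multi-head sketch under the same assumptions (fused QKV projection, function and weight names my own):

```python
import numpy as np

def multi_head_attention(X, W_qkv, W_out, n_heads):
    """X: (n, d_model); W_qkv: (d_model, 3*d_model); W_out: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // n_heads

    # One fused projection, then split into Q, K, V.
    Q, K, V = np.split(X @ W_qkv, 3, axis=-1)

    # Reshape to (n_heads, n, d_k): each head works in a smaller subspace.
    heads = lambda M: M.reshape(n, n_heads, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = heads(Q), heads(K), heads(V)

    # Scaled dot-product attention per head, batched over the head axis.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ Vh                                 # (n_heads, n, d_k)

    # Concatenate heads back to (n, d_model) and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(n, d_model)
    return out @ W_out
```

Because each head only sees a d_k-dimensional slice, the total compute stays comparable to a single full-width head while letting heads specialize.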
Completing the architecture
Now that you understand the crux of what makes transformers so capable, we can focus on what makes a complete transformer block. In addition to the attention layer, there are two components here:
- Add and Norm: Before the embeddings enter the attention block, we apply layer normalization, which standardizes the activations to zero mean and unit variance, stabilizing training. (The original paper applied the normalization after each sublayer; most modern implementations use this "pre-norm" placement.) We then form what is called a residual connection, where we take our input embedding matrix and add it to the output of the sublayer. This residual connection helps gradients flow through deep networks and improves the model's performance as the layers get deeper.
- MLP: In every block, after the attention layer we have a feed-forward layer, which is an MLP in most cases. The input layer here has the same dimension as the embedding vector (d_model), but the hidden layer is some multiple of that (typically 4x), followed by an activation function like GELU. This introduces non-linearity and also helps process the contextualized representation from attention independently per token. We apply the same Add and Norm to the MLP layer as we did to the attention layer.
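A pre-norm block combining both components might look like the sketch below. This is illustrative, not a full implementation: the attention sublayer is passed in as a function, the weight names are my own, and the GELU uses the common tanh approximation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standardize each token's activations to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(X, attn_fn, W1, W2):
    # Residual connection around the (pre-normed) attention sublayer.
    X = X + attn_fn(layer_norm(X))
    # Residual connection around the (pre-normed) MLP sublayer;
    # W1 expands to the hidden size, W2 projects back to d_model.
    X = X + gelu(layer_norm(X) @ W1) @ W2
    return X
```

Stacking this block N times, each time feeding the output back in as the next input, gives the body of the model.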
The final part of the architecture is a linear layer that takes the output from the last transformer block and projects it onto a vector of some defined dimension. For classification tasks, that could be num_classes; for next-token prediction, it is the vocabulary size |V|.
Training
During training, we give the model an input sequence and compare the final-layer outputs to the ground-truth next tokens via a loss function. In the case of next-token prediction, this is the cross-entropy loss over the softmaxed logits from the final linear layer. This loss is then backpropagated through all the weights, including the embedding matrix, the attention layers, and the feed-forward MLP layers.
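The loss computation can be sketched as below (the function name is my own; the log-softmax is folded in for numerical stability rather than softmaxing and then taking a log):

```python
import numpy as np

def cross_entropy_loss(logits, targets):
    """logits: (n, vocab_size); targets: (n,) ground-truth next-token ids."""
    # Log-softmax, computed stably by subtracting the row max first.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative log-probability of the correct token, averaged over positions.
    return -log_probs[np.arange(len(targets)), targets].mean()
```

When the model puts nearly all its probability mass on the correct tokens, this loss approaches zero; gradient descent on it drives every parameter, from the embedding matrix up through the final linear layer.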