Deep Dive into Andrej Karpathy's microGPT: Building a Transformer from Scratch
By Nino, Senior Tech Editor
Understanding the inner workings of Large Language Models (LLMs) often feels like peering into a black box. However, Andrej Karpathy’s microGPT project strips away the complexity of modern frameworks like PyTorch or JAX to reveal the raw logic of the Transformer architecture. This guide provides a comprehensive analysis of the microGPT architecture, from its custom autograd engine to its character-level generation logic. When moving from these educational micro-models to production-grade systems like those found on n1n.ai, having a firm grasp of these fundamentals is essential for debugging and optimization.
1. Data Preprocessing and Character-Level Tokenization
Unlike production models such as Claude 3.5 Sonnet or OpenAI o3, which use sophisticated Byte Pair Encoding (BPE), microGPT utilizes the simplest possible tokenizer: character-level mapping. The process begins by ensuring an input.txt file exists, typically a dataset of names or Shakespearean text. Each line is treated as an individual document.
The Vocabulary Logic
The model identifies every unique character in the dataset to build its vocabulary. A special BOS (Beginning of Sequence) token is appended to the vocabulary. This token is multi-functional: it signals the start of a sequence during generation and acts as a stop signal when sampled as an output.
```python
uchars = sorted(set(''.join(docs)))  # every unique character in the dataset
BOS = len(uchars)                    # the BOS token gets the next free id
vocab_size = len(uchars) + 1
```
For example, the name "emma" is transformed into [BOS, e, m, m, a, BOS]. This explicit framing teaches the model not just the structure of the word, but also where it begins and ends. In the context of n1n.ai, understanding tokenization is critical because it directly impacts pricing and context window management for high-performance APIs.
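The framing above can be sketched end to end. This is a minimal, runnable illustration of the idea, using a made-up three-name dataset; the helper names `stoi`, `itos`, and `encode` are illustrative and may differ from the actual repo:

```python
docs = ["emma", "olivia", "ava"]  # hypothetical toy dataset of names

uchars = sorted(set(''.join(docs)))      # unique characters form the vocabulary
BOS = len(uchars)                        # BOS token gets the next free id
vocab_size = len(uchars) + 1

stoi = {ch: i for i, ch in enumerate(uchars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

def encode(doc):
    """Frame a document with BOS on both ends, as described above."""
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

tokens = encode("emma")  # [BOS, e, m, m, a, BOS] as token ids
```

Decoding the inner ids with `itos` recovers the original string, which makes the round trip easy to verify in a unit test.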
2. The Embedding Layer: Identity and Position
In microGPT, every token ID is converted into a 16-dimensional vector. However, a single vector isn't enough; the model needs to know what the character is and where it is located. This is achieved through two distinct embedding tables:
- wte (Token Embedding Table): Encodes the identity of the character. "e" always starts with the same base vector.
- wpe (Position Embedding Table): Encodes the sequence index (0 to block_size - 1). This provides the spatial context necessary for the Transformer to distinguish between an "e" at the start of a word versus one at the end.
These two vectors are added element-wise: x = [t + p for t, p in zip(tok_emb, pos_emb)]
This combined vector carries the full identity and positional context into the Transformer blocks. Without wpe, the model would be a "bag of words" model, unable to perceive the order of characters.
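The two-table lookup and element-wise sum can be sketched in a few lines. This is a hedged illustration: the table names `wte`/`wpe` follow the article, but the random initialization scale and the `embed` helper are assumptions for demonstration only:

```python
import random

# Assumed dimensions from the article: n_embd = 16, block_size = 10.
n_embd, block_size, vocab_size = 16, 10, 8

random.seed(0)
# Small random init (scale 0.02 is a common convention, not confirmed here).
wte = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]
wpe = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(block_size)]

def embed(token_id, pos):
    tok_emb = wte[token_id]   # identity: which character this is
    pos_emb = wpe[pos]        # location: where in the sequence it sits
    return [t + p for t, p in zip(tok_emb, pos_emb)]

x = embed(3, 0)  # 16-dimensional vector combining identity and position
```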
3. The Core Engine: The Value Class and Autograd
The most remarkable aspect of microGPT is the Value class. It is a minimal replacement for PyTorch’s entire autograd system. Every scalar in the model—weights, biases, and intermediate activations—is wrapped in a Value object.
The Anatomy of a Value
```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # the scalar value
        self.grad = 0                    # gradient accumulated via chain rule
        self._children = children        # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def backward(self):
        # Reverse topological sort to apply the chain rule
        ...
```
This scalar-level processing means the model does not use optimized tensor operations (matrix multiplications). Instead, it performs explicit Python loops over Value objects. While this is computationally expensive and slow, it is educationally transparent. It allows developers to see exactly how gradients flow from the loss function back to the initial embeddings.
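To make the gradient flow concrete, here is a minimal runnable sketch that fills in the `backward` method using the same `(data, children, local_grads)` layout. The real microGPT class supports many more operators; only `+` and `*` are shown here:

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build a topological order of the computation graph...
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        # ...then walk it in reverse, applying the chain rule at each node.
        self.grad = 1.0
        for v in reversed(topo):
            for child, local in zip(v._children, v._local_grads):
                child.grad += local * v.grad

a, b = Value(2.0), Value(3.0)
out = a * b + a   # d(out)/da = b + 1 = 4, d(out)/db = a = 2
out.backward()
```

Running `backward` accumulates exactly the derivatives the chain rule predicts, which is the transparency the article highlights.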
4. The Transformer Block: Pre-Norm and RMSNorm
microGPT follows the modern "Pre-Norm" design. RMSNorm (Root Mean Square Layer Normalization) is applied before each sublayer (Attention and MLP). This ensures that the values remain in a stable range, preventing the vanishing or exploding gradient problems that plagued early deep networks.
Pro Tip: In this implementation, RMSNorm has no learnable parameters (no scale or shift). It is a purely mathematical normalization: x / sqrt(mean(x²) + ε).
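The parameter-free normalization from the Pro Tip fits in a few lines. A sketch, assuming the conventional small epsilon value:

```python
import math

def rmsnorm(x, eps=1e-5):
    """Parameter-free RMSNorm: x / sqrt(mean(x^2) + eps)."""
    ms = sum(v * v for v in x) / len(x)   # mean of squares
    scale = 1.0 / math.sqrt(ms + eps)
    return [v * scale for v in x]

y = rmsnorm([1.0, -2.0, 3.0, -4.0])
```

After normalization the root mean square of the vector is approximately 1, regardless of the input scale, which is what keeps activations in a stable range across layers.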
Multi-Head Attention (MHA) without Tensors
In microGPT, attention is calculated using scalar arithmetic. For a model with n_embd = 16 and n_head = 4, each head handles a 4-dimensional slice. Causality is enforced structurally: because the model processes tokens one by one and stores them in a KV (Key-Value) cache, it cannot "see" future tokens; they simply have not been computed yet.
```python
# Scalar-based attention score calculation: dot product of the query
# with every cached key for this head
attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim))
               for t in range(len(keys[li]))]
```
5. Hyperparameters and Model Capacity
The capacity of microGPT is governed by four primary hyperparameters:
| Hyperparameter | Value | Description |
|---|---|---|
| n_embd | 16 | Width of the vector representations |
| n_head | 4 | Number of independent attention heads |
| n_layer | 1 | Number of Transformer blocks (depth) |
| block_size | 10 | Maximum context window (sequence length) |
| Total Params | ~4,192 | Total learnable scalar values |
Compared to production models like DeepSeek-V3 (hundreds of billions of parameters), microGPT is microscopic. However, the underlying algorithm—calculating queries, keys, and values—is identical. Developers looking for high-speed, scalable versions of these architectures often turn to n1n.ai to access optimized LLM APIs that handle the infrastructure heavy lifting.
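A back-of-the-envelope count shows how those hyperparameters translate into parameters. This sketch assumes a 26-letter names vocabulary plus BOS, no biases, a 4x-wide MLP, and an untied output head; the exact layer layout in microGPT may differ, which is why it lands near, but not exactly at, the article's ~4,192:

```python
def count_params(vocab_size, n_embd, n_layer, block_size):
    wte = vocab_size * n_embd            # token embedding table
    wpe = block_size * n_embd            # position embedding table
    attn = 4 * n_embd * n_embd           # q, k, v, and output projections
    mlp = 2 * n_embd * (4 * n_embd)      # up- and down-projection (4x width)
    lm_head = n_embd * vocab_size        # final projection to logits
    return wte + wpe + n_layer * (attn + mlp) + lm_head

total = count_params(vocab_size=27, n_embd=16, n_layer=1, block_size=10)
```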
6. Training with the Adam Optimizer
The training loop implements next-token prediction. If the input is "J", the target is "e". The loss is calculated using the negative log-likelihood of the correct character’s probability.
```python
loss_t = -probs[target_id].log()  # negative log-likelihood of the correct character
```
The Adam optimizer then updates the weights. Adam is used here because it maintains moving averages of the gradients (moments), which helps smooth out the learning process in the absence of large batch sizes. The learning rate follows a linear decay, starting at 0.01 and dropping to zero over the course of training.
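One Adam update with the linear decay described above can be sketched on a toy scalar loss. The beta and epsilon values are the common Adam defaults, assumed here rather than taken from the repo:

```python
import math

def adam_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment (moving mean of grads)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (moving mean of grad^2)
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

num_steps, base_lr = 1000, 0.01
w, m, v = 0.5, 0.0, 0.0
for t in range(1, num_steps + 1):
    lr = base_lr * (1 - (t - 1) / num_steps)    # linear decay from 0.01 to zero
    grad = 2 * w                                # gradient of a toy loss, w^2
    w, m, v = adam_step(w, grad, m, v, t, lr)
```

The moving averages keep each step close to a fixed size even when individual gradients are noisy, which is exactly the smoothing benefit the article describes.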
7. Inference and the Role of Temperature
During generation, the model uses its own output as the next input—a process called autoregressive generation. The temperature parameter is applied to the logits before the softmax function:
- Low Temperature (e.g., 0.5): Sharpens the distribution, making the model more conservative and likely to pick the highest-probability token.
- High Temperature (e.g., 1.5): Flattens the distribution, introducing more randomness and "creativity."
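The effect of both settings is easy to see by dividing the logits by the temperature before the softmax. The logit values below are made up for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.5)   # sharper: top token dominates
hot = softmax_with_temperature(logits, 1.5)    # flatter: more randomness
```

The same logits yield a much higher probability for the top token at temperature 0.5 than at 1.5, which is why low temperatures produce conservative output and high temperatures produce varied output.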
8. Why microGPT Fails on Shakespeare
When trained on Shakespeare, microGPT learns basic word structures ("the", "and") and punctuation but fails at long-range coherence. There are three structural reasons for this:
- Context Limitation: A block_size of 10 means the model only ever sees 10 characters of context.
- Lack of Continuity: Each line is treated as an independent document, so the model never learns how one sentence relates to the next.
- Low Capacity: With only 1 layer and 16-dimensional embeddings, the model lacks the "memory" to store complex stylistic patterns.
Conclusion
microGPT is a masterclass in minimalist AI engineering. By removing the abstractions of modern libraries, it reveals the elegance of the Transformer. For developers moving beyond experiments and into production, managing these complexities requires robust tools. Whether you are implementing RAG (Retrieval-Augmented Generation) or fine-tuning models for specific enterprise needs, the stability and speed of your API provider are paramount.
For those ready to scale their AI applications with production-grade reliability and top-tier performance benchmarks, n1n.ai offers a unified gateway to the world's most powerful LLMs.
Get a free API key at n1n.ai