Andrej Karpathy is a GOAT, and we all know that. He posted a banger on X a few days back: a pure, dependency-free Python implementation of GPT.
Here's his implementation: microgpt.py. It is a minimal transformer language model built entirely from scratch with no external ML dependencies.
I found it very interesting and tried to implement it on my own; this project is the result. It is a Java implementation of Karpathy's MicroGPT with a few additions.
The project trains a character-level language model on a dataset of names and generates new, plausible-sounding names. More importantly, I tried to engineer it as a step-by-step educational journey, starting from the simplest statistical baseline and progressively building up to a full GPT-style transformer.
Each step introduces one key idea on top of the previous:
| Step | Model | Key Concept |
|---|---|---|
| 1 | Statistical Bigram | Count character pairs, Laplace smoothing, probability sampling |
| 2 | Neural Bigram (manual gradients) | Logits, softmax, cross-entropy, SGD |
| 3 | Neural Bigram (autograd) | Computational graph, chain rule, automatic differentiation |
| 4 | MLP Language Model | Embeddings, context window, hidden layer |
| 5 | GPT (single-head attention) | Self-attention, causal mask, residual connections, RMSNorm |
| 6 | GPT (multi-head attention) | Multi-head attention, Adam optimizer, LR decay |
After 1000 training steps on ~32k names:
Step 1000 / 1000 | Loss: 2.37 | LR: 0.000010
--- Samples ---
Sample 1: phara
Sample 2: laaran
Sample 3: elaned
Sample 4: maret
Sample 5: aiara
Sample 6: malenan
Sample 7: odanini
Sample 8: tely
Sample 9: keba
Sample 10: ena
Hyperparameters (matching Karpathy's gist):
- Vocabulary size: 27 (a–z + BOS)
- Embedding dimension: 16
- Context window (block size): 16
- Attention heads: 4
- Transformer layers: 1
- Total parameters: ~4,256
Prerequisites: Java 17+, Maven
git clone https://github.com/your-username/microgpt-java
cd microgpt-java
mvn compile
mvn exec:java -Dexec.mainClass="com.anirudhology.microgpt.Runner"

names.txt is downloaded automatically on first run from Karpathy's makemore dataset.
- No external ML dependencies — everything including autograd, embeddings, attention, and Adam is implemented from scratch using only Java standard library
- No bias in Linear layers — matches modern transformer practice (LLaMA, Mistral); RMSNorm handles the shift
- Pre-Norm style — RMSNorm applied before each sub-layer, more stable than the original Post-Norm transformer
- Learnable gamma in RMSNorm — starts at 1.0 (identity), only diverges if the optimizer finds it useful
- `Attention` interface — allows `TransformerBlock` to use either `CausalSelfAttention` or `MultiHeadCausalSelfAttention` transparently, making the single → multi-head progression visible in code
This document walks through every concept in the project in the order they appear, from the simplest statistical baseline to the full transformer. Each section explains the what, the why, and the how.
- Problem Statement: Language Modeling
- Tokenization
- Data Loading & Dataset Building
- Step 1 — Statistical Bigram Model
- Step 2 — Neural Bigram (Manual Gradients)
- Step 3 — Automatic Differentiation (Autograd)
- Step 4 — MLP Language Model
- Step 5 — GPT Transformer (Single-Head)
- Step 6 — GPT Transformer (Multi-Head)
- Optimization: Adam
- Design Decisions
The goal of this project is simple: given a sequence of characters, predict the next character.
We train on a list of names (e.g. emma, olivia, noah). After training, the model generates new names that look
plausible. Thus, the model learns statistical patterns of how characters follow each other in names.
Why names? They are short, structured enough to show real learning, but simple enough to train from scratch in minutes.
Autoregressive generation: at inference time, we feed the model a context, sample the next character, append it to the context, and repeat, generating one character at a time. It is called autoregressive because each output becomes part of the next input.
Loss metric: Negative Log-Likelihood (NLL):
- Perfect prediction $(P = 1.0)$ → loss $= 0$
- Bad prediction $(P = 0.1)$ → loss $= 2.30$
- Random guess over 27 chars → loss $\approx 3.30$
Lower loss means better model. This is our compass throughout.
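The numbers above follow directly from $-\log P$. A tiny sketch (the class name is illustrative, not part of the project):

```java
// Hypothetical helper reproducing the NLL values quoted above.
public class NllDemo {
    // Negative log-likelihood of the probability assigned to the correct token
    static double nll(double p) {
        return -Math.log(p);
    }

    public static void main(String[] args) {
        System.out.printf("P=1.0  -> %.2f%n", nll(1.0));      // 0.00
        System.out.printf("P=0.1  -> %.2f%n", nll(0.1));      // 2.30
        System.out.printf("P=1/27 -> %.2f%n", nll(1.0 / 27)); // 3.30
    }
}
```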
File: tokenizer/CharacterTokenizer.java
We use character-level tokenization where every unique character in the dataset becomes a token with an integer ID.
a=0, b=1, c=2, ..., z=25, <BOS>=26
BOS (Beginning of Sequence) is a special token with the highest ID (vocabSize - 1 = 26). It serves two purposes:
- As the starting signal when generating: the model begins with BOS and predicts the first character
- As the end signal: the model learns to emit BOS when the name is complete
withBOSOnBothSides("emma") → [26, 4, 12, 12, 0, 26]
This surrounds each name with BOS so the model learns both how names start and how they end.
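The mapping can be sketched in a few lines, assuming names contain only lowercase a–z (the real class is tokenizer/CharacterTokenizer.java; this standalone version is illustrative):

```java
import java.util.Arrays;

// Minimal sketch of the character tokenizer described above.
public class TinyTokenizer {
    static final int BOS = 26; // vocabSize - 1

    static int[] withBOSOnBothSides(String name) {
        int[] ids = new int[name.length() + 2];
        ids[0] = BOS;
        for (int i = 0; i < name.length(); i++) {
            ids[i + 1] = name.charAt(i) - 'a'; // a=0 ... z=25
        }
        ids[ids.length - 1] = BOS;
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(withBOSOnBothSides("emma")));
        // [26, 4, 12, 12, 0, 26]
    }
}
```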
Files: data/TextCorpus.java, data/NGramDatasetBuilder.java, data/TrainingExample.java
TextCorpus downloads names.txt from Karpathy's makemore repo if not present, then reads, trims, and shuffles all
names.
NGramDatasetBuilder turns the documents into a flat list of (context, target) pairs using a sliding window:
Name: "emma" with BOS on both sides → [26, 4, 12, 12, 0, 26]
context=[26, 26, 26], target=4 (predict 'e' from BOS padding)
context=[26, 26, 4], target=12 (predict 'm' from "..e")
context=[26, 4, 12], target=12 (predict 'm' from ".em")
context=[ 4, 12, 12], target=0 (predict 'a' from "emm")
context=[12, 12, 0], target=26 (predict BOS=end from "mma")
TrainingExample is a simple record:
public record TrainingExample(int[] context, int target) {
}

File: model/BaselineBigramModel.java
This class represents the simplest possible language model where we count how often each character follows each other character, then normalize.
counts[i][j] = how many times character j follows character i
After seeing all names, we get a 27×27 table of co-occurrence counts.
Without smoothing, unseen character pairs would have probability 0, causing an infinite loss ($-\log 0 = \infty$) the first time one appears. Laplace smoothing adds 1 to every count before normalizing.
This ensures every transition has at least some probability, acting as a prior that all transitions are possible.
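The counting and smoothing steps can be sketched as follows (the real class is model/BaselineBigramModel.java; this standalone version is illustrative):

```java
import java.util.Arrays;

// Sketch of bigram counting with Laplace smoothing and row normalization.
public class BigramCounts {
    static double[][] probabilities(int[][] names, int vocabSize) {
        double[][] counts = new double[vocabSize][vocabSize];
        // Laplace smoothing: start every count at 1, not 0
        for (double[] row : counts) Arrays.fill(row, 1.0);
        // Count how often token j follows token i
        for (int[] name : names)
            for (int t = 0; t + 1 < name.length; t++)
                counts[name[t]][name[t + 1]] += 1.0;
        // Normalize each row into a probability distribution
        for (double[] row : counts) {
            double sum = 0.0;
            for (double c : row) sum += c;
            for (int j = 0; j < row.length; j++) row[j] /= sum;
        }
        return counts;
    }
}
```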
To generate a character, we sample from a probability distribution using the CDF (Cumulative Distribution Function) trick:
private int sampleFromDistribution(double[] probabilities) {
// Random number in [0, 1)
double r = this.random.nextDouble();
// Cumulative distribution function
double cdf = 0.0;
for (int i = 0; i < probabilities.length; i++) {
cdf += probabilities[i];
if (r <= cdf) {
return i;
}
}
// Fallback for numerical precision
return probabilities.length - 1;
}

This correctly samples each character proportional to its probability.
This model achieves ~2.45 NLL. Any neural model that can't beat this isn't learning anything useful.
File: model/NeuralBigramModel.java
The prediction task is the same, but instead of a count table we use a learned weight matrix trained with gradient descent.
The weight matrix $W$ is $27 \times 27$: row $i$ holds the logits (unnormalized scores) for the character that follows character $i$.
We convert logits to a valid probability distribution with softmax:

$$p_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

Properties:
- All outputs are in $(0, 1)$
- Outputs sum to $1$
- Higher logit → higher probability
Numerical stability: subtract the maximum logit before exponentiating.
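The max-subtraction trick looks like this on plain doubles (the real model operates on autograd Value objects, but the math is identical):

```java
// Numerically stable softmax: subtract the max logit before exponentiating.
public class Softmax {
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double z : logits) max = Math.max(max, z);
        double sum = 0.0;
        double[] probs = new double[logits.length];
        for (int i = 0; i < logits.length; i++) {
            // Subtracting max keeps exp() from overflowing; the result is unchanged
            probs[i] = Math.exp(logits[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }
}
```

Without the subtraction, `Math.exp(1000.0)` overflows to infinity and the output becomes NaN.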
Combined with softmax, this is called cross-entropy loss. It penalizes the model heavily when it assigns low probability to the correct answer.
For softmax + cross-entropy, the gradient has a closed-form:
- For the correct character: gradient $= p - 1$ (negative → increase this logit)
- For all other characters: gradient $= p$ (positive → decrease these logits)
We subtract because we want to go in the direction that decreases the loss (gradient descent).
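The closed form is just "copy the probabilities, subtract 1 at the target". A sketch, assuming the probabilities already sum to 1:

```java
// Closed-form gradient of softmax + cross-entropy with respect to the logits.
public class CrossEntropyGrad {
    static double[] gradient(double[] probs, int target) {
        double[] grad = probs.clone();
        grad[target] -= 1.0; // correct class: p - 1 (negative -> push logit up)
        return grad;         // other classes: p (positive -> push logit down)
    }
}
```

Note that the gradient components always sum to zero: pushing the correct logit up is exactly balanced by pushing the others down.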
At inference time, we divide logits by a temperature $T$ before applying softmax:
- $T < 1$ (e.g. 0.5): sharper distribution, more confident, less random
- $T = 1$: unmodified
- $T > 1$: flatter distribution, more random/creative
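The effect is easy to see in code. A sketch combining temperature scaling with a stable softmax (class and method names are illustrative):

```java
// Temperature scaling sketch: divide logits by T, then softmax.
// T < 1 sharpens the distribution, T > 1 flattens it.
public class Temperature {
    static double[] softmaxWithTemperature(double[] logits, double t) {
        double max = Double.NEGATIVE_INFINITY;
        for (double z : logits) max = Math.max(max, z / t);
        double[] p = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            p[i] = Math.exp(logits[i] / t - max); // stable: subtract max
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) p[i] /= sum;
        return p;
    }
}
```

For logits `{2, 1}`, the probability of the top token grows as T shrinks and falls toward uniform as T grows.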
File: autograd/Value.java
Computing gradients by hand works for simple models, but becomes impractical as models grow. We need a system that computes gradients automatically.
Every operation creates a computation graph. It is a directed acyclic graph where:
- Nodes are scalar values (`Value` objects)
- Edges connect each result to its inputs

Every Value stores:

- `data` — the forward-pass result
- `gradient` — accumulated gradient from the backward pass
- `children` — the `Value` objects this was computed from
- `backwardFn` — a lambda that computes how to propagate gradient to children
Backpropagation is just the chain rule applied systematically:
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$

where $y$ is any value computed from $x$ and $L$ is the final loss.
| Operation | Forward | Local gradient for $a$ | Local gradient for $b$ |
|---|---|---|---|
| $a + b$ | $a + b$ | $1$ | $1$ |
| $a \times b$ | $ab$ | $b$ | $a$ |
| $\tanh(a)$ | $\tanh(a)$ | $1 - \tanh^2(a)$ | — |
| $\exp(a)$ | $e^a$ | $e^a$ | — |
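A stripped-down Value showing one operation (multiply) and its backward rule makes the mechanism concrete; the real autograd/Value.java supports many more operations:

```java
import java.util.List;

// Minimal autograd sketch: one op (multiply) with its backward rule.
public class TinyValue {
    double data, gradient;
    Runnable backwardFn = () -> {};
    List<TinyValue> children = List.of();

    TinyValue(double data) { this.data = data; }

    TinyValue multiply(TinyValue other) {
        TinyValue out = new TinyValue(this.data * other.data);
        out.children = List.of(this, other);
        // d(a*b)/da = b, d(a*b)/db = a; chain rule scales by out.gradient.
        // Note += : gradients accumulate when a Value is reused.
        out.backwardFn = () -> {
            this.gradient += other.data * out.gradient;
            other.gradient += this.data * out.gradient;
        };
        return out;
    }
}
```

Seeding `out.gradient = 1.0` and running `out.backwardFn` deposits each input's local gradient, exactly as in the table above.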
backward() runs the chain rule through the entire graph:
public void backward() {
// 1. Build topological order (children before parents)
List<Value> topo = topologicalSort();
// 2. Seed the gradient at the loss node
this.gradient = 1.0;
// 3. Walk backwards, applying each node's backwardFn
for (Value v : reversed(topo)) {
v.backwardFn.run();
}
}

Topological sort ensures we always process a node after all nodes that depend on it have already propagated their gradient contributions.
Gradients are accumulated (`+=`) because a Value may be used in multiple operations. Before each training step, all gradients must be zeroed.
Files: model/MLPLanguageModel.java, nn/Embedding.java, nn/PositionalEmbedding.java, nn/Linear.java
The bigram only looks at one previous character. We want to look at multiple previous characters at once — a context window of blockSize tokens.
File: nn/Embedding.java
An embedding table maps each token ID to a dense vector of floats:
Embedding[vocabularySize][embeddingDimension]
Instead of a one-hot vector (sparse, 27 dimensions), each token gets a 10-dimensional dense vector that the model learns to place in a meaningful space. Semantically similar characters end up near each other.
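An embedding lookup is just row selection from a learned table. A sketch (the real class is nn/Embedding.java; the seed and init scale here are illustrative):

```java
import java.util.Random;

// Embedding sketch: token ID -> learned dense vector (one row of a table).
public class EmbeddingSketch {
    final double[][] table; // [vocabularySize][embeddingDimension]

    EmbeddingSketch(int vocabularySize, int embeddingDimension) {
        table = new double[vocabularySize][embeddingDimension];
        Random rng = new Random(42);
        for (double[] row : table)
            for (int j = 0; j < row.length; j++)
                row[j] = rng.nextGaussian() * 0.1; // small random init
    }

    double[] forward(int tokenId) {
        return table[tokenId]; // lookup: no matrix multiply needed
    }
}
```

During training, gradients flow only into the rows that were actually looked up, so frequent characters get updated most often.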
File: nn/PositionalEmbedding.java
A second lookup table maps each position in the context to its own learned vector:
PositionalEmbedding[blockSize][embeddingDimension]
This tells the model where in the context each character appears, since position matters ("a" at position 0 vs position 2 carry different information).
File: nn/Linear.java
A fully-connected layer:
public Value[] forward(Value[] input) {
Value[] output = new Value[this.outputDimension];
// For each output neuron
for (int j = 0; j < this.outputDimension; j++) {
// Start from zero (no bias term; see Design Decisions)
Value sum = new Value(0.0);
// Add weighted inputs: sum = x₀*w₀ⱼ + x₁*w₁ⱼ + ...
for (int i = 0; i < this.inputDimension; i++) {
sum = sum.add(input[i].multiply(this.weights[i][j]));
}
output[j] = sum;
}
return output;
}

Weights are initialized with Xavier/Glorot initialization, which scales the random weights by a factor on the order of $1/\sqrt{n_{\text{in}}}$.
This keeps the variance of activations roughly constant through layers, preventing vanishing or exploding gradients.
No bias: Modern transformers omit bias terms. Normalization layers (RMSNorm) already handle the shift, and biases add parameters without much benefit.
context [blockSize] token IDs
↓ (token embedding lookup)
tokenEmbeddings [blockSize × embDim]
↓ (+ positional embeddings, element-wise)
combinedEmbeddings [blockSize × embDim]
↓ (flatten)
x [blockSize * embDim]
↓ (Linear → tanh)
hidden [hiddenDimension]
↓ (Linear)
logits [vocabularySize]
↓ (softmax → -log → loss)
The hidden layer uses tanh (hyperbolic tangent):
- Output range: $(-1, 1)$
- Smooth, differentiable, zero-centred
- Squashes large values, preventing the hidden layer from growing unboundedly
Files: nn/CausalSelfAttention.java, nn/TransformerBlock.java, model/GPTLanguageModel.java
The MLP mixes all context positions together by flattening. Attention is different because it lets each position selectively focus on other positions.
File: nn/RMSNormalization.java
Root Mean Square Normalization stabilizes training by normalizing activations:

$$\text{RMSNorm}(x)_i = \gamma_i \cdot \frac{x_i}{\sqrt{\tfrac{1}{n}\sum_{j} x_j^2 + \varepsilon}}$$
- $\varepsilon = 10^{-5}$ prevents division by zero
- $\gamma$ is a learnable scale, initialized to $1.0$ (identity at the start)
- No mean subtraction (unlike LayerNorm) — simpler, used in LLaMA, Mistral
Applied before each sub-layer (Pre-Norm style), which is more stable than Post-Norm.
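The formula translates directly to code. A sketch on plain doubles (the real class is nn/RMSNormalization.java, which works on autograd Values):

```java
// RMSNorm sketch: rescale x so its root-mean-square is ~1, then apply gamma.
public class RmsNorm {
    static double[] normalize(double[] x, double[] gamma) {
        double meanSquare = 0.0;
        for (double v : x) meanSquare += v * v;
        meanSquare /= x.length;
        double scale = 1.0 / Math.sqrt(meanSquare + 1e-5); // eps avoids div by 0
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++)
            out[i] = x[i] * scale * gamma[i];
        return out;
    }
}
```

With gamma fixed at 1.0 the output's mean square is (up to epsilon) exactly 1, regardless of the input's magnitude.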
File: nn/CausalSelfAttention.java
Attention lets each position query all previous positions and ask: which positions are most relevant to me right now?
Step 1: Project to Q, K, V
Each input vector is linearly projected to three vectors:
- Query (Q): "what am I looking for?"
- Key (K): "what do I contain?"
- Value (V): "what will I contribute if attended to?"
Step 2: Compute attention scores
The score between query position $i$ and key position $j$ is a scaled dot product:

$$\text{score}(i, j) = \frac{Q_i \cdot K_j}{\sqrt{d_{\text{head}}}}$$

Dividing by $\sqrt{d_{\text{head}}}$ keeps the scores from growing with dimension, which would saturate the softmax.
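The scaled dot product is a one-liner on plain doubles (the real nn/CausalSelfAttention.java works on autograd Values; the class name here is illustrative):

```java
// Scaled dot-product attention score between one query and one key vector.
public class AttentionScores {
    static double score(double[] q, double[] k) {
        double dot = 0.0;
        for (int d = 0; d < q.length; d++) dot += q[d] * k[d];
        return dot / Math.sqrt(q.length); // scale by sqrt(headDim)
    }
}
```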
Step 3: Causal mask
Language modeling requires that position $i$ attends only to positions $\le i$ — the model must never see the future:
private void applyCausalMask(Value[][] scores) {
final int sequenceLength = scores.length;
for (int i = 0; i < sequenceLength; i++) { // Query position
for (int j = i + 1; j < sequenceLength; j++) { // Future key position
// Replace with -infinity
// exp(-∞) = 0 after softmax → zero attention to future!
scores[i][j] = new Value(Double.NEGATIVE_INFINITY);
}
}
}

Step 4: Softmax + weighted sum
Each output is a weighted average of all (visible) value vectors, where the weights come from how relevant each position's key was to the current query.
Step 5: Output projection
Projects back to the original embedding dimension.
File: nn/TransformerBlock.java
A full transformer block has two sub-layers, each with a residual connection:

$$x = x + \text{Attention}(\text{RMSNorm}(x))$$
$$x = x + \text{MLP}(\text{RMSNorm}(x))$$
Residual connections (skip connections) allow gradients to flow directly through the network without passing through every layer. This solves the vanishing gradient problem and makes very deep networks trainable.
MLP inside the block (position-wise feedforward):

$$\text{MLP}(x) = W_2\, \phi(W_1 x)$$

where $W_1$ expands from embDim to $4 \times$ embDim, $\phi$ is the nonlinearity, and $W_2$ projects back down to embDim.
The 4× expansion gives the network capacity to learn complex per-position transformations.
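The shape of the pre-norm block can be sketched with the sub-layers abstracted as functions — a structural sketch, not the real nn/TransformerBlock.java:

```java
import java.util.function.UnaryOperator;

// Structural sketch of a pre-norm transformer block for one position vector.
public class PreNormBlock {
    static double[] add(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
        return out;
    }

    static double[] forward(double[] x,
                            UnaryOperator<double[]> norm,
                            UnaryOperator<double[]> attention,
                            UnaryOperator<double[]> mlp) {
        x = add(x, attention.apply(norm.apply(x))); // residual around attention
        x = add(x, mlp.apply(norm.apply(x)));       // residual around MLP
        return x;
    }
}
```

Because each sub-layer's output is added to its input, the identity path is always available: even an untrained block passes information (and gradients) straight through.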
File: model/GPTLanguageModel.java
tokens [blockSize]
↓ token embedding + positional embedding
x [seqLen × embDim]
↓ RMSNorm
x [seqLen × embDim]
↓ N × TransformerBlock
x [seqLen × embDim]
↓ RMSNorm (final)
x [seqLen × embDim]
↓ Linear (output head): x[last_position]
logits [vocabularySize]
Only the last position's output is used to predict the next token — it has attended to all previous positions and aggregates the full context.
Files: nn/MultiHeadCausalSelfAttention.java, nn/Attention.java
A single attention head looks at all positions through one "lens". Multi-head attention runs several independent heads in parallel, each attending to a different subspace of the embedding.
File: nn/Attention.java
Both attention types implement a common interface:
public interface Attention {
Value[][] forward(Value[][] input);
List<Value> parameters();
}

TransformerBlock holds an Attention reference, and picks the implementation based on a useMultiHead flag — allowing the two approaches to be compared directly.
$$\text{MultiHead}(x) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\, W_O \quad \text{where} \quad \text{head}_h = \text{Attention}(Q_h, K_h, V_h)$$
input [seqLen × embDim]
↓ W_q, W_k, W_v projections
Q, K, V [seqLen × embDim]
↓ split into H slices of headDim = embDim / H
For each head h:
Q_h, K_h, V_h [seqLen × headDim]
→ single-head attention → headOut_h [seqLen × headDim]
↓ concatenate all heads
concatenated [seqLen × embDim]
↓ W_o output projection
output [seqLen × embDim]
With embDim = 16 and $H = 4$ heads, each head works in a headDim = 4 subspace.
Each head can specialize:
- One head might track syntactic patterns
- Another might track positional proximity
- Another might track semantic similarity
Crucially, the parameter count is identical to single-head attention — multi-head just uses the same parameters more efficiently by attending to multiple subspaces simultaneously.
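The split/concatenate plumbing is simple array slicing. A sketch (the real class is nn/MultiHeadCausalSelfAttention.java; this standalone version is illustrative):

```java
import java.util.Arrays;

// Head-splitting sketch: slice an embDim vector into H contiguous headDim
// pieces, and concatenate head outputs back into a single vector.
public class HeadSplit {
    static double[][] split(double[] x, int numHeads) {
        int headDim = x.length / numHeads;
        double[][] heads = new double[numHeads][headDim];
        for (int h = 0; h < numHeads; h++)
            System.arraycopy(x, h * headDim, heads[h], 0, headDim);
        return heads;
    }

    static double[] concat(double[][] heads) {
        int headDim = heads[0].length;
        double[] out = new double[heads.length * headDim];
        for (int h = 0; h < heads.length; h++)
            System.arraycopy(heads[h], 0, out, h * headDim, headDim);
        return out;
    }
}
```

Split and concat are exact inverses, which is why the multi-head wiring adds no parameters over the single-head case.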
File: optimizer/AdamOptimizer.java
Plain SGD has one learning rate for all parameters. Adam (Adaptive Moment Estimation) uses per-parameter adaptive learning rates.
First moment ($m$): an exponential moving average of the gradients — this is momentum: $m \leftarrow \beta_1 m + (1 - \beta_1)\, g$
Second moment ($v$): an exponential moving average of the squared gradients: $v \leftarrow \beta_2 v + (1 - \beta_2)\, g^2$
Bias correction: early in training, $m$ and $v$ are biased toward zero (they start at 0), so we correct with $\hat{m} = m / (1 - \beta_1^t)$ and $\hat{v} = v / (1 - \beta_2^t)$. The update is then $\theta \leftarrow \theta - \eta\, \hat{m} / (\sqrt{\hat{v}} + \varepsilon)$.
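The moment equations map one-to-one onto code. A sketch for a single scalar parameter (the real class is optimizer/AdamOptimizer.java; this standalone version is illustrative):

```java
// One Adam update for a single scalar parameter, matching the equations above.
public class AdamStep {
    double m = 0.0, v = 0.0; // first and second moment estimates
    int t = 0;               // timestep, for bias correction

    double update(double param, double grad, double lr,
                  double beta1, double beta2, double eps) {
        t++;
        m = beta1 * m + (1 - beta1) * grad;          // EMA of gradients
        v = beta2 * v + (1 - beta2) * grad * grad;   // EMA of squared gradients
        double mHat = m / (1 - Math.pow(beta1, t));  // bias correction
        double vHat = v / (1 - Math.pow(beta2, t));
        return param - lr * mHat / (Math.sqrt(vHat) + eps);
    }
}
```

On the very first step the bias correction makes `mHat = grad` and `vHat = grad²`, so the update is approximately `lr * sign(grad)` — Adam's step size is bounded regardless of gradient scale.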
The $\beta_1$, $\beta_2$, and $\varepsilon$ values used are set in optimizer/AdamOptimizer.java.
Learning rate schedule: starts at the initial learning rate and decays linearly to zero over the course of training:
private void run() {
for (int step = 0; step < numberOfSteps; step++) {
double learningRate = initialLearningRate * (1.0 - (double) step / numberOfSteps);
optimizer.zeroGradient(model.parameters());
double loss = model.trainStep(examples.get(step % examples.size()));
optimizer.step(model.parameters(), learningRate);
if ((step + 1) % 100 == 0) {
System.out.printf("Step %4d / %4d | Loss: %.4f | LR: %.6f%n",
step + 1, numberOfSteps, loss, learningRate);
}
}
}

Gradients must be zeroed before each backward pass because Value.backward() accumulates gradients (`+=`) rather than overwriting them.
Modern transformers (LLaMA, Mistral) remove bias terms. Since RMSNorm already handles scale and shift, biases add parameters without meaningful benefit. Removing them reduces the parameter count and matches current practice.
Although Karpathy's minimal version omits it, this implementation gives RMSNorm a learnable $\gamma$ scale. It starts at 1.0 (the identity), so training begins as if it were absent, and it only diverges from 1.0 if the optimizer finds that useful.
We apply RMSNorm before each sub-layer (Pre-Norm). The original transformer paper used Post-Norm (after), but Pre-Norm is more stable and is the convention in modern models.
Simple, no dependencies, vocabulary size is tiny (27 tokens). Sub-optimal for real language (BPE or WordPiece would give much better compression) but perfect for learning the fundamentals.
Weights are initialized with Xavier/Glorot scaling, keeping the variance of activations roughly constant across layers (see the Linear layer section).
Surrounding each name with <BOS> on both ends teaches the model two things simultaneously: how names begin (from BOS →
first char) and how names end (last char → BOS).
Karpathy's original averages loss over all positions in a document and does one backward pass per document. Our implementation does one backward pass per token pair (randomly sampled). Per-document would give more stable gradients; per-token is simpler and still converges.
| Step | Model | Key Concept Introduced | Approx. NLL |
|---|---|---|---|
| 1 | Statistical Bigram | Counting, Laplace smoothing, sampling | ~2.45 |
| 2 | Neural Bigram (manual) | Logits, softmax, SGD, manual gradients | ~2.35 |
| 3 | Neural Bigram (autograd) | Computational graph, chain rule, Value | ~2.35 |
| 4 | MLP | Embeddings, context window, hidden layer, tanh | ~2.1 |
| 5 | GPT (single-head) | Attention, causal mask, residuals, RMSNorm | ~2.4 |
| 6 | GPT (multi-head) | Multi-head attention, Attention interface, Adam | ~2.3 |
Steps 5–6 have higher NLL than the MLP at 1000 training steps because the transformer has more parameters and needs more steps to converge — but it has far greater capacity for longer sequences.
Inspired by Andrej Karpathy's microgpt.py. The original Python implementation achieves the same thing in ~200 lines. This Java port expands it into a structured, educational codebase while preserving the spirit of the original.

