The real building blocks of modern AI — explained from first principles, with genuine intuition.
Series: From Neural Networks to Transformers — Article 1
There’s a phrase that gets repeated endlessly in AI articles:
“Neural networks learn patterns from data.”
And it sounds reasonable. Until you sit with it for a moment and realize: what does that actually mean? What is a “pattern”? What does it mean to “learn”? What is actually happening inside these models that lets them write poetry, diagnose tumors, and beat world champions at chess?
Most explanations either hand-wave through the intuition or drown you in equations. This article does neither. By the end, you’ll have a genuine mental model of how neural networks work — not just a vague sense that they’re “loosely inspired by the brain.”
Let’s build this from the ground up.
Part 1: Why We Stopped Writing Rules
Before neural networks took over, AI researchers tried a different approach: write the rules explicitly.
To build a spam filter, you’d write something like:
IF email contains "$$$" → spam
IF email contains "free money" → spam
IF email contains "invoice" → probably fine
IF email contains a link → suspicious
This worked — for a while. Then reality intervened.
What happens when a legitimate email from your bank contains “free” and a link? What about a phishing email that carefully avoids every keyword you’ve flagged? Language is ambiguous. Context matters. Edge cases multiply faster than you can write rules.
The deeper problem is this: humans are terrible at introspecting on how they recognize things. How do you know a cat from a dog? You just… know. Try writing an explicit rule for that. It’s nearly impossible.
So researchers had a radical idea: instead of telling the computer what rules to use, what if we gave it examples and let it figure out the rules itself?
That’s the entire premise of machine learning.
Part 2: Learning as Finding a Function
Here’s the key abstraction that makes everything else make sense.
Somewhere in the universe, there exists a function — call it f — that maps inputs to correct outputs:
- Input: a photo → Output: “cat” or “dog”
- Input: an email → Output: spam or not spam
- Input: a sentence in English → Output: the same sentence in French
We don’t know what f looks like. But we have thousands (or millions) of examples of its inputs and outputs:
(photo_1, "cat"), (photo_2, "dog"), (photo_3, "cat"), ...
The job of machine learning is to find a function f̂ that closely matches f — not by deriving it mathematically, but by looking at enough examples.
A neural network is just one particularly powerful and flexible way to represent that function. The reason neural networks win is that they can, in principle, approximate any function given enough capacity. We’ll come back to why that’s true.
Part 3: The Artificial Neuron — A Weighted Opinion
The basic unit of a neural network is the neuron. Despite the biological branding, it’s actually much simpler than a real brain cell. Think of it as a very small, very opinionated calculator.
Here’s what a single neuron does:
Step 1: Form a weighted opinion.
It receives several inputs — numbers representing features of your data. Each input gets multiplied by a weight, which represents how much that input matters. Then everything gets summed together, plus a bias term (think of bias as a baseline activation level).
Mathematically:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
Or in vector form:
z = wᵀx + b
Step 2: Make a decision.
The raw sum z gets passed through an activation function σ, which squashes or transforms it:
a = σ(z)
The output a is what flows to the next part of the network.
A Concrete Example
Imagine a neuron trying to predict whether an email is spam. It receives three inputs:
- x₁ = number of suspicious words
- x₂ = number of links
- x₃ = length of email
The neuron’s weights might look like: w₁ = 0.9, w₂ = 0.4, w₃ = -0.1
This means the neuron has learned that suspicious words are the strongest signal, links are mildly suspicious, and longer emails are less likely to be spam (legitimate emails tend to be longer). The neuron didn’t arrive at these weights through logic — it discovered them by looking at thousands of examples.
That’s the magic. The knowledge lives in the weights.
Part 4: Why Nonlinearity Changes Everything
Here’s a question that trips up a lot of people: if we’re just doing math on numbers, why do we need this “activation function” at all? Why not just use the raw sum?
The answer is one of the most important ideas in deep learning.
Linear operations stacked on linear operations are still just… linear operations.
No matter how many layers you add, if every layer just does output = Wx + b, the entire network could be collapsed into a single layer. You’d never be able to model anything more complex than a straight line (or plane, or hyperplane).
But real-world relationships aren’t linear. The relationship between pixels and “cat-ness” is wildly non-linear. The relationship between words and sentiment is deeply non-linear.
Activation functions introduce nonlinearity — kinks, curves, thresholds — that allow the network to carve up its input space in complex, useful ways.
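The collapse claim is easy to verify numerically. This sketch stacks two layers with no activation function and shows the result is identical to one combined linear layer (the matrix sizes and random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function between them
W1, b1 = rng.standard_normal((4, 5)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((5, 3)), rng.standard_normal(3)

x = rng.standard_normal(4)
two_layer = (x @ W1 + b1) @ W2 + b2

# Collapse both into a single equivalent linear layer
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layer, one_layer))  # True -- depth bought us nothing
```

Insert a nonlinearity between the layers and this algebraic collapse becomes impossible.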
The Three Activations You Need to Know
Sigmoid — the historical classic:
σ(x) = 1 / (1 + e⁻ˣ)
Squashes any input into a range of (0, 1). Useful for probabilities. Fell out of favor for hidden layers because it causes vanishing gradients — a problem we’ll explain shortly.
Tanh — the improved sigmoid:
tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)
Similar shape to sigmoid but outputs range from (-1, 1) and is zero-centered, which helps training. Still suffers from vanishing gradients at extremes.
ReLU — the modern workhorse:
ReLU(x) = max(0, x)
Brutally simple. If the input is positive, pass it through unchanged. If negative, output zero. This simplicity is its strength: it’s fast to compute, doesn’t saturate for large positive values, and makes gradients flow cleanly through deep networks. Most modern networks default to ReLU or a close variant.
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

# Try it
x = np.array([-3, -1, 0, 1, 3])
print("Sigmoid:", sigmoid(x))  # [0.047, 0.269, 0.5, 0.731, 0.953]
print("Tanh:   ", tanh(x))     # [-0.995, -0.762, 0, 0.762, 0.995]
print("ReLU:   ", relu(x))     # [0, 0, 0, 1, 3]
```
Part 5: Layers — Where “Deep” Comes From
A single neuron can only form a single weighted opinion. That’s not very powerful. But stack hundreds of neurons side by side into a layer, and then stack multiple layers on top of each other, and something remarkable happens.
Each layer learns to represent the world at a different level of abstraction.
This is easiest to see in vision. A deep network trained on images learns, layer by layer:
| Layer | What It Detects |
|---|---|
| Layer 1 | Raw edges, color gradients |
| Layer 2 | Corners, curves, textures |
| Layer 3 | Eyes, wheels, fur, windows |
| Layer 4 | Faces, cars, animals, buildings |
| Layer 5 | Full objects in context |
Nobody programmed these layers to work this way. The network discovered this hierarchical decomposition on its own, because it turned out to be the most efficient way to represent the training data.
This is what “deep learning” means: the word “deep” just refers to many layers.
For an entire layer, the math is the same as for a single neuron, but now with matrices instead of vectors:
Z = XW + b
A = σ(Z)
Where X is your batch of inputs, W is the weight matrix for the whole layer, and A is the layer’s output — which becomes the input to the next layer.
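A quick sketch of the shapes involved (the batch size, feature count, and neuron count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.standard_normal((32, 10))  # batch of 32 examples, 10 features each
W = rng.standard_normal((10, 16))  # one weight column per neuron: 16 neurons
b = np.zeros(16)

Z = X @ W + b        # every neuron's weighted sum, computed all at once
A = np.maximum(0, Z) # ReLU activation, applied element-wise

print(A.shape)  # (32, 16): 16 activations per example, input to the next layer
```

One matrix multiplication replaces 16 separate neuron computations, which is why GPUs, built for exactly this operation, made deep learning practical.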
Part 6: The Loss Function — Giving the Network a Report Card
The network makes predictions. But how does it know if they’re good?
Enter the loss function (also called the cost function). It takes the network’s predictions and the true labels, and computes a single number representing how wrong the network is. The bigger the loss, the worse the predictions.
For regression (predicting a continuous value), the most common loss is Mean Squared Error:
L = (1/n) Σ (yᵢ - ŷᵢ)²
If the network predicts house prices and gets within $5,000 on average, that’s a low loss. If it’s off by $200,000, that’s a high loss.
For classification (predicting a category), we use Cross-Entropy Loss:
L = -Σ yᵢ log(ŷᵢ)
This loss punishes the network especially hard when it’s confidently wrong — which is exactly the behavior you want. If the true label is “cat” and the network says “99% dog,” the loss is enormous.
The loss function is the mechanism by which the outside world’s expectations get translated into a learning signal. Without it, the network has no idea what “better” means.
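Both losses are one-liners in NumPy. Here's a sketch using made-up house prices for MSE, plus the "confidently wrong" effect for cross-entropy:

```python
import numpy as np

# Mean squared error: two house-price predictions, one close, one off by $200k
y_true = np.array([300_000.0, 450_000.0])
y_pred = np.array([305_000.0, 250_000.0])
mse = np.mean((y_true - y_pred) ** 2)
print(f"{mse:,.0f}")               # the $200k miss dominates the average

# Binary cross-entropy for a true label of 1 ("cat"), as a function of
# the predicted probability p assigned to "cat"
def bce(p):
    return -np.log(p)

print(round(bce(0.9), 3))   # 0.105 -- confident and right: tiny loss
print(round(bce(0.01), 2))  # 4.61  -- "99% dog" on a cat: enormous loss
```

Note how the cross-entropy loss grows without bound as the predicted probability of the true class approaches zero: confident mistakes are punished hardest.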
Part 7: Gradient Descent — The Learning Algorithm
Now we have a loss. The question becomes: how do we reduce it?
Here’s the key insight: the loss is a function of all the weights in the network. If we imagine a landscape where each point represents a set of weights and the elevation represents the loss, training is the process of finding the lowest valley in this landscape.
The algorithm for doing this is called gradient descent.
The gradient of the loss (∇L) is a vector that points in the direction of steepest increase. So to decrease the loss, we move in the opposite direction:
θ ← θ - η · ∇L
Where:
- θ represents all the weights (parameters)
- η (eta) is the learning rate — how big a step we take
- ∇L is the gradient — which direction is uphill
Repeat this process thousands of times across thousands of examples, and the weights slowly converge toward values that make good predictions.
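Here's the update rule in action on a toy one-dimensional loss, L(θ) = (θ - 3)², where the gradient 2(θ - 3) is known in closed form (the loss, starting point, and learning rate are arbitrary choices for illustration):

```python
# Gradient descent on L(θ) = (θ - 3)², minimized at θ = 3
theta = 0.0
eta = 0.1

for _ in range(100):
    grad = 2 * (theta - 3)      # ∇L at the current θ
    theta = theta - eta * grad  # the update rule: θ ← θ - η·∇L

print(round(theta, 4))  # 3.0 -- converged to the minimum
```

A real network does exactly this, except θ is millions of numbers and the gradient comes from backpropagation rather than a hand-derived formula.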
The Learning Rate Problem
The learning rate is one of the most important hyperparameters to get right:
- Too large: The network takes massive steps and overshoots the minimum, bouncing around chaotically or even diverging.
- Too small: Training takes forever, and you might get trapped in a poor local minimum.
Getting the learning rate right is part art, part science — and a major focus of modern optimization research.
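A toy one-dimensional loss, L(θ) = (θ - 3)², makes both failure modes easy to see (the specific learning rates below are arbitrary illustrations):

```python
def descend(eta, steps=20):
    """Run gradient descent on L(θ) = (θ - 3)² and return the final θ."""
    theta = 0.0
    for _ in range(steps):
        theta -= eta * 2 * (theta - 3)  # θ ← θ - η·∇L
    return theta

print(descend(0.1))   # close to 3: a well-chosen step size
print(descend(0.01))  # still far from 3 after 20 steps: too slow
print(descend(1.1))   # far from 3 and getting worse: each step overshoots
```

With η = 1.1, every step jumps past the minimum by more than it approached it, so the iterates oscillate outward and diverge.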
Part 8: Backpropagation — The Engine Under the Hood
Computing gradients for a network with millions of parameters sounds impossibly hard. If you had to compute each gradient separately, it would be. But there’s a clever algorithm that does it in one efficient pass: backpropagation.
Backprop is just the chain rule of calculus, applied systematically and repeatedly.
The chain rule says: if L = f(g(x)), then:
dL/dx = (dL/dg) · (dg/dx)
In a neural network, the loss is a function of the output layer, which is a function of the previous layer, which is a function of the layer before that, and so on. Backprop unravels this chain of functions from right to left — from the loss back to the first layer — computing gradients at each step.
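You can sanity-check the chain rule numerically. Taking g(x) = sin(x) and f(u) = u² (arbitrary choices), the analytic gradient from the chain rule should match a finite-difference estimate:

```python
import math

# L = f(g(x)) with g(x) = sin(x) and f(u) = u², so dL/dx = 2·sin(x)·cos(x)
x = 0.7
analytic = 2 * math.sin(x) * math.cos(x)  # (dL/dg)·(dg/dx)

# Finite-difference check: nudge x slightly and watch how L responds
h = 1e-6
L = lambda v: math.sin(v) ** 2
numeric = (L(x + h) - L(x - h)) / (2 * h)

print(abs(analytic - numeric) < 1e-6)  # True: the chain rule checks out
```

Backprop applies this same decomposition layer by layer, caching each intermediate derivative so nothing is recomputed.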
Here’s the conceptual picture:
[Input] → [Layer 1] → [Layer 2] → [Layer 3] → [Loss]
Forward pass: data flows left to right →→→→→
Backward pass: gradients flow right to left ←←←←←
Each weight receives a gradient that tells it: “if you increase slightly, does the loss go up or down, and by how much?” This is how each neuron learns its responsibility for the network’s errors.
The reason this is powerful: backprop computes gradients for all parameters in a single backward pass. No matter how many layers or neurons, the computation scales linearly. This is what makes training deep networks tractable.
Part 9: Putting It All Together
Here’s a minimal neural network implemented from scratch — no PyTorch, no TensorFlow, just NumPy:
```python
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights with small random values
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))

    def relu(self, z):
        return np.maximum(0, z)

    def relu_grad(self, z):
        return (z > 0).astype(float)  # 1 where z > 0, else 0

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def forward(self, X):
        # Layer 1
        self.Z1 = X.dot(self.W1) + self.b1
        self.A1 = self.relu(self.Z1)
        # Layer 2 (output)
        self.Z2 = self.A1.dot(self.W2) + self.b2
        self.A2 = self.sigmoid(self.Z2)
        return self.A2

    def compute_loss(self, y_pred, y_true):
        # Binary cross-entropy
        eps = 1e-8  # prevent log(0)
        return -np.mean(y_true * np.log(y_pred + eps) +
                        (1 - y_true) * np.log(1 - y_pred + eps))

    def backward(self, X, y_true, learning_rate=0.01):
        n = X.shape[0]
        # Output layer gradient
        dZ2 = self.A2 - y_true
        dW2 = self.A1.T.dot(dZ2) / n
        db2 = np.sum(dZ2, axis=0, keepdims=True) / n
        # Hidden layer gradient (chain rule)
        dA1 = dZ2.dot(self.W2.T)
        dZ1 = dA1 * self.relu_grad(self.Z1)
        dW1 = X.T.dot(dZ1) / n
        db1 = np.sum(dZ1, axis=0, keepdims=True) / n
        # Update weights
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1

# Usage
nn = NeuralNetwork(input_size=3, hidden_size=8, output_size=1)
X = np.random.randn(100, 3)                       # 100 examples, 3 features each
y = (np.random.rand(100, 1) > 0.5).astype(float)  # binary labels

for epoch in range(1000):
    predictions = nn.forward(X)
    loss = nn.compute_loss(predictions, y)
    nn.backward(X, y, learning_rate=0.1)
    if epoch % 100 == 0:
        print(f"Epoch {epoch}: Loss = {loss:.4f}")
```
This 60-line implementation contains the complete learning loop: forward pass, loss computation, backpropagation, and weight updates. Everything in PyTorch and TensorFlow is this, scaled up by a factor of millions.
Part 10: The Universal Approximation Theorem
Here’s a fact that’s almost philosophically unsettling:
A neural network with a single hidden layer and enough neurons can approximate any continuous function (on a bounded region of inputs) to arbitrary precision.
This is the Universal Approximation Theorem, and it’s why neural networks are so powerful. They’re not limited to modeling linear relationships, or polynomial ones, or any particular family of functions. They’re a universal function approximator.
But this theorem has a catch — or really, two catches:
- “Enough neurons” can mean a lot of neurons. In the worst case, exponentially many.
- The theorem says nothing about whether you can find the right weights. It just says the right weights exist somewhere.
This is why depth matters. Deep networks can represent the same functions as shallow ones far more efficiently — for many families of functions, with exponentially fewer parameters. A 10-layer network can learn representations that a 1-layer network would need to be astronomically large to match.
Depth isn’t just about having more capacity. It’s about having the right kind of structure to build complex representations hierarchically.
The Limits of What We’ve Built
We now have a powerful framework. But there’s a fundamental assumption baked into everything we’ve discussed:
The network processes fixed-size inputs.
You define the input layer size once, at the beginning, and it never changes. Every example must be the same shape.
This is fine for structured data — a table with 10 columns, or a 28×28 pixel image. But the real world is full of sequential data where this assumption breaks down completely:
- A sentence can be 3 words or 300 words.
- A piece of music can be 30 seconds or 30 minutes.
- A conversation has no fixed length.
Worse, order matters in sequences. “The dog bit the man” and “The man bit the dog” contain the same words but mean completely different things. A standard neural network has no way to represent this.
To handle sequences, researchers needed to give neural networks something they fundamentally lacked: memory.
That led to Recurrent Neural Networks — and eventually to the architecture that powers every major language model today.
What Comes Next
In Article 2, we dig into sequence modeling:
- Why RNNs were such a breakthrough
- How recurrent connections give networks a form of memory
- Why that memory broke down for long sequences (the vanishing gradient problem, revisited)
- And why researchers eventually abandoned RNNs entirely in favor of something stranger and more powerful: attention
→ Next: “Why Your Smart Speaker Can’t Understand You: The Memory Problem in AI”
If this article helped something click, share it with someone who’s been nodding along to AI explanations without quite understanding them. That’s exactly who it’s written for.
