Neural Networks
Hypotheses, activations, forward propagation, losses, and initialisation
Neural networks can feel abstract until you see how the pieces fit together. This walkthrough links the core ideas to the supporting math and code so each component slots into place.
Neural network hypothesis
A feed-forward neural network applies a sequence of affine transformations followed by element-wise nonlinear functions. Each layer produces intermediate activations that serve as the inputs to the next layer.
Mathematics. In linear or logistic regression, the hypothesis maps inputs \(x\) to output \(\hat{y}\) via a single affine transform (and possibly a sigmoid). A neural network composes multiple transforms and nonlinearities across layers:
Let the network have \(L\) layers (not counting the input as a layer). For layer \(\ell = 1, \dots, L\):
- Parameters: \(W^{[\ell]} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\), \(b^{[\ell]} \in \mathbb{R}^{n_\ell}\)
- Pre-activation: \(z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}\)
- Activation: \(a^{[\ell]} = g^{[\ell]}(z^{[\ell]})\)
with \(a^{[0]} = x\). The hypothesis is the function \(h_{\Theta}(x) = a^{[L]}\) produced by repeated forward propagation.
Batch view. Vectorised over a batch \(X \in \mathbb{R}^{n_0 \times m}\): \(Z^{[\ell]} = W^{[\ell]}A^{[\ell-1]} + b^{[\ell]}\mathbf{1}^\top,\quad A^{[\ell]} = g^{[\ell]}(Z^{[\ell]})\).
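Writing out the case \(L = 2\) makes the composition concrete: \(h_{\Theta}(x) = g^{[2]}\big(W^{[2]}\, g^{[1]}(W^{[1]} x + b^{[1]}) + b^{[2]}\big)\). If the \(g^{[\ell]}\) were identity functions, this would collapse to a single affine map \(W^{[2]}W^{[1]}x + W^{[2]}b^{[1]} + b^{[2]}\), which is why the nonlinearities are what give the network its extra expressive power.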
Activation functions
Activation functions introduce nonlinearity, allowing the network to represent decision boundaries that are not purely linear. They also control gradient flow during training, so the choice of function affects both expressiveness and optimisation stability.
What to know.
- Sigmoid \(\sigma(z)=\frac{1}{1+e^{-z}}\): outputs in \((0,1)\). Well suited to probabilities; watch out for saturated gradients when \(|z|\) is large.
- Tanh \(\tanh(z)\): zero-centred, which mitigates bias shifts; still saturates at the extremes.
- ReLU \(\max(0,z)\): keeps positive values and zeroes the rest. Trains quickly, but a unit can “die” if its pre-activations stay negative, leaving it with zero gradient.
- Leaky ReLU / GELU / ELU: variants that smooth the kink or allow a small negative slope to avoid dead neurons.
- Softmax for multi-class output: \(\displaystyle \mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_j e^{z_j}}\) turns logits into probabilities.
Default playbook. ReLU (or GELU) covers most hidden layers, sigmoid suits binary outputs, softmax suits multi-class heads, and tanh remains useful when symmetry around zero is helpful.
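Two of the variants named above do not appear in the longer NumPy example later on, so here is a minimal sketch of them; the leaky slope of 0.01 and the tanh-based GELU approximation are conventional choices rather than anything prescribed by this walkthrough.

import numpy as np

def leaky_relu(z, alpha=0.01):
    # Keep a small negative slope so units with z < 0 still receive gradient.
    return np.where(z > 0, z, alpha * z)

def gelu(z):
    # Widely used tanh approximation of the Gaussian error linear unit.
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))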
Forward propagation
The forward pass iterates through layers, computing pre-activations and activations in order. Intermediate values are stored for later use in gradient computations and for inspection when debugging.
Algorithm. For each layer \(\ell = 1, \dots, L\):
1. \(Z^{[\ell]} = W^{[\ell]}A^{[\ell-1]} + b^{[\ell]}\mathbf{1}^\top\)
2. \(A^{[\ell]} = g^{[\ell]}(Z^{[\ell]})\)
Return \(A^{[L]}\) as the prediction(s). Vectorisation keeps this efficient for batches.
Numerical stability tips. Use the log-sum-exp trick for softmax and clamp probabilities to \([\varepsilon, 1-\varepsilon]\) when computing log-losses.
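To illustrate the softmax tip, here is a minimal sketch of a numerically stable softmax over a batch, using the columns-as-examples convention from above (the name stable_softmax is mine, not from the text); subtracting the per-column maximum before exponentiating is the log-sum-exp trick applied to the exponents.

import numpy as np

def stable_softmax(Z):
    # Softmax is invariant to shifting each column by a constant,
    # so subtract the column-wise max to avoid overflow in exp.
    Z_shift = Z - Z.max(axis=0, keepdims=True)
    expZ = np.exp(Z_shift)
    return expZ / expZ.sum(axis=0, keepdims=True)

Z = np.array([[1000.0, -5.0],
              [ 999.0,  3.0]])
print(stable_softmax(Z))  # finite values, each column sums to 1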
Loss functions
Loss functions convert network predictions and reference targets into a single scalar objective. Minimising that scalar guides the optimiser toward parameters that align predictions with observed data.
Core losses.
- Binary classification: \(J = -\frac{1}{m}\sum_i \big[y_i\log \hat{y}_i + (1-y_i)\log (1-\hat{y}_i)\big]\) with \(\hat{y} = \sigma(z^{[L]})\).
- Multi-class (softmax): \(J = -\frac{1}{m}\sum_i \log \hat{p}_{i,y_i}\), where \(\hat{p}_i = \mathrm{softmax}(z^{[L]}_i)\).
- Regression: mean squared error \(J = \tfrac{1}{2m}\sum_i \lVert \hat{y}_i - y_i \rVert^2\).
Add regularisation (e.g., L2 weight decay) to reduce overfitting: \(J_\lambda = J + \frac{\lambda}{2m}\sum_{\ell}\lVert W^{[\ell]} \rVert_F^2\).
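A minimal NumPy sketch of these objectives under the columns-as-examples convention used above; the helper names and the choice of eps are illustrative, not taken from the text.

import numpy as np

def binary_cross_entropy(Y_hat, Y, eps=1e-12):
    # Clamp probabilities away from 0 and 1 before taking logs.
    Y_hat = np.clip(Y_hat, eps, 1 - eps)
    return -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

def softmax_cross_entropy(P_hat, y, eps=1e-12):
    # P_hat: (K, m) class probabilities; y: (m,) integer class labels.
    m = y.shape[0]
    return -np.mean(np.log(np.clip(P_hat[y, np.arange(m)], eps, 1.0)))

def mse(Y_hat, Y):
    # Mean of 0.5 * squared error per example (columns are examples).
    return 0.5 * np.mean(np.sum((Y_hat - Y) ** 2, axis=0))

def l2_penalty(weights, lam, m):
    # Frobenius-norm weight decay term to add to any loss above.
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)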
Parameter initialisation
Suitable initial values keep activations and gradients within reasonable ranges at the start of training. Poor choices can create symmetric neurons that behave identically or cause activations to saturate before learning begins.
Guidelines.
- Avoid all-zero weights; symmetry would keep neurons identical.
- Xavier/Glorot for tanh or sigmoid: \(W^{[\ell]} \sim \mathcal{N}\big(0, \tfrac{2}{n_{\ell-1}+n_\ell}\big)\) or the uniform equivalent.
- He/Kaiming for ReLU: \(W^{[\ell]} \sim \mathcal{N}\big(0, \tfrac{2}{n_{\ell-1}}\big)\).
- Biases \(b^{[\ell]}=0\) are typically fine.
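As a sketch, the two schemes written as NumPy initialisers matching the weight shape \((n_\ell, n_{\ell-1})\) used in the code below (the function names are illustrative; he_init reappears in the full example):

import numpy as np

def xavier_init(n_in, n_out, rng):
    # Variance 2 / (n_in + n_out), suited to tanh or sigmoid layers.
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_init(n_in, n_out, rng):
    # Variance 2 / n_in, suited to ReLU layers.
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))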
Minimal NumPy forward pass
This reference implementation keeps the forward equations explicit for inspection or quick experiments.
import numpy as np

def sigmoid(z):
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    neg = ~pos
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[neg])
    out[neg] = ez / (1.0 + ez)
    return out

def relu(z):
    return np.maximum(0, z)

def he_init(n_in, n_out, rng):
    return rng.normal(0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def forward(X, params):
    """Run forward propagation over layers defined by params."""
    A = X
    for (W, b, act) in params[:-1]:
        Z = W @ A + b[:, None]
        A = relu(Z) if act == "relu" else np.tanh(Z)
    W, b, _ = params[-1]
    ZL = W @ A + b[:, None]
    return sigmoid(ZL)

# Example architecture: 2 -> 8 -> 4 -> 1 (binary)
rng = np.random.default_rng(0)
n_in, m = 2, 512
X = rng.normal(size=(n_in, m))

W1 = he_init(n_in, 8, rng); b1 = np.zeros(8)
W2 = he_init(8, 4, rng); b2 = np.zeros(4)
W3 = he_init(4, 1, rng); b3 = np.zeros(1)
params = [(W1, b1, "relu"), (W2, b2, "relu"), (W3, b3, "sigmoid")]

AL = forward(X, params)
print(AL.shape, AL.min(), AL.max())  # (1, m), ~[0, 1]
For multi-class problems, replace the final activation with softmax and compute cross-entropy loss on one-hot targets.
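One way to make that change, sketched by reusing the params structure from above (softmax_head and the integer label vector y are illustrative names, not part of the original code):

import numpy as np

def softmax_head(X, params):
    # Same hidden layers as forward(), but a softmax output instead of a sigmoid.
    A = X
    for (W, b, act) in params[:-1]:
        Z = W @ A + b[:, None]
        A = np.maximum(0, Z) if act == "relu" else np.tanh(Z)
    W, b, _ = params[-1]
    ZL = W @ A + b[:, None]
    ZL -= ZL.max(axis=0, keepdims=True)            # stabilise before exponentiating
    expZ = np.exp(ZL)
    return expZ / expZ.sum(axis=0, keepdims=True)  # (K, m) class probabilities

# Cross-entropy against integer labels y of shape (m,):
# P = softmax_head(X, params); loss = -np.mean(np.log(P[y, np.arange(m)] + 1e-12))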
Practical habits
- Scaling. Standardise inputs and reuse the same transform at inference time (see the sketch after this list).
- Regularisation. Weight decay, dropout, and early stopping control variance.
- Batching. Mini-batches stabilise gradient estimates and hardware throughput.
- Monitoring. Track training versus validation metrics to catch drift early.
- Reproducibility. Fix RNG seeds and log configurations so experiments can be replayed.
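For the scaling point, a minimal sketch of fitting standardisation statistics on training data and reusing them at inference time (the Scaler class and the train/validation variable names are illustrative):

import numpy as np

class Scaler:
    """Per-feature standardisation; rows are features, columns are examples."""
    def fit(self, X):
        self.mean = X.mean(axis=1, keepdims=True)
        self.std = X.std(axis=1, keepdims=True) + 1e-8  # avoid division by zero
        return self

    def transform(self, X):
        return (X - self.mean) / self.std

# scaler = Scaler().fit(X_train)
# X_train_s, X_val_s = scaler.transform(X_train), scaler.transform(X_val)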
Where to go next
- Implement backpropagation for the architectures above to connect gradients to the forward equations.
- Explore modern activation choices (GELU, Swish) and compare training curves.
- Experiment with initialisation scalings on deeper networks to observe exploding or vanishing activations.