Function Approximation as Supervised Learning

Function approximation is about teaching a computer to guess the rules of a hidden game using examples. You have inputs, like ingredients, and outputs, like the taste of a cake, but the exact recipe is unknown. Supervised learning trains a model using labeled examples, which are pairs of inputs and correct outputs. This training helps the model mimic the mystery function. The goal is to build a formula that makes solid predictions on new, unseen data, even without knowing the true math behind it.

The Parametric Approach

At its core, supervised learning is about function approximation. We assume there is a "true" hidden rule in nature that connects an input $x$ to an output $y$.

This rule is not directly visible, so we use a parametric function $f_\theta$ to approximate it. The $\theta$ in $f_\theta$ is a set of adjustable parameters. By adjusting these parameters, we change the function's behavior until its predictions, $\hat{y}$, match the real-world results, $y$, as closely as possible.

In essence, we're adjusting the function's parameters of our model so that the predicted output, $\hat{y}$, closely matches the actual output, $y$.

But we have a problem. It can be understood by understanding our needs and the practical reality.

In a perfect world, we would have access to the Data Distribution $p_{data}(x, y)$, a mathematical map that details every possible pair of $(x, y)$ and their likelihood of occurrence. Our goal is to evaluate how well our parameters ($\theta$) are set, which is achieved by defining a Distance Function $D(\hat{y}, y)$. This function measures the error in our guess. Since we're more concerned with accuracy on common data points, we focus on those that occur frequently. The Expected Cost $C(\theta)$ is calculated as the average error, weighted by the likelihood of each data point. This is expressed as: $$C(\theta) = \mathbb{E}_{(x,y) \sim p_{data}} [D(f_\theta(x), y)] = \int_x \int_y p_{data}(x, y) D(\hat{y}, y) dx dy$$ Intuition: Consider we are learning to play darts. The Data Distribution $p_{data}$ indicates where the board is usually located. The Expected Cost represents the average distance from the bullseye if we were to throw an infinite number of darts. Our objective is to minimize this average distance, a concept known as Expected Risk Minimization.

But, that's not the reality, we don't live in a perfect world, thus: We don't know $p_{data}$. We don't have a map of every possible human thought or image; we only have a small "training set" of $N$ examples that someone collected for us: $$\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$$

Since we can't calculate the true average over all possible data, we calculate the average over the data we actually have. This is the Empirical Cost Function $\tilde{C}(\theta)$: $$\tilde{C}(\theta) = \frac{1}{N} \sum_{n=1}^{N} D(f_\theta(x_n), y_n)$$

For example, since we can't throw darts for eternity, we throw 100 darts and measure the average score on those specific throws. This method is called a Monte Carlo approximation, which uses a limited sample to estimate a larger, unknown value.

Why does sampling work? In probability, if we were to toss a coin 10 times, the chance of getting Heads and Tails might be 70% and 30%. But if were to increase the sample to let's say 100,000 times. The probability of the each event would come out to be $1/2$.

Example for Law of large numbers. You can run the program to verify it here.

import random
import pandas as pd

# Simulate 100,000 coin flips
flips = [random.choice([0, 1]) for _ in range(1000000)]

# Count heads (1) and tails (0)
heads_count = sum(flips)
tails_count = len(flips) - heads_count

# Create a summary DataFrame
summary = pd.DataFrame({
    'Outcome': ['Tails (0)', 'Heads (1)'],
    'Count': [tails_count, heads_count],
    'Percentage': [tails_count / len(flips) * 100, heads_count / len(flips) * 100]
})

print(summary)

We aim to minimize the Expected Cost, which is the true cost, but we can only calculate the Empirical Cost, which is based on our sample. In machine learning, our approach is to collect a large amount of data, with the idea that minimizing the error on our sample will ultimately result in a model that is effective in the real world.

Learning as Optimization

Moving forward, we need to understand that, when they say a machine is "learning," they really mean it is solving a math problem. Specifically, it is trying to find the best set of parameters ($\theta$) that makes our error (the Empirical Cost, $\tilde{C}$) as small as possible.

Gradient Descent

In some rare cases, we can solve for the perfect theta using a single equation, a closed-form solution. For complex models like neural networks, this is impossible. We use an iterative approach instead, starting with random parameters and refining them bit by bit.

The most popular method is Gradient Descent. The Gradient is a measure that points in the direction where the error increases fastest, like a compass pointing uphill. To reduce error, we move in the opposite direction of the gradient, downhill. The Learning Rate, denoted by $\eta$, determines our step size. If $\eta$ is too small, learning takes a long time. If $\eta$ is too large, we might overshoot the minimum and fail to find the optimal solution. This update rule is captured by the equation $\theta_{new} = \theta_{old} = \eta \nabla J(\theta)$, where $\nabla J(\theta)$ is the gradient of the cost function.

Overshooting: In a gradient descent, reaching the lowest point at the bottom can be a challenging task. Taking tiny steps, with a small learning rate $\eta$, means we will get there eventually, but it will take a long time. On the other hand, taking huge leaps with a large $\eta$ can cause problems. We might jump from one side of the valley to the other, and end up climbing back up the opposite wall, never reaching the bottom. This can be particularly bad if you jump so far that you end up higher than where you started.

The Math: For our use case, we change the cost function for gradient descent with empirical cost, that is replacing $\nabla J(\theta)$ with $\nabla \tilde{C}(\theta)$; $$\theta_{new} \leftarrow \theta_{old} - \eta \nabla \tilde{C}(\theta)$$

Scaling Up: Stochastic Gradient Descent (SGD)

As our datasets grow to millions or billions of examples, calculating the gradient becomes a nightmare. To compute the gradient, we technically have to look at every single example in our database just to take one tiny step. This is where the minibatch comes in. Instead of looking at the whole database, we look at a small, random handful of data points, a minibatch. We approximate the true gradient using only this subset, which is calculated as the average of the gradients of the individual data points in the minibatch: $$\nabla \tilde{C}_M(\theta) = \frac{1}{|M|} \sum_{m \in M} \nabla D(\hat{y}_m, y_m)$$

Why this works?

Simple answer is Population. When surveying a large crowd to find the average height, we don't need to measure every individual to get a good estimate. A random sample of 100 people can provide a fairly accurate guess. This concept is applied in machine learning, where Stochastic Gradient Descent uses these estimates to take many quick steps, rather than one precise step.

The noise from using smaller samples is actually beneficial. It helps prevent the model from getting stuck in local minima, which are small dips in the landscape that are not the lowest point. In high-dimensional spaces, such as neural networks with millions of parameters, local minima are less common than saddle points. These are points where the gradient is zero, but they are not true minima. The gradients from smaller samples are noisy, and this noise acts like small vibrations that shake the model out of these saddle points, allowing it to continue searching for better solutions.

When do we stop learning?

If we use Stochastic Gradient Descent to minimize our error, the logical conclusion is to keep going until the error is zero. However, there are several hurdles to consider. We can't see the Expected Cost, which is the real world, and the Empirical Cost, our data, is noisy and expensive to compute. The error map is often non-convex, meaning it has many local minima.

The biggest issue is the difference between our sample data and reality. A set of parameters that fits the training data well may perform poorly in the real world, a problem known as Overfitting. This can be seen in the math, where a set of parameters that works for the training data does not work for the real world.

To solve this, we split our data into two groups:

$D_{train}$: The data the model "studies" from.
$D_{val}$: A "validation" set the model never sees during training, used as a pop quiz to check its progress.

We define two separate costs: $$C_{train}(\theta) = \frac{1}{|D_{train}|} \sum_{(x,y) \in D_{train}} D(\hat{y}, y)$$ $$C_{val}(\theta) = \frac{1}{|D_{val}|} \sum_{(x,y) \in D_{val}} D(\hat{y}, y)$$

While training, we keep a close eye on the validation cost, $C_{val}$. At first, both the training cost, $C_{train}$, and $C_{val}$ tend to decrease. However, as training progresses, $C_{train}$ continues to drop, but $C_{val}$ begins to increase. This indicates that the model is no longer learning the underlying rules, and is instead memorizing the noise.

The early stopping method involves freezing the parameters, $\theta$, at the point when the validation cost is at its lowest.

Model Selection

The choice of selecting the best model from a choices of machine is called a Hypothesis, denoted as $M$. The Hypothesis Space, denoted as $H$, is a collection of every possible function we could use, including linear, quadratic, or complex neural networks with many layers, such as a 100-layer neural network. Our objective is to find the best model for the job.

Because we want to find the best possible hypothesis, we might test several different types:

$M_{H_1}$: A simple linear model ($y = ax + b$).
$M_{H_2}$: A complex polynomial model ($y = ax^2 + bx + c$).
$M_{H_3}$: A neural network.

How do we pick?

We use the Validation Cost, similar to early stopping. Each model is trained to its best possible state, then their scores are compared on the validation set. The model with the lowest $C_{val}(M)$ wins.

The intuition behind this process is simple. Consider hiring a chef, where three candidates are given the same ingredients, $D_{train}$, to practice with. To determine who is best, you wouldn't taste the food they practiced on. Instead, you would give them a brand-new recipe, $D_{val}$, that they have never seen before. The chef who performs best on this new recipe is the one you would hire.

Evaluation: The Golden Rule

We previously used a validation set to decide when to stop training, a technique known as Early Stopping. However, there is a hidden danger: if we keep choosing models or stopping points based only on the validation set, we are essentially learning the validation set too.

To get a truly honest score, we need to split our data into three separate sets. The training set, denoted as $D_{train}$, is used to update the parameters $\theta$. The validation set, denoted as $D_{val}$, is used for Early Stopping and Model Selection. The test set, denoted as $D_{test}$, is used once at the very end to see how the model performs on totally unseen data.

It's a common practice in machine learning. We often denote this as 60-20-20 rule. Where, 60 is used for training, 20 is used for validation, and 20 is used for testing. In the given sample, we don't interfere the test set during training or model learning. This can generalize our model to respond to unknown/new data.

The Ethics of Evaluation

Simple terms, "don't cheat". The test set is a clean room. If you look at the test results and then go back to change your model to improve them, the test set is no longer a fair measurement, it has become part of your training process. This leads to biased, over-optimistic results.

Linear Regression for Non-Linear Functions

Let's look at the simplest "machine": Linear Regression. Here, our prediction is just a weighted sum of the inputs: $$\hat{y} = f_\theta(x) = W^\top x$$

The parameters we need to "learn" are the weights $W$. To find the best $W$, we minimize the Mean Squared Error: $$\tilde{C}(W) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{2} \| y_n - W^\top x_n \|^2_2$$

To use Gradient Descent, we need the gradient (the direction of steepest increase): $$\nabla \tilde{C}(W) = -\frac{1}{N} \sum_{n=1}^{N} (y_n - W^\top x_n) x_n^\top$$

Feature Extraction

Standard linear regression can only draw straight lines, but the real world is often curvy. For example predicting air conditioner sales based on the number of days since a store opened is a good example. Sales fluctuate with the seasons, so a straight line would be useless.

To improve the model, we create a feature from the raw data. For instance, to capture how close a date is to the heat of July, we define a feature $\phi(x) = |m(x) - 7|$, where $m(x)$ is the month of the year.

By transforming $x$ into $\phi(x)$, a simple linear model can understand complex seasonal patterns. This transformation allows the model to make more accurate predictions, such as forecasting air conditioner sales.

The Problem with Manual Features

Historically, people spent years hand-crafting features like SIFT for images, a process known as Feature Extraction. This approach has two significant flaws. First, it requires a great deal of domain knowledge, meaning you need to be an expert in the field, such as meteorology or vision, to design good features. Second, human intuition about what matters is often wrong or incomplete. If we don't want to hand-design these features ourselves, we can build a machine that learns the features automatically, a task that is performed by a Neural Network.

Summary Checklist

The Three-Way Split: $D_{train}$, $D_{val}$, and $D_{test}$ ensure honesty.
The Test Set is Sacred: Never look at it until you are finished.
Linear models are limited: They only work well if you provide the right Features $\phi(x)$.
The Future: Deep Learning aims to replace manual feature extraction with automated learning.