# Understanding Machine Learning: Unraveling the Mechanisms of Learning
Chapter 1: The Concept of Machine Learning
In today's world dominated by artificial intelligence, the phrase "machine learning" has gained widespread recognition, extending beyond just data science and tech circles. Whether it’s curating our playlists or aiding in medical diagnostics, machine learning is reshaping our interactions with technology. Personally, I find the term AI to be overused and often misapplied; most discussions actually pertain to machine learning rather than true artificial intelligence. So, what does "learning" signify for a machine?
This article seeks to clarify the concept of machine learning by examining it from various angles. We will begin with a straightforward definition that encapsulates the core of machine learning, then transition into more intricate explanations that illuminate the underlying mechanisms and theories of this groundbreaking technology. By the conclusion, you'll grasp what it means for a machine to "learn" and why this process is essential for the evolution of AI.
The Essence of Learning
Understanding the nature of learning is no easy task. Humans acquire knowledge through diverse experiences, gradually refining their worldview. Machines approach learning differently: they analyze data. For a computer, learning means improving its performance on a specific task by analyzing data, with the goal of performing that task well on new, unseen data once training is complete.
The primary distinction between traditional algorithms and machine learning algorithms lies in the fact that machine learning models are not explicitly programmed to perform a task. Instead, they are designed to discover the most effective solution by iteratively fine-tuning their parameters based on the data provided.
A Mathematical Perspective
From a mathematical standpoint, we can state the problem as follows: given x and y such that f(x) = y, how do we determine the best approximation of the function f? Learning f from the pairs (x, y) means incrementally adjusting a set of parameters θ so that our current estimate of f fits the available data as well as possible.
Although variations of this definition exist—sometimes algorithms receive only x without y—we'll maintain this definition for the sake of clarity.
A Basic Example: Linear Regression
Suppose we have collected data on two variables, such as housing sizes and their prices, or individuals' education levels and their incomes. Our objective is to discern the relationship between these two variables. If we assume this relationship is linear, we aim to identify the straight line that best represents the connection between them. In mathematical terms, we seek the best approximation of f(x) expressed as the equation of a straight line:

f(x) = a * x + b

Here, x might represent a house's size, while y represents its price. Our parameters θ = {a, b} are the slope and intercept of the line, and these are the values we want to learn. Our next step is to determine a method for learning them.
An Iterative Approach
As previously mentioned, the key difference compared to traditional algorithms is that we won’t provide the program with a direct way to reach the precise answer. Instead, we will present a subset of data at each stage of the learning process and iteratively refine our estimation. While this may seem abstract, it will become clearer as we define relevant concepts. The crucial point is that this is an iterative process, meaning our estimate at time i+1 should improve upon our estimate at time i.
The Loss Function
Our aim is to identify the best approximation of the function. To assert that it is indeed the best, we must establish a value that indicates how closely our estimate aligns with the actual data. Ultimately, we will seek to minimize this value, thereby ensuring our approximation is as close as possible to the data. This criterion is termed the error, denoted as J. We aim to find the parameters a and b such that J is minimized. Various functions can be employed for the error, commonly referred to as the loss or cost function. A frequently chosen option for linear regression is the mean squared error, defined as follows:
MSE = (1/n) * sum((y - ŷ)^2)
where ŷ is our current prediction of y. The objective is that when we optimize this function based on known x and y values, our predictions will accurately forecast unseen values (in this case, y when given new x).
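As a concrete sketch, the MSE takes only a couple of lines in NumPy (assuming y and y_hat are arrays of the same length):

```python
import numpy as np

def mse(y, y_hat):
    # Mean squared error: the average of the squared differences
    return np.mean((y - y_hat) ** 2)
```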
How do we ensure that we select our parameters to reduce this value over time? This inquiry leads us to a fundamental concept in machine learning known as gradient descent.
Gradient Descent
One final element is necessary before we can assemble the algorithm. We can readily compute the error from our estimate of f and the actual data, but how do we determine the direction in which to adjust the parameters θ = {a, b} so that the error decreases? This is where gradient descent becomes relevant. Gradient descent is an optimization technique that minimizes the loss function by iteratively moving in the direction of steepest descent, defined by the negative gradient.

In practice, we compute the derivative of the loss function with respect to each learnable parameter θ and subtract it, scaled by a small coefficient, from the current parameter value: θ ← θ − α * ∂J/∂θ. The coefficient α, called the learning rate, is kept relatively small to avoid overshooting the update and stepping past a local minimum of J. If the learning rate is too small, convergence is slow; if it is too high, the updates are large but unstable and may fail to converge.
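For our linear model ŷ = a * x + b with the MSE loss defined above, a quick application of the chain rule yields the two partial derivatives we need; these are exactly the expressions the code below implements:

∂J/∂a = (-2/m) * sum(x * (y - ŷ))

∂J/∂b = (-2/m) * sum(y - ŷ)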
At each step of the algorithm, we recompute the error J to assess our current estimate. If we find the value satisfactory, we conclude the process; if not, we take another step and update the parameters.
The Code
Let’s explore a short Python implementation (using NumPy) that illustrates how all these components fit together.
```python
import numpy as np

# Input data (e.g., house sizes and prices); replace the placeholders with real observations
X = np.array([...])
y = np.array([...])
m = len(X)

# Initializing parameters
a, b = 0.0, 0.0
error = float("inf")

# Setting the learning rate and the error threshold that stops the loop
alpha = 0.1
threshold = 1e-6

while error > threshold:  # a real implementation would also cap the number of iterations
    y_hat = a * X + b                               # vector of same dimension as X and y
    partial_a = (-2 / m) * np.sum(X * (y - y_hat))  # dJ/da
    partial_b = (-2 / m) * np.sum(y - y_hat)        # dJ/db
    a -= alpha * partial_a
    b -= alpha * partial_b
    error = np.mean((y - y_hat) ** 2)               # MSE
```
Learning Versus Classical Algorithms
We have described the learning process for linear regression. With a classical algorithm, we would instead frame this as a closed-form optimization problem: minimize the residual sum of squares by setting its gradient to zero and solving the resulting system of equations, known as the normal equations.

In simple cases, solving this system directly can be easier than running gradient descent. However, as the number of features grows, the closed-form route becomes increasingly expensive: it requires solving a linear system whose cost grows roughly cubically with the number of features. Gradient descent sidesteps much of that scalability challenge.
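To make the contrast concrete, here is a minimal sketch of the closed-form route via the normal equations, assuming the same NumPy arrays X and y as in the gradient descent code above:

```python
import numpy as np

# Design matrix: one column for the slope, one column of ones for the intercept
A = np.column_stack([X, np.ones_like(X)])

# Normal equations: solve (A^T A) theta = A^T y for theta = [a, b]
a, b = np.linalg.solve(A.T @ A, A.T @ y)
```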
Why do we say that one method learns while the other does not? Essentially, gradient descent learns the parameters a and b through continuous refinement, gradually arriving at values that approximate the actual data well. In contrast, the linear-algebra route simply computes the optimum in one shot, with no iterative learning process.
A Complex Example: Neural Networks
Unlike linear regression, which presumes a simple linear relationship between inputs and outputs, neural networks are designed to capture intricate, nonlinear patterns in data. This ability arises from their architecture, loosely inspired by the structure of the brain, which lets them learn and generalize from large datasets.
The Neuron
In a neural network, the fundamental component is the neuron (or perceptron). Each neuron conducts a basic computation that, when combined with numerous other neurons, empowers the network to address complex problems. A neuron consists of inputs, weights, a bias term, and an activation function, all culminating in the neuron's final output.
For instance, if we are classifying images of dogs and cats, the input might be an array of numbers corresponding to pixels in the image. For black-and-white images, each number can range from 0 to 1, depending on pixel brightness.
Each input is assigned a weight, which is learned during training. These weights signify how strongly each input influences the neuron's output. The neuron computes a weighted sum of its inputs (the pre-activation), represented mathematically as:
z = w_1 * x_1 + w_2 * x_2 + ... + w_n * x_n + b
Here, b denotes the bias term, an additional learned parameter allowing the neuron to learn patterns that do not pass through the origin.
The weighted sum z is then processed through an activation function g(z), introducing non-linearity into the model and enabling the network to learn and express complex patterns. Common activation functions include the sigmoid and tanh, but for simplicity we will use the Rectified Linear Unit (ReLU), defined as g(z) = max(0, z).
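Putting these pieces together, a single neuron amounts to a few lines of code. Here is a minimal sketch in NumPy (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def relu(z):
    # ReLU activation: keeps positive values, zeroes out negative ones
    return np.maximum(0.0, z)

def neuron(x, w, b):
    # One neuron: weighted sum of the inputs plus a bias, then the activation
    z = np.dot(w, x) + b  # z = w_1*x_1 + ... + w_n*x_n + b
    return relu(z)
```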
The Network
Transitioning from individual neurons to neural networks greatly enhances a model's ability to capture complex data patterns. How does this work in practice? Two main processes come into play: layering and stacking.
- Layering: This involves combining multiple neurons into layers, each with its own set of weights and biases.
- Stacking: Once the layers are defined, they are stacked on top of one another. The output of one layer serves as the input to the next: the first layer receives the raw input data and the last layer produces the model's output. The layers in between are called hidden layers, with each neuron's output feeding into every neuron in the subsequent layer.
The output layer can consist of a single neuron or several, depending on the task. For instance, when predicting whether an image contains a dog or a cat, the output layer may be a single neuron whose output is read as the predicted class: close to 1 for a dog, close to 0 for a cat.
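As a sketch of layering and stacking in code, here is a minimal forward pass through a stack of dense layers; representing the network as a list of (W, b) pairs is an illustrative assumption:

```python
import numpy as np

def layer_forward(x, W, b):
    # One dense layer: W holds one row of weights per neuron in the layer
    return np.maximum(0.0, W @ x + b)  # weighted sums plus biases, then ReLU

def network_forward(x, layers):
    # Stacking: each layer's output becomes the next layer's input
    for W, b in layers:
        x = layer_forward(x, W, b)
    return x
```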
Back-Propagation: Learning Mechanism
How can we apply gradient descent in such a complex model with potentially thousands of parameters? The answer lies in the back-propagation algorithm. Back-propagation computes the gradient of the loss function with respect to each weight using the chain rule, enabling the network to learn from its errors and iteratively improve its predictions.
The algorithm is divided into two phases: the forward pass and the backward pass. In the forward pass, input data flows through the network layer by layer. Each neuron computes a weighted sum of inputs, adds a bias, and applies a non-linear activation function until the final output is generated. The output is then compared to the actual target values, allowing us to calculate the loss, which quantifies the prediction error.
The backward pass runs in reverse, from the output layer back to the input. Its role is to adjust the network's weights so as to reduce the error on the example just processed: if the same input were presented again, the network would produce an output closer to the target.

Once we have the partial derivatives of the loss function with respect to each network parameter, we update the weights by moving in the direction of steepest descent: from each weight we subtract its partial derivative multiplied by a coefficient, the learning rate.
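To make the two passes concrete, here is a minimal sketch of one training step for a network with a single hidden layer (ReLU hidden activation, linear output, MSE loss). The shapes and names are illustrative assumptions, not a general-purpose implementation; real frameworks obtain the same derivatives through automatic differentiation:

```python
import numpy as np

def train_step(x, y, W1, b1, W2, b2, alpha):
    # --- Forward pass: layer by layer, up to the loss ---
    z1 = W1 @ x + b1              # hidden pre-activation
    h = np.maximum(0.0, z1)       # ReLU
    y_hat = W2 @ h + b2           # linear output
    loss = np.mean((y_hat - y) ** 2)

    # --- Backward pass: chain rule, from the output layer back to the input ---
    d_yhat = 2 * (y_hat - y) / y.size  # dL/dy_hat
    dW2 = np.outer(d_yhat, h)          # dL/dW2
    db2 = d_yhat                       # dL/db2
    dh = W2.T @ d_yhat                 # error propagated to the hidden layer
    dz1 = dh * (z1 > 0)                # ReLU derivative gates the signal
    dW1 = np.outer(dz1, x)             # dL/dW1
    db1 = dz1                          # dL/db1

    # --- Gradient descent update on every parameter (in place) ---
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return loss
```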
Conclusion
In this exploration of machine learning, we have unpacked the fundamental concepts enabling machines to learn from data. From linear regression to complex neural networks and back-propagation, we have observed how machines can iteratively enhance their performance on designated tasks.
To succinctly answer the original question: what we define as "learning" for a machine is the process of enhancing task performance over time by adjusting internal parameters to minimize errors based on provided data, leading to improved predictions or decisions. While this discussion covers essential topics, many others remain, such as the interplay between human and machine learning, hyper-parameter tuning, and issues like underfitting and overfitting. However, those explorations are reserved for future discussions.