```python
# Data
import numpy as np
from logreg import LogisticRegression
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
from matplotlib import font_manager

font_manager.fontManager.addfont("C:\Windows\Fonts\FiraSans-Regular.ttf")
plt.rcParams["font.family"] = "Fira Sans"
np.seterr(all='ignore')
np.random.seed(283)

# make the data
p_features = 3
X, y = make_blobs(
    n_samples=500, n_features=p_features - 1, centers=[(-1, -1), (1, 1)]
)

fig = plt.scatter(X[:, 0], X[:, 1], c=y)
xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")
```
Image credit: European Union, Economies of scope in the aggregation of health-related data, under CC BY 4.0.
Python source: 451-blog/logreg.py at main · doabell/451-blog
Instructions can be found at Optimization for Logistic Regression.
Comparing gradient descents
The Python source file implements three flavors of gradient descent for logistic regression; their update rules are sketched just after this list:
- Batch gradient descent, which computes the gradient on the entire dataset at every step;
- Stochastic ("minibatch") gradient descent, which computes the gradient on random subsets of the data;
- Stochastic ("minibatch") gradient descent with the momentum method.
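For reference, the three update rules can be sketched as follows, with learning rate $\alpha$, empirical risk $L$, a random minibatch $B$, and momentum coefficient $\beta$; the exact momentum formulation in logreg.py may differ from this common "heavy ball" form:

$$
\begin{aligned}
\text{batch:} \quad & w \leftarrow w - \alpha \nabla L(w) \\
\text{minibatch:} \quad & w \leftarrow w - \alpha \nabla L_{B}(w) \\
\text{momentum:} \quad & w \leftarrow w - \alpha \nabla L_{B}(w) + \beta \, \Delta w_{\text{prev}}
\end{aligned}
$$

Here $\Delta w_{\text{prev}}$ denotes the previous update step.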
Note on terminology
I searched Google for a reasonable batch_size for this blog post's "stochastic gradient descent", and was unable to find one. From this Towards Data Science blog post, it seems that "stochastic gradient descent" refers to an online method that looks at one data point at a time. A more common name for the method implemented in LR.fit_stochastic() is "minibatch" gradient descent, which strikes a balance between stochastic gradient descent (one data point at a time) and "batch" gradient descent, implemented in LR.fit() (all the data points, all at once).
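To make the distinction concrete, here is a minimal sketch of how one epoch of minibatch gradient descent could draw its batches, reusing the X and y generated above; this illustrates the general idea only and is not necessarily how LR.fit_stochastic() is implemented:

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative random generator
n, batch_size = X.shape[0], 10

# one epoch: shuffle the indices, then walk through them batch by batch
order = rng.permutation(n)
for start in range(0, n, batch_size):
    batch = order[start:start + batch_size]
    X_batch, y_batch = X[batch], y[batch]
    # compute the gradient on (X_batch, y_batch) only, then take one update step
```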
Batch gradient descent
With terminology out of the way, let us first prepare the data.
We generate data points with 2 features and 2 class labels that are not linearly separable, as seen below:
We can then fit our logistic regression with batch gradient descent as follows:
```python
LR_batch = LogisticRegression()
LR_batch.fit(X, y, alpha=0.1, max_epochs=1000)

fig, axarr = plt.subplots(1, 2)

axarr[0].scatter(X[:, 0], X[:, 1], c=y)
axarr[0].set(
    xlabel="Feature 1", ylabel="Feature 2",
    title=f"Loss = {LR_batch.loss_history[-1]}"
)

f1 = np.linspace(-3, 3, 101)

p = axarr[0].plot(
    f1,
    (LR_batch.w[2] - f1*LR_batch.w[0])/LR_batch.w[1],
    color="black"
)

axarr[1].plot(LR_batch.loss_history)
axarr[1].set(xlabel="Iteration number", ylabel="Empirical Risk")

plt.tight_layout()
```

```python
# inspect the fitted value of w
LR_batch.w
```
```
array([ 1.82323946, 1.7608397 , -0.10917852])
```
We observe a reasonable splitting line on the left, and a smoothly decreasing loss on the right.
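For reference, the "Empirical Risk" plotted here is the average logistic (cross-entropy) loss over the training set; with labels $y_i \in \{0, 1\}$ and scores $s_i = w \cdot x_i$ (with a constant 1 appended to $x_i$ for the bias term), a standard form is below, although the exact convention in logreg.py may differ slightly:

$$
L(w) = \frac{1}{n} \sum_{i=1}^{n} \Big[ -y_i \log \sigma(s_i) - (1 - y_i) \log\big(1 - \sigma(s_i)\big) \Big],
\qquad \sigma(s) = \frac{1}{1 + e^{-s}}.
$$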
Since we also implemented stochastic "minibatch" gradient descent, both with and without momentum, we can plot the three loss curves together as follows (code adapted from the blog post instructions):
```python
LR = LogisticRegression()
LR.fit_stochastic(
    X, y,
    max_epochs=1000,
    momentum=True,
    batch_size=10,
    alpha=.01
)

num_steps = len(LR.loss_history)
plt.plot(
    np.arange(num_steps) + 1, LR.loss_history,
    label="stochastic gradient (momentum)"
)

LR = LogisticRegression()
LR.fit_stochastic(
    X, y,
    max_epochs=1000,
    momentum=False,
    batch_size=10,
    alpha=.01
)

num_steps = len(LR.loss_history)
plt.plot(
    np.arange(num_steps) + 1, LR.loss_history,
    label="stochastic gradient"
)

LR = LogisticRegression()
LR.fit(X, y, alpha=.01, max_epochs=1000)

num_steps = len(LR.loss_history)
plt.plot(
    np.arange(num_steps) + 1,
    LR.loss_history,
    label="gradient"
)

plt.loglog()

legend = plt.legend()
xlab = plt.xlabel("Epoch")
ylab = plt.ylabel("Loss")
```
We can see different convergence speeds for the three types of gradient descent. Here, stochastic "minibatch" gradient descent converges the fastest, followed by "minibatch" with momentum (slightly slower), and then standard gradient descent.
Experiments
Having compared the different flavors of gradient descent, we now investigate the hyperparameters, namely:
- The learning rate $\alpha$
- The batch size
- Momentum vs. regular "minibatch" stochastic gradient descent
Data
We will generate another dataset with 10 features, visualizing the first two:
```python
# make the data
p_features = 11
X, y = make_blobs(
    n_samples=500, n_features=p_features - 1, centers=[(-1, -1), (1, 1)]
)

fig = plt.scatter(X[:, 0], X[:, 1], c=y)
xlab = plt.xlabel("Feature 1")
ylab = plt.ylabel("Feature 2")
```
Learning rate
The learning rate determines how far we move along the (negative) gradient at each update.
In code, the relevant line (line 110 of logreg.py) reads:
```python
self.w = self.w - alpha * gradient
```
As such, when the learning rate is too large, gradient descent can overshoot the minimum; when it is too small, gradient descent takes longer to reach the minimum.
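To see this concretely, here is a hypothetical toy example, separate from the logistic regression code, that applies the same update rule to minimize $f(w) = w^2$ (whose gradient is $2w$):

```python
def gradient_descent_1d(alpha, w=5.0, steps=20):
    """Minimize f(w) = w**2 with a fixed learning rate alpha."""
    for _ in range(steps):
        w = w - alpha * 2 * w  # same form as self.w = self.w - alpha * gradient
    return w

print(gradient_descent_1d(alpha=0.1))     # reasonable: w shrinks toward the minimum at 0
print(gradient_descent_1d(alpha=0.0001))  # too small: w has barely moved after 20 steps
print(gradient_descent_1d(alpha=1.1))     # too large: each step overshoots and w blows up
```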
```python
alphas = [0.05, 0.0005, 1]

for alpha in alphas:
    LR = LogisticRegression()
    LR.fit_stochastic(
        X, y,
        max_epochs=1000,
        momentum=False,
        batch_size=10,
        alpha=alpha
    )
    num_steps = len(LR.loss_history)
    plt.plot(
        np.arange(num_steps) + 1, LR.loss_history,
        label=f"alpha = {alpha}",
        alpha=0.6 if alpha == 1 else 1
    )

plt.loglog()

legend = plt.legend()
xlab = plt.xlabel("Epoch")
ylab = plt.ylabel("Loss")
```
With three different learning rates, we can see a difference in how the training loss progressed over time:
- Reasonably small, alpha = 0.05: loss decreased quickly, converging in less than 100 iterations.
- Too small, alpha = 0.0005: loss decreased more smoothly but slowly, and failed to converge in 1000 iterations.
- Too large, alpha = 1: loss jumped up and down frequently, although it managed to converge in less than 100 iterations.
This confirms that we indeed want a reasonably small learning rate: small enough to converge, but not so small that convergence becomes too slow.
Batch size
We now look at another hyperparameter, the batch size.
In our "minibatch" gradient descent, the gradient is computed on one batch at a time, so the size of these batches may affect the speed of gradient descent. For example, with 500 data points and a batch size of 10, each epoch performs 50 gradient updates instead of one.
Using our reasonable learning rate of alpha = 0.05:
```python
from time import perf_counter

batch_sizes = [2, 10, 50, 100, 200, 500]

for batch_size in batch_sizes:
    LR = LogisticRegression()
    start = perf_counter()
    LR.fit_stochastic(
        X, y,
        max_epochs=1000,
        momentum=False,
        batch_size=batch_size,
        alpha=0.05
    )
    end = perf_counter()
    num_steps = len(LR.loss_history)
    plt.plot(
        np.arange(num_steps) + 1, LR.loss_history,
        label=f"batch size = {batch_size}, {(end - start)*1000:.2f}ms",
        alpha=0.7
    )

plt.loglog()

legend = plt.legend()
xlab = plt.xlabel("Epoch")
ylab = plt.ylabel("Loss")
```
Here, using otherwise identical hyperparameters, we observe that batch size does affect convergence speed and runtime.
Although a batch size of 2 converged in the fewest epochs, the program runtime was longer because each epoch involves many more gradient updates.
A batch size of 50 converged rather quickly, and had a decently short runtime as well.
Larger batch sizes took longer to converge. At a batch size of 500, this is effectively regular (batch) gradient descent, the one callable with LR.fit().
In conclusion, a reasonable batch size makes gradient descent converge faster and improves the program runtime as well.
Momentum
```python
for momentum in [False, True]:
    LR = LogisticRegression()
    start = perf_counter()
    LR.fit_stochastic(
        X, y,
        max_epochs=100,
        momentum=momentum,
        batch_size=20,
        alpha=0.5
    )
    end = perf_counter()
    num_steps = len(LR.loss_history)
    plt.plot(
        np.arange(num_steps) + 1, LR.loss_history,
        label=f"momentum {momentum}, {(end - start)*1000:.2f}ms",
        alpha=0.7
    )

plt.loglog()

legend = plt.legend()
xlab = plt.xlabel("Epoch")
ylab = plt.ylabel("Loss")
```
With a large-enough learning rate of 0.5, we observe the power of the momentum method in gradient descent.
When using the momentum method, "minibatch" stochastic gradient descent converged in around 10 iterations. Not using the momentum method resulted in much slower convergence.
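For reference, a common way to implement momentum (the exact formulation inside LR.fit_stochastic() may differ) keeps a running "velocity" that accumulates past gradients, so consecutive steps in the same direction reinforce each other. Here is a minimal, self-contained sketch on a toy quadratic objective:

```python
import numpy as np

alpha, beta = 0.5, 0.9       # learning rate and momentum coefficient (illustrative values)
w = np.zeros(3)              # toy weight vector
velocity = np.zeros_like(w)

def toy_gradient(w):
    """Gradient of the toy objective sum((w - 0.5)**2); stands in for a minibatch gradient."""
    return 2 * (w - 0.5)

for _ in range(100):
    velocity = beta * velocity + toy_gradient(w)
    w = w - alpha * velocity  # plain minibatch SGD would instead use: w = w - alpha * gradient

print(w)  # close to the minimizer [0.5, 0.5, 0.5]
```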