```python
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from matplotlib import font_manager

font_manager.fontManager.addfont(r"C:\Windows\Fonts\FiraSans-Regular.ttf")
warnings.filterwarnings('ignore')
sns.set_theme()
sns.set(font="Fira Sans")
```
Image credits: Artwork by @allison_horst. GitHub link
Instructions can be found at Classifying Palmer Penguins.
Explore
The penguins dataset features measurements for three penguin species observed in the Palmer Archipelago, Antarctica (more information in link).
The data contains penguin size, clutch, and blood isotope measurements for foraging adults near Palmer Station, Antarctica.
```python
import pandas as pd

train_url = "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/train.csv"
train = pd.read_csv(train_url)
train
```
Lengths
We examine the first quantitative elements of the dataset - the penguins’ Culmen and Flipper lengths, and see how they relate to the penguins’ species.
```python
# wrap legend
from textwrap import fill
train_map = train.copy(deep=True)
train_map["Species"] = train_map["Species"].apply(fill, width=20)

lengthsPlot = sns.lmplot(
    train_map,
    x="Culmen Length (mm)",
    y="Flipper Length (mm)",
    hue="Species"
)
```
These are three pretty distinct clusters, though with some overlap!
As such, we could say that the penguins with the smallest culmen and flipper lengths are probably Adelie penguins.
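To put a rough number on this, we can compare the average lengths by species (a quick sketch, not part of the original analysis):

```python
# Mean culmen and flipper lengths per species; Adelie should come out
# smallest on both measurements if our reading of the plot is right.
train.groupby("Species")[["Culmen Length (mm)", "Flipper Length (mm)"]].mean()
```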
Blood isotopes
Next, we look at isotope data for these penguins:
```python
isotopesPlot = sns.relplot(
    train_map, x="Delta 15 N (o/oo)", y="Delta 13 C (o/oo)", hue="Species"
)
```
These clusters turn out to be pretty inseparable.
Does blood isotope data have anything to do with where these penguins live?
```python
isotopesIslandPlot = sns.relplot(
    train_map, x="Delta 15 N (o/oo)", y="Delta 13 C (o/oo)", hue="Island"
)
```
The answer is probably no; we see no patterns at all here.
Island
So where do these penguins live?
```python
livePlot = sns.displot(train_map, x="Island", y="Species", aspect=1.4)
```
"Island", "Sex", "Species"]).count()[["Region"]] train.groupby([
| Island | Sex | Species | Region |
|---|---|---|---|
| Biscoe | . | Gentoo penguin (Pygoscelis papua) | 1 |
| Biscoe | FEMALE | Adelie Penguin (Pygoscelis adeliae) | 19 |
| Biscoe | FEMALE | Gentoo penguin (Pygoscelis papua) | 42 |
| Biscoe | MALE | Adelie Penguin (Pygoscelis adeliae) | 16 |
| Biscoe | MALE | Gentoo penguin (Pygoscelis papua) | 54 |
| Dream | FEMALE | Adelie Penguin (Pygoscelis adeliae) | 20 |
| Dream | FEMALE | Chinstrap penguin (Pygoscelis antarctica) | 29 |
| Dream | MALE | Adelie Penguin (Pygoscelis adeliae) | 20 |
| Dream | MALE | Chinstrap penguin (Pygoscelis antarctica) | 27 |
| Torgersen | FEMALE | Adelie Penguin (Pygoscelis adeliae) | 18 |
| Torgersen | MALE | Adelie Penguin (Pygoscelis adeliae) | 19 |
We observe that Gentoo penguins only live on the Biscoe Islands, and Chinstrap penguins only on Dream Island; Adelie penguins happily (or at least hopefully happily) live on all three islands.
Also, Torgersen Island only has Adelie penguins; the other islands each host at least two species.
Looking at sex, the penguins on each island are pretty evenly split between male and female.
Note that this might be specific to our train-test split, so we need to be cautious about over-fitting - we cannot say, for example, that a penguin on Torgersen Island is 100% Adelie.
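To quantify how even the split is, a quick value count per island helps (a sketch, not from the original post):

```python
# Proportion of each sex on every island; values near 0.5 indicate an even split.
train.groupby("Island")["Sex"].value_counts(normalize=True)
```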
Model
Prepare data
We prepare `Species` as labels, and encode the other features with `pd.get_dummies()`.
Rows with an invalid `Sex` field or with `NA` fields are dropped.
Several identifying columns, like `Individual ID`, are also dropped.
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train["Species"])
```
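As a quick sanity check on the encoding (not in the original post): `LabelEncoder` sorts classes alphabetically, so the integer labels should follow that order.

```python
# Label i corresponds to the i-th entry of classes_ (alphabetical order),
# so Adelie should map to 0, Chinstrap to 1, and Gentoo to 2.
list(enumerate(le.classes_))
```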
```python
def prepare_data(df):
    # drop identifying columns that carry no predictive signal
    df = df.drop(
        ["studyName", "Sample Number", "Individual ID",
         "Date Egg", "Comments", "Region"], axis=1
    )
    df = df[df["Sex"] != "."]   # remove rows with an invalid Sex
    df = df.dropna()
    y = le.transform(df["Species"])
    df = df.drop(["Species"], axis=1)
    df = pd.get_dummies(df)     # one-hot encode the qualitative columns
    return df, y

X_train, y_train = prepare_data(train)
X_train.head()
```
| | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Island_Biscoe | Island_Dream | Island_Torgersen | Stage_Adult, 1 Egg Stage | Clutch Completion_No | Clutch Completion_Yes | Sex_FEMALE | Sex_MALE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 45.1 | 14.5 | 215.0 | 5000.0 | 7.63220 | -25.46569 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 2 | 41.4 | 18.5 | 202.0 | 3875.0 | 9.59462 | -25.42621 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 3 | 39.0 | 18.7 | 185.0 | 3650.0 | 9.22033 | -26.03442 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 4 | 50.6 | 19.4 | 193.0 | 3800.0 | 9.28153 | -24.97134 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 5 | 33.1 | 16.1 | 178.0 | 2900.0 | 9.04218 | -26.15775 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
Feature selection
We need to select 3 features for our classification model.
Since one of these needs to be qualitative, we may have to look at more than 3.
From `scikit-learn`, 1.13. Feature Selection:
Univariate statistical tests
We can apply univariate statistical tests to find the statistically “best” features.
For classification, we have three scores available:
- `chi2` is for contingency tables, or at least non-negative values. Since the column `Delta 13 C (o/oo)` contains negative values, we cannot use this score.
- `f_classif` computes the ANOVA F-value.
- `mutual_info_classif` estimates mutual information for a discrete target variable.
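Before delegating to `SelectKBest`, we can peek at the raw F-values ourselves; this is a quick sketch using `f_classif` directly.

```python
from sklearn.feature_selection import f_classif

# f_classif returns the ANOVA F-statistic and p-value for each feature.
F, p = f_classif(X_train, y_train)
pd.Series(F, index=X_train.columns).sort_values(ascending=False)
```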
We use `SelectKBest` to remove all but the best K features:
```python
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

selector = SelectKBest(f_classif, k=3)
X_new = selector.fit_transform(X_train, y_train)
# scikit-learn does not keep names ):
# https://stackoverflow.com/a/41041230
X_train.loc[:, selector.get_support()].head()
```
| | Culmen Length (mm) | Flipper Length (mm) | Body Mass (g) |
|---|---|---|---|
| 1 | 45.1 | 215.0 | 5000.0 |
| 2 | 41.4 | 202.0 | 3875.0 |
| 3 | 39.0 | 185.0 | 3650.0 |
| 4 | 50.6 | 193.0 | 3800.0 |
| 5 | 33.1 | 178.0 | 2900.0 |
We see that these three are all quantitative.
Since we also need a qualitative feature, we keep only the top 2:
```python
quant_selector = SelectKBest(f_classif, k=2)
X_new = quant_selector.fit_transform(X_train, y_train)
# scikit-learn does not keep names ):
# https://stackoverflow.com/a/41041230
X_train.loc[:, quant_selector.get_support()].head()
```
| | Culmen Length (mm) | Flipper Length (mm) |
|---|---|---|
| 1 | 45.1 | 215.0 |
| 2 | 41.4 | 202.0 |
| 3 | 39.0 | 185.0 |
| 4 | 50.6 | 193.0 |
| 5 | 33.1 | 178.0 |
And then find the best qualitative feature:
```python
qual_selector = SelectKBest(f_classif, k=6)
X_new = qual_selector.fit_transform(X_train, y_train)
# scikit-learn does not keep names ):
# https://stackoverflow.com/a/41041230
X_train.loc[:, qual_selector.get_support()].head()
```
| | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Delta 13 C (o/oo) | Island_Biscoe |
|---|---|---|---|---|---|---|
| 1 | 45.1 | 14.5 | 215.0 | 5000.0 | -25.46569 | 1 |
| 2 | 41.4 | 18.5 | 202.0 | 3875.0 | -25.42621 | 0 |
| 3 | 39.0 | 18.7 | 185.0 | 3650.0 | -26.03442 | 0 |
| 4 | 50.6 | 19.4 | 193.0 | 3800.0 | -24.97134 | 0 |
| 5 | 33.1 | 16.1 | 178.0 | 2900.0 | -26.15775 | 0 |
The best qualitative feature appears to be the island on which a penguin resides.
We include all three `Island` indicator columns, since checking only whether a penguin lives on Biscoe does not make sense.
```python
cols = ['Culmen Length (mm)', 'Flipper Length (mm)',
        'Island_Biscoe', 'Island_Dream', 'Island_Torgersen']
```
Performance on Logistic Regression
We can also recursively consider smaller and smaller sets of features, and evaluate how each subset performs with a given model.
This procedure, with cross-validation, is available as the built-in class `RFECV`.
For the cross-validation step, we use a 10-fold cross-validator on logistic regression.
```python
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rfecv = RFECV(
    estimator=LogisticRegression(),
    step=1,
    cv=KFold(10),
    scoring="accuracy",
    min_features_to_select=2
)
rfecv.fit(X_train, y_train)
X_train.loc[:, rfecv.get_support()].head()
```
| | Culmen Length (mm) | Culmen Depth (mm) | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Island_Dream | Sex_FEMALE |
|---|---|---|---|---|---|---|
| 1 | 45.1 | 14.5 | 7.63220 | -25.46569 | 0 | 1 |
| 2 | 41.4 | 18.5 | 9.59462 | -25.42621 | 0 | 0 |
| 3 | 39.0 | 18.7 | 9.22033 | -26.03442 | 1 | 0 |
| 4 | 50.6 | 19.4 | 9.28153 | -24.97134 | 1 | 0 |
| 5 | 33.1 | 16.1 | 9.04218 | -26.15775 | 1 | 1 |
`RFECV` ranks 6 features as equally important - but why?
Here are the importance scores that `RFECV` determined:
```python
d = pd.DataFrame(columns=X_train.columns)
# Use `df.loc[len(df)] = arr` – rafaelc Oct 8, 2019 at 19:43
# https://stackoverflow.com/q/58292901
d.loc["Importance"] = rfecv.ranking_
d
```
| | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Island_Biscoe | Island_Dream | Island_Torgersen | Stage_Adult, 1 Egg Stage | Clutch Completion_No | Clutch Completion_Yes | Sex_FEMALE | Sex_MALE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Importance | 1 | 1 | 5 | 9 | 1 | 1 | 2 | 1 | 4 | 8 | 6 | 7 | 1 | 3 |
And here is the mean test accuracy of the process for different numbers of features selected:
```python
n_scores = len(rfecv.cv_results_["mean_test_score"])

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Mean test accuracy")
plt.errorbar(
    range(2, n_scores + 2),
    rfecv.cv_results_["mean_test_score"],
    yerr=rfecv.cv_results_["std_test_score"],
)
plt.title("Recursive Feature Elimination with correlated features")
plt.show()
```
We see that, with 10-fold cross-validation, the mean test accuracy only reached 100% after selecting 6 features.
`RFECV` assigns rank 1 to every feature in the optimal subset, which is why it marks all 6 as “equally important”.
We therefore fall back to the columns we selected in Univariate statistical tests.
Modeling
Logistic regression
The modeling process is more straightforward. Here, we use logistic regression:
```python
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(X_train[cols], y_train)
LR.score(X_train[cols], y_train)
```
0.96875
The model performed well, although it did not reach 100% accuracy on the training set.
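To see which species account for the remaining errors, we can look at the training confusion matrix (a quick sketch, not from the original post):

```python
from sklearn.metrics import confusion_matrix

# Rows are true labels and columns are predictions,
# in the label order given by le.classes_.
confusion_matrix(y_train, LR.predict(X_train[cols]))
```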
Random forest
When we maintain an ensemble of decision trees, we can let them vote on the best category.
This method was state-of-the-art before the rise of neural networks.1
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [5, 10, 15, 20, 25, 30],
    'max_depth': [2, 5, 7, 9, 13]
}
RF = RandomForestClassifier(max_features=1)
GRF = GridSearchCV(RF, param_grid, cv=10)
GRF.fit(X_train[cols], y_train)
GRF.best_params_
```
{'max_depth': 7, 'n_estimators': 15}
Here, we use a grid search with cross-validation (`GridSearchCV`) to identify the best hyperparameters among a predefined `param_grid`: each candidate combination is cross-validated, and the best one is kept.
We can then access our best estimator with `GRF.best_estimator_`.
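For instance, we can watch the voting happen by asking each fitted tree for its prediction on a single training row (a quick sketch, not from the original post):

```python
import numpy as np

best_rf = GRF.best_estimator_
row = X_train[cols].iloc[[0]]

# Each tree predicts independently; scikit-learn's forest then averages
# the trees' class probabilities (a soft vote) to pick the final class.
votes = np.array([tree.predict(row.values)[0] for tree in best_rf.estimators_])
print(votes)                 # one prediction per tree
print(best_rf.predict(row))  # the ensemble's decision
```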
Testing
Now that we have two models, we can test their performance on the test set.
Logistic regression
First, gather the data and run logistic regression on it:
= "https://raw.githubusercontent.com/middlebury-csci-0451/CSCI-0451/main/data/palmer-penguins/test.csv"
test_url = pd.read_csv(test_url)
test
= prepare_data(test)
X_test, y_test LR.score(X_test[cols], y_test)
0.9411764705882353
This score is lower than the one on the training set, and also short of the 100% this blog post expected.
We can take a look at what happened by plotting the decision regions (code source):
```python
from matplotlib import pyplot as plt
import numpy as np
from matplotlib.patches import Patch

def plot_regions(model, X, y):
    x0 = X[X.columns[0]]
    x1 = X[X.columns[1]]
    qual_features = X.columns[2:]

    fig, axarr = plt.subplots(1, len(qual_features), figsize=(7, 3))

    # create a grid
    grid_x = np.linspace(x0.min(), x0.max(), 501)
    grid_y = np.linspace(x1.min(), x1.max(), 501)
    xx, yy = np.meshgrid(grid_x, grid_y)

    XX = xx.ravel()
    YY = yy.ravel()

    for i in range(len(qual_features)):
        XY = pd.DataFrame({
            X.columns[0]: XX,
            X.columns[1]: YY
        })

        for j in qual_features:
            XY[j] = 0

        XY[qual_features[i]] = 1

        p = model.predict(XY)
        p = p.reshape(xx.shape)

        # use contour plot to visualize the predictions
        axarr[i].contourf(xx, yy, p, cmap="jet", alpha=0.2, vmin=0, vmax=2)

        ix = X[qual_features[i]] == 1
        # plot the data
        axarr[i].scatter(x0[ix], x1[ix], c=y[ix], cmap="jet", vmin=0, vmax=2)

        axarr[i].set(
            xlabel=X.columns[0],
            ylabel=X.columns[1]
        )

    patches = []
    for color, spec in zip(["red", "green", "blue"], ["Adelie", "Chinstrap", "Gentoo"]):
        patches.append(Patch(color=color, label=spec))

    plt.legend(title="Species", handles=patches, loc="best")
    plt.tight_layout()

plot_regions(LR, X_test[cols], y_test)
```
On each island, the logistic regression appears to have found approximate lines that separate the species.
However, there is some overlap between the regions, so accuracy was less than 100%.
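Counting the misclassified test penguins confirms this (a quick sketch):

```python
# Number of test rows where the logistic regression disagrees with the label.
(LR.predict(X_test[cols]) != y_test).sum()
```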
Random forest
How did random forest do? Let us find out.
```python
GRF.best_estimator_.score(X_test[cols], y_test)
```
0.9705882352941176
This is marginally better, but still not quite 100%.
Again, here are the decision boundaries:
```python
plot_regions(GRF.best_estimator_, X_test[cols], y_test)
```
Here, the shape of the decision boundaries suggests overfitting - perhaps we should revisit `param_grid` and find better hyperparameters.
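One possible direction - an untested sketch, with a hypothetical `param_grid_reg` - is to constrain the trees more aggressively, e.g. shallower depths and a minimum leaf size, which tends to smooth decision boundaries:

```python
# A more regularized (hypothetical) search space: shallower trees and
# larger leaves trade a little training accuracy for smoother boundaries.
param_grid_reg = {
    'n_estimators': [10, 20, 30],
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [2, 5, 10],
}
GRF_reg = GridSearchCV(RandomForestClassifier(max_features=1), param_grid_reg, cv=10)
GRF_reg.fit(X_train[cols], y_train)
GRF_reg.best_estimator_.score(X_test[cols], y_test)
```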
Footnotes
Gomes, H.M., Bifet, A., Read, J. et al. Adaptive random forests for evolving data stream classification. Mach Learn 106, 1469–1495 (2017). https://doi.org/10.1007/s10994-017-5642-8↩︎