# Machine Learning: Regressions, Regularization and Learning Curves

I'm Piyush Malhotra, a Delhilite who loves to dig Deep in the woods of Artificial Intelligence. I like to find new ways to solve not so new but interesting problems. Fitting new models to data and articulating new ways to manipulate and personify things is what I think my field is all about. When not working or playing with data, you'll find me in the gym or writing new blog posts.

Till now, we have covered:

- Machine Learning Introduction by Example: We used stock price prediction as an example for us to get into Machine Learning.
- Machine Learning from Scratch – Linear Regression: Gave us an insight on the working of the Linear Regression model we used in the first post, and finally
- Machine Learning from Scratch – Logistic Regression: We moved from Regression to classification and got a hang of how Logistic Regression can be quite useful in an important field like Health Care to predict whether a cancer is Malignant or Benign.

It looks like we have covered the basic ground of Supervised Learning methods to reach a plain where we can work with non-linear data.

*In this post, we will work our way through a regression problem but now with a dataset that is not in a simple linear form. That is, the next value in the data is not linearly dependent on the previous value. Using some intuitive ideas, we will see how manipulating datasets is a key ingredient of the Machine Learning recipes. Alongside, we will also limit, regulate and monitor the performance of our model.*

## LINEAR vs NON-LINEAR DATA

Let’s look directly at examples. Here, I’ll be displaying scatter plot of some values lying on the following 2 equations:

**y = 6x + 3 + noise**

and

**y = 6x ^{2} + 3x + 4 + noise**

where, noise is any random value in the range [-5, 5].

Which curve best describes the motivation level in your life? For me, it is the one at the right. While writing the code, I’m all the way high (not as Joe Rogan ft Elon Musk high) with motivation but when my model is not training well my motivation is all the way there at the bottom and at last when the freaking luck of mine works my motivation is all the way high up there!

What does it tell us about the real world data? It tells us that the real-world data is messy and often non-linear. Having non-linear data presents a challenge here. The challenge is that Linear Regression is good at predicting linear data and logistic regression is good at separating linearly separable data. So we need tricks, algorithms and intuitive skills to leap over this barrier.

Let’s look at one such technique here – Polynomial Regression.

## POLYNOMIAL REGRESSION

Making new features out of old ones in data by adding powers of already available features or their products and finally training on this new set of features using a linear model is what we call Polynomial Regression.

### OUR NON-LINEAR DATASET

Let’s generate some random dataset to work on for this post.

```
import numpy as np
import matplotlib.pyplot as plt
m = 100
x1 = 5 * np.random.rand(m, 1) - 2
x2 = 0.7 * x1 ** 2 - 2 * x1 + 3 + np.random.randn(m, 1)
plt.scatter(x1, x2)
plt.show()
```

First let’s apply Linear Regression and see the results:

```
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
lr = LinearRegression()
lr.fit(x1, x2)
preds = lr.predict(x1)
print('RMSE: {:.3f}\nR2: {:3f}'.format(
np.sqrt(mean_squared_error(x2, preds)), r2_score(x2, preds)))
plt.plot(x1, x2, 'b+', label='Datapoints')
plt.plot(x1, preds, 'r-', label='Linear Reg Line')
plt.legend()
plt.show()
```

RMSE: 1.494 R2: 0.653704

As we can see, the ** LinearRegression** model is severely underfitting our dataset. Looking back at the definition of

**we understand that we need to add more features, as powers of already present features, to the mix.**

*Polynomial Regression,*Till now we were predicting x_{2} using x_{1} as:

**x _{2} = b + w_{1}x_{1}**

Now what if we can add powers of x_{1} to the mix till the time we can find the best possible values:

**x _{2} = b + w_{1}x_{1} + w_{2}x_{1}^{2} + … till we get best model**

Note:In case of more features likex,_{1}xand_{2}xpredicting_{3}ywe can formulate new features for polynomial regression as product of powers of these features as well. For example,x_{1}^{3}x_{2}^{2}x_{3}

When we have 1 feature, as in our case here, we can compute new features easily. But what if we have plethora of features? Scikit-learn comes super handy here. The * preprocessing* module of scikit-learn contains one useful class for this purpose –

*. The main argument to the constructor of the class is the maximum degree that we want our features to have.*

**PolynomialFeatures***Remember degree is the sum of power of all the features used in a term, i.e., degree of*

**x**is 6._{1}^{3}x_{2}^{2}x_{3 }* PolynomialFeatures* have three main methods that are useful here:

fit the data for which you need polynomial features. This should be your train data.**fit():**After fit is called, transform your train, validation and test data using this function.**transform():**This method is a combination of above two. We basically fit and transform train data in one go.**fit_transform():**

Let’s look it in action:

```
from sklearn.preprocessing import PolynomialFeatures
poly_feats = PolynomialFeatures(degree=2)
x1_poly = poly_feats.fit_transform(x1)
lr = LinearRegression()
lr.fit(x1_poly, x2)
preds = lr.predict(x1_poly)
print('RMSE: {:.3f}\nR2: {:3f}'.format(
np.sqrt(mean_squared_error(x2, preds)), r2_score(x2, preds)))
plt.plot(x1, x2, 'b+', label='Datapoints')
plt.plot(sorted(x1[:, 0]), preds[np.argsort(x1[:, 0])], 'r', label='Polynomial Reg Curve')
plt.legend()
plt.show()
```

RMSE: 0.937 R2: 0.863764

The performance boost is evident in both the performance metrics – RMSE has decreased and R2 score has increased. The same thing is evident in the plot.

Now that we have worked out a solution, let’s look at higher degree of polynomial features. We will be looking at five possible degrees – 1, 2, 8, 16, 30.

```
def get_preds(x1, x2, degree):
'''A helper function that will compute Polynomial Features
and compute predictions
'''
poly_feats = PolynomialFeatures(degree=degree, include_bias=False)
x1_poly = poly_feats.fit_transform(x1)
lr = LinearRegression()
lr.fit(x1_poly, x2)
return lr.predict(x1_poly)
# degrees we will be working with and corresponding colors
degrees = [1, 2, 8, 16, 30]
cs = ['r', 'g', 'c', 'y', 'k']
# Politting for degrees
plt.figure(figsize=(10, 6))
plt.plot(x1, x2, 'b+', label='Datapoints')
for degree, c in zip(degrees, cs):
# Make predictions for degree
preds = get_preds(x1, x2, degree)
# Plot
plt.plot(sorted(x1[:, 0]), preds[np.argsort(x1[:, 0])], c, label='Degree: {}'.format(degree))
# Print Metrics
print('Degree: {}\tRMSE: {:.3f}'
'\tR2: {:.3f}'.format(degree, np.sqrt(mean_squared_error(x2, preds)), r2_score(x2, preds)))
plt.legend()
plt.show()
```

Degree: 1 RMSE: 1.494 R2: 0.654 Degree: 2 RMSE: 0.937 R2: 0.864 Degree: 8 RMSE: 0.905 R2: 0.873 Degree: 16 RMSE: 0.863 R2: 0.885 Degree: 30 RMSE: 0.821 R2: 0.895

As we increase model complexity (by increasing the degree of polynomial features), we first go from a bad model (degree 1), which has high bias (underfits), to good model (degree 2) and then to another bad model (degree 16 or 30), which has high variance (overfits). So, we need to make sure that the increase in complexity doesn’t hurt our performance.

One way of understanding the generalization capabilities (capability to perform good on unseen data) of our model is by working with *learning curves*.

### PIPELINE IN SCIKIT-LEARN

Since we need two steps * PolynomialFeatures* and

*to be performed in conjunction to each other, we can use sklearn’s*

**LinearRegression***module to group them into one single entity:*

**pipeline**```
from sklearn.pipeline import Pipeline
model = Pipeline([
('poly_feats', PolynomialFeatures(degree=2, include_bias=False)),
('lr', LinearRegression())
])
model.fit(x1, x2)
preds = model.predict(x1)
plt.plot(x1, x2, 'b+', label='Datapoints')
plt.plot(sorted(x1[:, 0]), preds[np.argsort(x1[:, 0])], 'r', label='Polynomial Reg Curve')
plt.legend()
plt.show()
```

## LEARNING CURVES

So what are learning curves? A plot displaying change in metrics value for train and validation sets with change in training set size. Let’s make a general function for us to do so.

```
from sklearn.model_selection import train_test_split
def plot_lcs(model, X, Y, val_split=0.2, metrics=mean_squared_error):
# train val split
x_train, x_val, y_train, y_val = train_test_split(X, Y, test_size=val_split)
# collections of errors:
train_time_loss, val_time_loss = [], []
# iterating over train size:
for train_size in range(1, len(x_train)):
# training
model.fit(x_train[:train_size], y_train[:train_size])
# train and val predictions
train_preds = model.predict(x_train[:train_size])
val_preds = model.predict(x_val)
# metrics calculation
train_time_loss.append(metrics(y_train[:train_size], train_preds))
val_time_loss.append(metrics(y_val, val_preds))
# plotting curves
plt.plot(np.sqrt(train_time_loss), 'r-*', label='Train loss')
plt.plot(np.sqrt(val_time_loss), 'b-+', label='Val loss')
plt.ylabel('RMSE')
plt.xlabel('Training Set Size')
```

Let’s try out our pipelined model from last section and see its performance using learning curves:

```
plt.figure(figsize=(10, 6))
plot_lcs(model, x1, x2)
plt.legend()
plt.show()
```

Trying out same with a model having polynomial features of degree 12:

```
model = Pipeline([
('poly_feats', PolynomialFeatures(degree=12, include_bias=False)),
('lr', LinearRegression())
])
plt.figure(figsize=(10, 6))
plot_lcs(model, x1, x2)
plt.ylim(-1, 4)
plt.legend()
plt.show()
```

Looking closely, we can see that the gap between Validation loss curve and the training loss curve has increased. Along with this, we can also see that the loss for training set is much more lower than the validation set. These two generally are the symptoms of a high variance problem. In other words, they display that the model is overfitting the training data thereby performing poorly on unseen data.

But what if there are features in higher degree polynomials that better describe our predictions than the features used in lower degree polynomials? What if there was a way to regularize the overfitting aspect of a higher degree polynomial regression? Let’s discuss the concept of regularization in the next section.

## REGULARIZATION

We just learnt how to make a model complex. But this led us to an issue where complexity made the model overfit the training data. Now we will ask the model to constraint the features which overfit the data and make itself a little simpler on its own. In other words, we will ask a model with higher degrees of freedom (like 16) to penalize the weights who are making it overfit the data.

Consider a model with only 1 feature * w_{1} *and bias

*. If we limit*

**b***to be 0, then the model will be a little simpler as it only depends on*

**b***. But what if we limit both*

**w**_{1}*and*

**b***to be small? The model will be complex when compared to model with*

**w**_{1}*but simpler compared to the model with no constraints on*

**b=0***and*

**w**_{1}*.*

**b**Even though there are many types of regularization out there, we will be looking forward to two main types:

- L1 Regularization
- L2 Regularization

### L1 REGULARIZATION

L1 regularization represents the addition of a regularized term – few parts of * l1* norm, to the loss function (mse in case of linear regression). Here, if weights are represented as w

_{0}, w

_{1}, w

_{2}and so on, where w

_{0}represents bias term, then their

*norm is given as:*

**l1**We add a few parts of this regularized term to our loss function. Hence, we multiply a regularization coefficient * α* to this term before adding it to the loss function, making our final loss function:

If * α* is close to 0, this essentially becomes a Linear Regression model while if

*is too large, then every*

**α***becomes 0 giving us a flat line.*

**w*** The special thing about this regularization term is that it basically gives us a sparse model.* But what does this mean? L1 regularization basically eliminates weights of least important terms and convert them to 0. This gives us weight vector that has only few non-zero values, hence a sparse model.

Now, let us take a step back and carefully look at the new loss function. Ummmm, what the… Our loss function is non-differentiable since the l1 norm uses absolute values! How in the world should we train our model? Do we use something other than gradient descent?

The answer to the last question is yes as well as no. We use gradient descent but we use our brain to calculate ** subgradients** of the l1 norm. Here, the

**is basically the sign function applied to the weight with which we are differentiating. How? Partial differentiation of**

*subgradient**norm (without the absolute operator) w.r.t*

**l1***will basically be 0 for every other weight*

**w**_{i}*where*

**w**_{j}*and 1 for the weight itself:*

**j ≠ i**Now that we have this, we just add the sign of the weight in front of 1. If the weight is 0, then the gradient is 0 as we have already eliminated the coefficient.

### L2 REGULARIZATION

L2 regularization represents the addition of a regularizaed term – few parts of*norm, to the loss function (mse in case of linear regression). Here, if weights are represented as w*

**l2**_{0}, w

_{1}, w

_{2}and so on, where w

_{0}represents bias term then their

*norm is given as:*

**l2**Same as L1 Regularizer, we add a few parts of this regularized term to our loss function. Hence, we multiply a regularization coefficient * α* to this term before adding it to the loss function, making our final loss function:

Remember, if * α* is close to 0, this essentially becomes a Linear Regression model while if

*is too large, then every*

**α**

**w***will be very close to*0 giving us a flat line.

The nice thing about this regularizer is that it is differentiable. So, we can use gradient descent without using some fancy stuff as subgradients.

Note:It is important to scale data before most regularization techniques as they are sensitive to scale of data.

## LASSO REGRESSION

Just for fun, Lasso stands for * Least Absolute Shrinkage and Selection Operator*.So let’s just call it Lasso Regression.

Lasso Regression is a regularized version of Linear Regression that uses L1 regularization. Scikit-Learn gives us a simple implementation of it. Let’s try it out with various values of alpha – 0, 0.7 and 5 with a model of degree 16.

```
from sklearn.linear_model import Lasso
def get_preds_lasso(x1, x2, alpha):
if alpha == 0:
# Some issues with alpha = 0 in Lasso.
# We know that alpha=0 is essentially LinearRegression
# so we will use it
model = Pipeline([
('poly_feats', PolynomialFeatures(degree=16)),
('lasso', LinearRegression())
])
else:
model = Pipeline([
('poly_feats', PolynomialFeatures(degree=16)),
('lasso', Lasso(alpha=alpha))
])
model.fit(x1, x2)
return model.predict(x1)
alphas = [0, 0.7, 5]
cs = ['r', 'g', 'b']
plt.figure(figsize=(10, 6))
plt.plot(x1, x2, 'b+', label='Datapoints')
for alpha, c in zip(alphas, cs):
preds = get_preds_lasso(x1, x2, alpha)
# Plot
plt.plot(sorted(x1[:, 0]), preds[np.argsort(x1[:, 0])], c, label='Alpha: {}'.format(alpha))
plt.legend()
plt.show()
```

So, we have to set regularization coefficient keeping in mind that it doesn’t start to underfit the model.

## RIDGE REGRESSION

Ridge Regression is a regularized version of Linear Regression that uses L2 regularization. Scikit-Learn gives us a simple implementation of it. Let’s try it out with various values of alpha – 0, 20 and 200 with a model of degree 16.

```
from sklearn.linear_model import Ridge
def get_preds_ridge(x1, x2, alpha):
model = Pipeline([
('poly_feats', PolynomialFeatures(degree=16)),
('lasso', Ridge(alpha=alpha))
])
model.fit(x1, x2)
return model.predict(x1)
alphas = [0, 20, 200]
cs = ['r', 'g', 'b']
plt.figure(figsize=(10, 6))
plt.plot(x1, x2, 'b+', label='Datapoints')
for alpha, c in zip(alphas, cs):
preds = get_preds_ridge(x1, x2, alpha)
# Plot
plt.plot(sorted(x1[:, 0]), preds[np.argsort(x1[:, 0])], c, label='Alpha: {}'.format(alpha))
plt.legend()
plt.show()
```

## ELASTIC NET

Elastic Net is the path of equanimity. Just kidding. It is a combination of Lasso Regression and Ridge Regression. Represented by:

Here * r* represents the mix ratio of the two.

So, which to use when? Well, this advice I took directly from one of the best ever practical Machine Learning book I read – “Hands-On Machine Learning with Scikit-Learn and Tensor Flow: Concepts, Tools, and Techniques to Build Intelligent Systems“. A general rule of thumb is a little regularization is always good. So, avoid plain old Linear Regression. Ridge is a good default! But only if you have intuition that few of the features are only useful go with Lasso or Elastic Net. If the number of features are high, then prefer Elastic Net as Lasso might behave erratically when the number of features are greater than number of training instances.

Let’s try out elastic net ourselves.

```
from sklearn.linear_model import ElasticNet
def get_preds_enet(x1, x2, alpha):
model = Pipeline([
('poly_feats', PolynomialFeatures(degree=16)),
('lasso', ElasticNet(alpha=alpha, l1_ratio=0.5))
])
model.fit(x1, x2)
return model.predict(x1)
alphas = [0, 0.7, 10]
cs = ['r', 'g', 'b']
plt.figure(figsize=(10, 6))
plt.plot(x1, x2, 'b+', label='Datapoints')
for alpha, c in zip(alphas, cs):
preds = get_preds_enet(x1, x2, alpha)
# Plot
plt.plot(sorted(x1[:, 0]), preds[np.argsort(x1[:, 0])], c, label='Alpha: {}'.format(alpha))
plt.legend()
plt.show()
```

This brings us to the end of the post. It looks like we learnt a lot in just one post and covered a lot of ground. Next post will be super important one. We will be covering 1 of the most important machine learning algorithm.

Want me to write on a particular topic of your liking? Comment it down below 🙂

Subscribe to our newsletter and never miss any updates from Predictive Programmer!