# Machine Learning from Scratch – Linear Regression

In my last post, we went over a crash course on Machine Learning and its type. We also developed a Stock Price Prediction app using Machine Learning library scikit-learn. In this post we will develop the same application but without using scikit and developing the concepts from scratch.

Often the concepts are taught but not implemented. Implementation is not only crucial for learning but also a vital player in development of insights that gives developer ways to fine tune model based on the task at hand.

This post requires extensive knowledge of NumPy.

Before going ahead – preprocessing data that we did in last post required a lot of code over-head. In this post we need to overlook that code so as to keep linear regression in focus. Thus I have made some helper functions to do that for us. Download the code from here.

We will be learning certain preprocessing techniques and their usage, advantages as well as disadvantages in future posts.

## SUPERVISED LEARNING

In last post, we discussed what we mean by supervised learning and gave you a gist of how we train a supervised learning model.

Supervised Learning can be further categorized into 2 categories:

### REGRESSION

When the labels are real valued numbers with no distinct classes. In other words our model needs to predict values which are continuous. Examples, weight of a person, price of house etc.

### CLASSIFICATION

When the labels are distinct classes or categories. In other words our model needs to predict values which are discrete. Example, positive/negative review, possible alphabets (A-Z), possible digits (0-9).

## LET’S BUILD OUR LINEAR REGRESSION MODEL

Our imports:

```from utils import *
import numpy as np
```

Class definition of Linear Regression model will look something like this.

```class LinearRegression:
def __init__(self, learning_rate, n_iter):
pass

def fit(self, X_train, Y_train):
pass

def predict(self, X):
pass

def error(self, Y_true, Y_pred):
pass

def normalize(self, X, Y=None):
pass
```

Here,

• __init__(): Constructor takes only two params – learning_rate and n_iter. I’ll be defining what learning_rate means ahead. While n_iter means how many times we iterate in training loop.
• fit(): It takes 2 values a matrix X_train and a vector Y_train, and train the model’s weights to reduce the error. X_train is a matrix containing our training data. Each row being a data point and each column being a feature of the data. Y_train is a vector of ground truth values for training data. Each value in Y_train represents the ground truth for corresponding row in X_train.
• predict(): It takes a matrix X. Model tries to predict ground truth values for X using weights trained using fit() method. Structure of X is same as that of X_train.
• error(): This function takes two values – Y_true and Y_pred. Y_true is the true ground truth values and Y_pred is the predicted values by the model. The function returns the error between the two using an error function we will define ahead.
• normalize(): This function makes sure that the incoming data have mean 0 and standard deviation 1. This speeds up the training procedure of Gradient Descent (discussed ahead), as well as makes sure that we doesn’t end up large values which are usually a least favorite thing for error function we will use ahead.

Normalization method and constructor are perhaps the easiest one to define. Let’s define them. We will assume that we already have 4 object variables with object:

• x_mean: Mean of training data. Will be collected in fit() method.
• x_stddev: Standard deviation of training data. Will be collected in fit() method.
• y_mean: Mean of training labels. Will be collected in fit() method.
• y_stddev: Standard deviation of training labels. Will be collected in fit() method.
```class LinearRegression:
def __init__(self, learning_rate, n_iter):
self.learning_rate = learning_rate
self.n_iter = n_iter

def fit(self, X_train, Y_train):
pass

def predict(self, X):
pass

def error(self, Y_true, Y_pred):
pass

def normalize(self, X, Y=None):
X = (X - self.x_mean) / self.x_stddev

if Y is None:
return X

Y = (Y - self.y_mean) / self.y_stddev
return X, Y
```

### HOW LINEAR REGRESSION MODEL WORKS

Let’s take the example of our stock price prediction app we developed in last post. We used `open`, `high`, `low` and `close` values of a stock for current day to predict `open` for the next day. This is what happened behind the scenes in a fully trained linear regression model that knows how to predict a correct value:

• The model had 4 weights (values) (w1, w2, w3, w4) each corresponding to `open`, `high`, `low` and `close`.
• Say value of:
• `open` be x1
• `high` be x2
• `low` be x3
• `close` be x4
• So the value of `open` for next day is predicted using w1x1 + w2x2 + w3x3 + w4x4

Let’s visualize this:

### VECTORIZING  PREDICTION PROCEDURE

Say you have m examples to predict on. If one example have n features then we can take all these examples in as a matrix X with m rows and n columns. Consider following matrix for our stock price prediction app as input.

If we take a vector W consisting of our four weights (w1, w2, w3, w4) of linear regression model for stock price prediction and do a dot product of matrix X with vector W we get a vector Ŷ:

Since, X is a matrix of shape m by 4 and W is a vector of shape 4 by 1. There dot product, Ŷ, will be a vector of shape m by 1 with each value of Ŷ being the prediction of the model for corresponding row in X.

We just found a procedure to effectively parallelize computation of multiple predictions!

Now that we are well into post that we can start defining important methods of our class!

NOTE: We will define function for initializing weights and bias ahead.

```class LinearRegression:
def __init__(self, learning_rate, n_iter):
self.learning_rate = learning_rate
self.n_iter = n_iter

def fit(self, X_train, Y_train):
pass

def predict(self, X):
X = self.normalize(X)
preds = np.dot(X, self.weights) + self.bias
# return unnormalized predictions
return preds * self.y_stddev + self.y_mean

def error(self, Y_true, Y_pred):
pass

def normalize(self, X, Y=None):
X = (X - self.x_mean) / self.x_stddev

if Y is None:
return X

Y = (Y - self.y_mean) / self.y_stddev
return X, Y
```

## MEAN-SQUARED ERROR

Till now we discussed how Linear Regression model predicts label given features. With every prediction made there must be some error incurred. We calculate this error by using mse or mean-squared error. It is the mean of the squared difference between actual and predicted value. That’s a mouthful!

Let’s understand it one step at a time:

• Take difference of all the predicted values and corresponding actual values. Say, Ypred is a vector of all the predictions and Yactual is a vector of all the ground truth values. Then doing an element-wise subtraction we will get a vector diff with each value being the difference between corresponding predicted values and ground truth values.
• Square all the differences
• Find the mean of all squared differences.

Let’s add MSE to our class now!!

```class LinearRegression:
def __init__(self, learning_rate, n_iter):
self.learning_rate = learning_rate
self.n_iter = n_iter

def fit(self, X_train, Y_train):
pass

def predict(self, X):
X = self.normalize(X)
preds = np.dot(X, self.weights) + self.bias
# return unnormalized predictions
return preds * self.y_stddev + self.y_mean

def error(self, Y_true, Y_pred):
diff = Y_true - Y_pred
squared_diff = diff ** 2
return np.mean(squared_diff)

def normalize(self, X, Y=None):
X = (X - self.x_mean) / self.x_stddev

if Y is None:
return X

Y = (Y - self.y_mean) / self.y_stddev
return X, Y
```

But before going ahead we will have to understand that why we used MSE!

### WHY USE MEAN-SQUARED ERROR (MSE)?

MSE has some unique properties:

• Squaring emphasize larger distances. Thus, the error increases whenever our weights or data have high values (hence normalization). This means whenever one weight or feature gets too much attention the error goes up. This is a rather unique property and will help our optimization technique (discussed in next section) to reduce these weights and give each feature small weights.
• If we take the value of error for every point for mean and not their square then their might be a chance that the negative values will cancel out effect of positive values thus resulting in a low overall loss.
• What if we take absolute values of loss then? This might be a meaningful approach and might be helpful in some cases as we will see in some of our future tutorials. But for time being we can just memorize that absolute value assigns equal weight to the spread of data whereas squaring emphasizes the extremes.
• Lastly, the most important thing for our optimization technique (discussed in next section) is that MSE is continuously differentiable and we can prove easily that MSE can converge to a minimum value of loss! This means we can have low error while predicting new values.

So how are we going to use MSE? Well as mentioned above we are going to use what we call an optimization technique. An optimization technique can be considered as any algorithm that minimizes or maximizes certain value within certain constraints using some heuristics.

In other words we use some set of intuitive rules to see in which direction of weights does the value of MSE minimizes and then we nudge the weights in that direction by a small value. Let’s discuss the backbone of many Machine Learning algorithms, the optimization technique known as Gradient Descent!!

Gradient Descent is an optimization technique that, as the name suggests, uses the gradients of the loss function w.r.t to weights to find the direction of weights in which we descent the loss curve. That’s way too complex at first read. We will break it into steps and make it easy for ourselves to understand it:

1. Our loss function vs weights curve has a direction of descent. So we know that for a particular combination of weights this loss will be minimum.
2. When we differentiate loss function w.r.t each weight separately it gives us the direction of weights in which the loss increases.
3. If we add some part (learn in small chunks using small learning rate) of negative of this direction (that is direction opposite to increase of loss) to our model’s weights, we are basically saying to move weights in the direction in which the loss decreases. Hence descend using gradients!
4. Repeat till a stopping criteria is not met!

But, what if we don’t know how to find gradient. Well I’ll recommend you to go over and learn them, they will always come handy! But for time being I’ll be giving the end result here to make your life easy.

For those who are equipped with knowledge of differentiation but are confused that how I computed the above formula, you can find my solution here.

So our final Gradient Descent algorithm will look something like this:

Where, alpha is learning rate. Learning rate has its own importance in Gradient Descent. We keep it small so that we move our weights slowly towards minimum loss position. This is to make sure that our weights doesn’t end up diverging from minima. We will discuss more about learning rate in future posts.

Let’s complete fit() function for LinearRegression class now.

```class LinearRegression:
def __init__(self, learning_rate, n_iter):
self.learning_rate = learning_rate
self.n_iter = n_iter

def initialize_weights(self, X):
# We have same number of weights as number of features
self.weights = np.random.rand(X.shape[1], 1)
# we will also add a bias to the terms that
# can be interpretted as y intercept of our model!
self.bias = np.zeros((1,))

def fit(self, X_train, Y_train):
self.initialize_weights(X_train)

# get mean and stddev for normalization
self.x_mean = X_train.mean(axis=0).T
self.x_stddev = X_train.std(axis=0).T
self.y_mean = Y_train.mean()
self.y_stddev = Y_train.std()

# normalize training data
X_train, Y_train = self.normalize(X_train, Y_train)

for i in range(self.n_iter):
# make normalized predictions
pred = np.dot(X_train, self.weights) + self.bias
# find error
diff = pred - Y_train

# compute d/dw and d/db of MSE
delta_w = np.mean(diff * X_train, axis=0, keepdims=True).T
delta_b = np.mean(diff)

# update weights and biases
self.weights = self.weights - self.learning_rate * delta_w
self.bias = self.bias - self.learning_rate * delta_b
return self

def predict(self, X):
X = self.normalize(X)
preds = np.dot(X, self.weights) + self.bias
# return unnormalized predictions
return preds * self.y_stddev + self.y_mean

def error(self, Y_true, Y_pred):
diff = Y_true - Y_pred
squared_diff = diff ** 2
return np.mean(squared_diff)

def normalize(self, X, Y=None):
X = (X - self.x_mean) / self.x_stddev

if Y is None:
return X

Y = (Y - self.y_mean) / self.y_stddev
return X, Y
```

## TRAINING AND TESTING OUR LINEAR REGRESSION CLASS

We have done a great work so far. Let’s finally train and test it on our dataset. Make a folder and name it datasets. We will save two files in this folder – the S&P dataset which is present at kaggle and the AAL’s stock data from Yahoo finance for dates 12th April 2018 to 12th May 2018 which you can gather online. The data from Yahoo finance is not available in our train data and will give us a clear idea about the performance of our model! Make sure you name your test data file – AAL.csv.

```# make and LinearRegression model object with learning rate 0.001
# and number of iterations 1000
lr = LinearRegression(1e-3, 1000)

# get training and testing data
X_train, Y_train = get_train_data()
X_test, Y_test = get_test_data()

# train our model
lr.fit(X_train, Y_train)

# predict on train data
Y_pred_train = lr.predict(X_train)

# predict on test data
Y_pred_test = lr.predict(X_test)

# Print error on both training and test data
train_error = lr.error(Y_train, Y_pred_train)
test_error = lr.error(Y_test, Y_pred_test)
print('MSE on train: {:.3} and on test: {:.3}'.format(train_error, test_error))
```

Output:

`MSE on train: 0.613 and on test: 0.608`

So, we developed, trained and tested a LinearRegression model using numpy alone and in process we learned about Mean Squared Error and Gradient Descent. But I left out somethings which will be on you to search upon. This way you don’t fixate on one resource and learn from many. The things I left out are:

1. Why we normalize? Their are in-depth explanations out their which we will cover in future posts.
2. How not using learning rate or using a large learning rate causes divergence from minima?
3. What are other possible stopping criteria other than number of iterations?

Think about it, Google it and learn it, you will be amazed by the possibilities of answers to these questions.

## SOME USEFUL MACHINE LEARNING BOOKS

### This Post Has 3 Comments

SUBSCRIBE FOR WEEKLY POST UPDATES <3