Machine Learning from Scratch – Logistic Regression

In the last post, we tackled the problem of developing Linear Regression from scratch using a powerful numerical computation library, NumPy. This means we are now well equipped to handle basic regression problems in a Supervised Learning scenario. That is, we can build a simple model that takes in a few numbers and predicts a continuous value corresponding to the input. Great!

But what about discrete values? How can we classify between apples and oranges, dogs and cats, spam and not spam, or you and Tom Cruise (just kidding, I know you are not. Well, until he really needs an AI actor 😉 )?

WHY CLASSIFICATION?

Before going ahead, it is important that your mind is prepared for why we do what we do. So, let’s discuss why we classify things and what the benefits are.

Open up your Gmail account and head over to the spam section. If you haven’t till date, you will be surprised to see the pile of bad mails Gmail saves you from! A pile of mails so meaningless that you would have a really hard time finding an important mail if they were all sitting in your inbox. You just saw an example of how classification is useful.

This is just one example. There are a million others out in the wild that give you an idea of how classification helps in daily life. This makes it important for a Machine Learning practitioner to get acquainted with some of the basic classification techniques.

Now that that’s out of the way, let’s discuss the most basic classification algorithm – Logistic Regression. Don’t blame me! I didn’t ask them to name a classification algorithm “Logistic Regression”! But whatever. \_(^_^)_/

WHAT WILL WE BUILD?

We will build a simple model that takes in some details about a breast tumor and tells us whether the cancer is Malignant (M) or Benign (B).

  • Malignant: A malignant tumor has a tendency to invade its surrounding tissues or spread around the body.
  • Benign: A benign tumor has no such tendency.

OUR DATASET

We will be working with the Breast Cancer Wisconsin (Diagnostic) Data Set. Let’s have a look at the attributes in the dataset.

  1. ID: ID number
  2. Diagnosis: Malignant (M) / Benign (B)

Other than these, there are 10 real-valued features computed from the imaged cell nuclei. The 10 features are:

  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter² / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension (“coastline approximation” - 1)

For each of these 10 features, 3 statistical values – the mean, the standard error, and the “worst” or largest (mean of the three largest values) – were computed for each image. Thus, the dataset has 32 columns – 30 for the values mentioned above and 2 for ID and Diagnosis.

LOADING AND SPLITTING DATASET

Let’s load our dataset and split it so that we can see how well our model performs on the unseen data.

Loading data:

import numpy as np
import pandas as pd

# load the dataset
df = pd.read_csv('dataset/data.csv')

# features: every column after ID and diagnosis
X = df[df.columns[2:]].values
# labels: the diagnosis column (M / B)
Y = df['diagnosis'].values

# encode labels as 1.0 for Malignant and 0.0 for Benign
Y = (Y == 'M').astype('float')

# make Y a column vector of shape (n_samples, 1)
Y = np.expand_dims(Y, -1)

Splitting Data:

def train_test_split(X, Y, split=0.2):
    # shuffle the example indices
    indices = np.random.permutation(X.shape[0])
    # number of examples that go into the test set
    split = int(split * X.shape[0])

    train_indices = indices[split:]
    test_indices = indices[:split]

    x_train, x_test = X[train_indices], X[test_indices]
    y_train, y_test = Y[train_indices], Y[test_indices]

    return x_train, y_train, x_test, y_test

x_train, y_train, x_test, y_test = train_test_split(X, Y)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

LOGISTIC REGRESSION

Logistic Regression can be considered an extension of Linear Regression. We know that Linear Regression models are continuous functions that produce real-valued outputs for their inputs. But here we need a discrete value, Malignant or Benign, for each input. So how can we build on what we learned in the previous post to create a classifier?

Ummm. We need a value representing whether the cancer is Malignant or Benign. Or we can rephrase it: we need a value representing how malignant a cancer is. Going further, we can say we need a value representing the probability of a cancer being malignant. So, if we have a 0.6 probability of a cancer being malignant, we also have a 0.4 (= 1 – 0.6) probability of the cancer being benign!

How can this be done? Fortunately, Linear Regression already gives us the score value we need. The only thing left is to squash this score down into the range [0, 1]. The function used for this purpose is the logistic function (a.k.a. the sigmoid function).

SIGMOID FUNCTION

What the heck is a sigmoid function? It is an S-shaped curve that approaches 0 at one end and 1 at the other. More precisely, it approaches 0 for large negative inputs and approaches 1 for large positive inputs. This gives us a chance to define our score in such a manner that a Malignant cancer gets a large positive score while a Benign cancer gets a large negative score!
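
For reference, written out explicitly (this is exactly what the _non_linear() method will compute later in the post):

sigmoid(x) = 1 / (1 + e^(-x))

sigmoid(x) → 0 as x → -∞,  sigmoid(x) → 1 as x → +∞,  and sigmoid(0) = 0.5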

Our main ingredients are ready. Let’s visualize what the final dish will look like before actually preparing it!

Here, we have divided our Logistic Regression model into 2 different parts:

  1. The first part calculates scores using linear operations – multiplication and addition.
  2. The second part calculates probabilities using a non-linear operation – the sigmoid function.

Now that we have understood the basic nuts and bolts of how a trained Logistic Regression model works, let’s move on to what it looks like intuitively. In other words, let’s understand how it separates one class from another.

DECISION BOUNDARY

Generally, a classifier defines a boundary between two classes. This boundary is called the Decision Boundary. The name speaks for itself. If a new point comes into the model and lies on the positive side of the decision boundary, it is assigned the positive class (higher probability of being positive); otherwise, it is assigned the negative class (lower probability of being positive).

If you, my friend, love mathematical terms, then we call these boundaries hyperplanes that separate one class from the others. In 2 dimensions such a hyperplane is simply a line, and in 3 dimensions it is a plane.

So, how do we mathematically define the decision boundary? The decision boundary is basically the hyperplane represented by the equation:
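
In the notation of our model – a weight vector w and a bias b acting on the input features x – this is the set of points where the linear score is zero:

w · x + b = 0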

This means that the decision boundary is the line where the probability of being positive is exactly 0.5. How do we know this? Because sigmoid(0) = 0.5. But how do we compute the weights w such that the probabilities come out right? From the previous post, we know that we minimized a loss function, Mean Squared Error, to compute the weights for Linear Regression. Here we need a different loss function – one that measures how badly the model estimates the likelihood of the true labels. In other words, we need a loss function that gives a higher loss when the model assigns a low probability to what actually happened. To come up with such a loss function, we need to understand the concept of Maximum Likelihood Estimation.

MAXIMUM LIKELIHOOD ESTIMATION

Wikipedia’s first line on MLE:
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model, given observations. MLE attempts to find the parameter values that maximize the likelihood function, given the observations. The resulting estimate is called a maximum likelihood estimate, which is also abbreviated as MLE.
Well, that’s all cool, but we will work with something simpler. Say we have only 4 points – 2 positive and 2 negative – and two random decision boundaries: one that classifies two points correctly and two incorrectly, and another that classifies all 4 points correctly.

Decision boundary classifies 2 points correctly and 2 incorrectly.

Decision boundary classifies all points correctly.

The likelihood of a model is calculated as the product of the predicted probabilities of being positive for all positive (blue) points, multiplied by the product of the predicted probabilities of being negative for all negative (red) points.

  • Likelihood of the true labels in case 1 = 0.8 * 0.4 * (1 – 0.6) * (1 – 0.2) = 0.1024
  • Likelihood of the true labels in case 2 = 0.6 * 0.9 * (1 – 0.15) * (1 – 0.4) = 0.2754

It is clear that case 2 has a better score than case 1. But multiplying lots of probabilities together? That’s scary!

Luckily, we have the log function on our side. We know that log(ab) = log(a) + log(b). So the log-likelihood for case 2 becomes –

log(likelihood) = log(0.6) + log(0.9) + log(1 – 0.15) + log(1 – 0.4) = -0.51 + (-0.105) + (-0.162) + (-0.51) = -1.287

Since the log of a number between 0 and 1 is negative, the log-likelihood itself is negative. We add a negative sign to turn it into a positive quantity – the negative log-likelihood – which we can then minimize.

-log(likelihood) = -(-1.287) = 1.287
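
As a quick sanity check, a few lines of NumPy reproduce these numbers (the probabilities are the ones read off the case 2 figure above):

import numpy as np

# predicted probabilities of being positive for the two positive (blue) points
p_pos = np.array([0.6, 0.9])
# predicted probabilities of being positive for the two negative (red) points
p_neg = np.array([0.15, 0.4])

likelihood = np.prod(p_pos) * np.prod(1 - p_neg)
neg_log_likelihood = -(np.sum(np.log(p_pos)) + np.sum(np.log(1 - p_neg)))

print(likelihood)           # 0.2754
print(neg_log_likelihood)   # ~1.29 (1.287 above, because the individual logs were rounded)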

This is what we call cross-entropy. Since we have put a negative sign in front of the log-likelihood, minimizing the cross-entropy is equivalent to maximizing the likelihood of the model! That’s exactly what we needed – something to use as a loss function!

Let’s write down the general formula of cross-entropy for a binary classifier:
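
loss = -(1/m) * Σ over all m examples of [ y * log(yhat) + (1 - y) * log(1 - yhat) ]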

Here,

  • m: Number of examples
  • yhat: Predicted probability of being positive
  • y: True value (1 – Positive; 0 – Negative)

So, we are computing the mean of something – that much is clear from the sum of m terms divided by m. But what exactly did we just do to the cross-entropy we discussed before?

HOW DOES BINARY CROSS-ENTROPY WORK?

Let’s look at one term out of all m terms:
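
-[ y * log(yhat) + (1 - y) * log(1 - yhat) ]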

Now, let’s consider the two possible cases: 1) the input belonged to the positive class and 2) the input belonged to the negative class.

Case 1: y = 1 (True value is Positive)

y = 1

(1 – y) = 0

The term becomes:

-1 log(yhat) – 0 log(1 – yhat) = -log(yhat)

Case 2: y = 0 (True value is Negative)

y = 0

(1 – y) = 1

The term becomes:

-0 log(yhat) – 1 log(1 – yhat) = -log(1 – yhat)

This is exactly what we were doing in the cross-entropy calculation above: we took the log of the probability of being positive for all positive points, and the log of the probability of not being positive for all negative points.

MINIMIZING BINARY CROSS-ENTROPY

We need to minimize the binary cross-entropy of our model. What better way to do so than the Gradient Descent algorithm we have already discussed? But for this to work, we need the derivative of the loss function w.r.t. the weights! For those who really want to go through the derivation, you can find my solution here; otherwise, the direct result is:
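
In our notation (m training examples, predicted probability yhat, true label y, and i-th feature x_i):

d(loss)/dw_i = (1/m) * Σ over all m examples of (yhat - y) * x_i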

Here x_i represents the i-th input feature. For each weight w_i, the gradient averages the quantity (yhat - y) * x_i over all m training examples.

LOGISTIC REGRESSION CLASS

Let’s define the basic structure of our Logistic Regression class:

class LogisticRegression:
    def __init__(self, lr=0.01, n_iter=100):
        pass

    def predict(self, X):
        pass

    def _non_linear(self, X):
        pass

    def _linear(self, X):
        pass

    def fit(self, X_train, Y_train):
        pass

    def normalize(self, X):
        pass

    def accuracy(self, X, y):
        pass

    def loss(self, X, y):
        pass

Here,

  • __init__(): The constructor takes the learning rate (lr) and the number of iterations (n_iter) as params. See the last post for details.
  • predict(): Takes input features (X) and predicts the result. It depends on two helper methods, _linear() and _non_linear().
  • _linear(): Takes input features (X) and applies the weighted sum – the first part of the prediction.
  • _non_linear(): Takes the result of _linear() and applies the sigmoid – the second part of the prediction.
  • fit(): Our gradient descent process! It takes in features (X_train) and true labels (Y_train) and fine-tunes the weights using gradient descent!
  • normalize(): It is always better to normalize inputs; the details were discussed in the last post.
  • accuracy(): Finds the accuracy of the model, which is simply the fraction of predictions that are correct.
  • loss(): Computes the binary cross-entropy.

PREDICTION FUNCTION

As discussed, our prediction method depends on the _linear() and _non_linear() methods! Just assume for now that the normalize() method is up and ready, and that the object already holds the weights and bias. Getting these easy pieces out of the way lets us focus on the more complex fit() method ahead!

  • _linear(): Takes X and computes the score just as we did in Linear Regression – a matrix multiplication with the weights plus the bias!
  • _non_linear(): Takes the scores from _linear() and applies the sigmoid function, returning the result!
  • Finally, predict(): Takes X, normalizes it, computes the linear and non-linear parts, and returns 1 where the probability is >= 0.5, else 0!

class LogisticRegression:
    def __init__(self, lr=0.01, n_iter=100):
        self.lr = lr
        self.n_iter = n_iter

    def predict(self, X):
        X = self.normalize(X)
        linear = self._linear(X)
        preds = self._non_linear(linear)
        return (preds >= 0.5).astype('int')

    def _non_linear(self, X):
        return 1 / (1 + np.exp(-X))

    def _linear(self, X):
        return np.dot(X, self.weights) + self.bias

    def fit(self, X_train, Y_train):
        pass

    def normalize(self, X):
        pass

    def accuracy(self, X, y):
        pass

    def loss(self, X, y):
        pass

HELPER FUNCTIONS – NORMALIZE, ACCURACY & LOSS

Normalization:

Let’s go over the normalization process a bit. We normalize because some features can range from 0 to 1 while others can range from 0 to 1000. This gives unfair weightage to the features with larger values and can hurt the efficiency of Gradient Descent. To overcome this, we transform every feature so that it has a mean of 0 and a standard deviation of 1. But make sure to reuse the mean and standard deviation of the training set when normalizing prediction data as well. This ensures that the prediction data is scaled the same way as the training data!
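
Here is a minimal standalone sketch of that idea (the zscore helper below is purely illustrative – the class implements the same thing inside fit() and normalize()):

import numpy as np

def zscore(x_train, x_test):
    # statistics are computed on the training set only...
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    # ...and the same statistics are applied to both sets
    return (x_train - mean) / std, (x_test - mean) / std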

Accuracy

Perhaps the easiest method of them all! Find all the correct predictions, count them, and divide by the total number of predictions made!

Loss

As discussed above, we have to find the mean of the binary cross-entropy over each and every prediction. Remember that the log of 0 is undefined and can cause issues with the calculations. So we add a tiny value of 1e-15 to the inputs of the log function.

Let’s look at the implementations:

class LogisticRegression:
    def __init__(self, lr=0.01, n_iter=100):
        self.lr = lr
        self.n_iter = n_iter

    def predict(self, X):
        X = self.normalize(X)
        linear = self._linear(X)
        preds = self._non_linear(linear)
        return (preds >= 0.5).astype('int')

    def _non_linear(self, X):
        return 1 / (1 + np.exp(-X))

    def _linear(self, X):
        return np.dot(X, self.weights) + self.bias

    def fit(self, X_train, Y_train):
        pass

    def normalize(self, X):
        X = (X - self.x_mean) / self.x_stddev
        return X

    def accuracy(self, X, y):
        preds = self.predict(X)
        return np.mean(preds == y)

    def loss(self, X, y):
        # normalize with the training-set statistics, just like predict()
        X = self.normalize(X)
        probs = self._non_linear(self._linear(X))

        # entropy when true class is positive
        pos_log = y * np.log(probs + 1e-15)
        # entropy when true class is negative
        neg_log = (1 - y) * np.log((1 - probs) + 1e-15)

        l = -np.mean(pos_log + neg_log)
        return l

GRADIENT DESCENT

Finally, let’s define the fit() method for training our model! We know that Gradient Descent, at each step, calculates the gradients of the loss function w.r.t. the weights of the model. A fraction of these gradients (scaled by the learning rate) is subtracted from the weights to move down the loss curve towards the minimum.

We know that the gradients of the loss function, binary cross-entropy, w.r.t. the weights and the bias can be calculated as follows:
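
These are the standard gradients for the sigmoid-plus-cross-entropy combination; in the code below they show up as delta_w and delta_b, where diff stands for (yhat - y):

d(loss)/dw_i = (1/m) * Σ (yhat - y) * x_i
d(loss)/db = (1/m) * Σ (yhat - y)

Each gradient descent step then updates:

w_i = w_i - lr * d(loss)/dw_i
b = b - lr * d(loss)/db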

Now, let’s move on to defining the model’s fit() method. We will also define one more method, initialize_weights(self, X), that, as the name suggests, initializes the weights for us!

class LogisticRegression:
    def __init__(self, lr=0.01, n_iter=100):
        self.lr = lr
        self.n_iter = n_iter

    def predict(self, X):
        X = self.normalize(X)
        linear = self._linear(X)
        preds = self._non_linear(linear)
        return (preds >= 0.5).astype('int')

    def _non_linear(self, X):
        return 1 / (1 + np.exp(-X))

    def _linear(self, X):
        return np.dot(X, self.weights) + self.bias

    def initialize_weights(self, X):
        # We have same number of weights as number of features
        self.weights = np.random.rand(X.shape[1], 1)
        # we will also add a bias term, which
        # can be interpreted as the y-intercept of our model!
        self.bias = np.zeros((1,))

    def fit(self, X_train, Y_train):
        self.initialize_weights(X_train)

        # get mean and stddev for normalization
        self.x_mean = X_train.mean(axis=0).T
        self.x_stddev = X_train.std(axis=0).T

        # normalize data
        X_train = self.normalize(X_train)

        # Run gradient descent for n iterations
        for i in range(self.n_iter):
            # make normalized predictions
            probs = self._non_linear(self._linear(X_train))
            diff = probs - Y_train

            # d/dw and d/db of binary cross-entropy
            delta_w = np.mean(diff * X_train, axis=0, keepdims=True).T
            delta_b = np.mean(diff)

            # update weights
            self.weights = self.weights - self.lr * delta_w
            self.bias = self.bias - self.lr * delta_b
        return self

    def normalize(self, X):
        X = (X - self.x_mean) / self.x_stddev
        return X

    def accuracy(self, X, y):
        preds = self.predict(X)
        return np.mean(preds == y)

    def loss(self, X, y):
        # normalize with the training-set statistics, just like predict()
        X = self.normalize(X)
        probs = self._non_linear(self._linear(X))

        # entropy when true class is positive
        pos_log = y * np.log(probs + 1e-15)
        # entropy when true class is negative
        neg_log = (1 - y) * np.log((1 - probs) + 1e-15)

        l = -np.mean(pos_log + neg_log)
        return l

Finally, let’s initialize our model and train it over the data we loaded and split before!

lr = LogisticRegression()
lr.fit(x_train, y_train)

Time to test what our model is capable of!

print('Accuracy on test set: {:.2f}%'.format(lr.accuracy(x_test, y_test) * 100))
print('Loss on test set: {:.2f}'.format(lr.loss(x_test, y_test)))

Output:

Accuracy on test set: 98.23%
Loss on test set: 23.84

Our model just achieved an accuracy of 98.23%. Now, let me tell you a secret! We could have made use of a predefined class in the sklearn library 😉

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

lr = LogisticRegression()
lr.fit(x_train, y_train[:, 0])
print('Accuracy on test set: {:.2f}%'.format(lr.score(x_test, y_test[:, 0]) * 100))
# log_loss expects predicted probabilities, not hard 0/1 labels
print('Loss on test set: {:.2f}'.format(log_loss(y_test[:, 0], lr.predict_proba(x_test)[:, 1])))

Even though a predefined class exists, it is always important to have an intuition about how a machine learning model truly works! This gives us a chance to innovate on old models and produce new ones, as well as to better understand the situations in which our models don’t work (believe me, you will face a lot of such situations in the future)!

SOME USEFUL MACHINE LEARNING BOOKS

  1. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
  2. Data Science from Scratch
  3. Machine Learning
