Logistic Regression in Machine Learning

In this article, I am going to discuss Logistic Regression in Machine Learning with Examples. Please read our previous article where we discussed Linear Regression in Machine Learning with Examples.

Logistic Regression in Machine Learning

Logistic regression is a supervised learning technique for classification that estimates the probability of a target class. In its basic form, the target (dependent) variable is dichotomous: it can take only two values, coded as 1 (success/yes) or 0 (failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms and can be applied to a variety of classification problems such as spam detection, diabetes prediction, and cancer diagnosis.
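The mapping from X to P(Y=1) can be sketched in a few lines. This is a minimal illustration rather than the model trained later in this article: the weights `w` and bias `b` are made-up values, chosen only to show how the logistic (sigmoid) function turns a linear score into a probability.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """P(Y=1 | x) for one sample: sigmoid of the linear score w.x + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical coefficients, for illustration only
w = [0.8, -0.5]
b = 0.1

p = predict_proba([2.0, 1.0], w, b)
label = 1 if p >= 0.5 else 0   # 0.5 is the usual default decision threshold
print(round(p, 3), label)
```

A score of zero maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class when the threshold is 0.5.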

In general, logistic regression refers to binary logistic regression with a binary target variable, but it can also handle two other kinds of target variable. Based on the number and nature of the categories, logistic regression is classified into the following types:

Binomial or binary: The dependent variable has only two possible values, 1 or 0. These may represent, for example, success or failure, yes or no, win or loss.
Multinomial: The dependent variable has three or more unordered categories with no quantitative significance, for example “Type A,” “Type B,” or “Type C.”
Ordinal: The dependent variable has three or more ordered categories with quantitative significance. For example, the categories “poor,” “good,” “very good,” and “excellent” can be scored from 0 to 3.
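As a rough sketch of the multinomial case, the snippet below fits scikit-learn's `LogisticRegression` on synthetic three-class data. The data comes from `make_blobs` and has nothing to do with the loan dataset used later; the ordinal case is not covered here.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: 300 points in 3 clusters (illustration only)
X_multi, y_multi = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_multi, y_multi)

# predict_proba returns one probability per class for each sample
proba = clf.predict_proba(X_multi[:1])
print(clf.classes_)
print(proba.round(3), proba.sum())
```

With more than two classes, the model returns one probability per class and the probabilities for a sample sum to 1.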

Assumptions for Logistic Regression in Machine Learning

Before implementing logistic regression, we should be aware of the following assumptions:
In binary logistic regression, the target variable must be binary, and the outcome of interest is represented by factor level 1.
The model should be free of multicollinearity, that is, the independent variables should not be highly correlated with one another.
The model should include the relevant variables.
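One common way to check the multicollinearity assumption is the variance inflation factor (VIF). The sketch below computes it as the diagonal of the inverse correlation matrix on synthetic columns; the helper name `vif` and the data are assumptions made for illustration, and a frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign.

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly a copy of column a

X_demo = np.column_stack([a, b, c])
vif_values = vif(X_demo)
print(vif_values.round(1))
```

Columns `a` and `c` get a large VIF because each is almost fully explained by the other, while the independent column `b` stays close to 1.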

Case Study in Banking Domain

Introduction

According to the most recent Quarterly Report on Household Debt and Credit from the Federal Reserve Bank of New York's Center for Microeconomic Data (CMD), overall household debt rose by $92 billion in the third quarter of 2019 to $13.95 trillion. It was the twenty-first consecutive quarterly increase, and the total is now $1.3 trillion above the previous peak of $12.68 trillion in the third quarter of 2008. In the third quarter, non-housing balances rose by $64 billion, with increases across the board, including $18 billion in auto loans, $13 billion in credit card balances, and $20 billion in student loans.

Problem Statement

IndNatBank is a peer-to-peer lending company that lends to potential customers across India. Its earnings depend on the risk of the loans it offers to borrowers. The company wants to estimate the risk of lending to new customers based on historical data, which will also help it personalize the user experience during the loan application.

The company has skilled staff who apply sophisticated rules to serve its clients. However, as the volume of data grows, traditional methods of risk assessment may hurt the business. The company wants to automate the process so that a machine can discover patterns in the data and provide a better client experience.

Importing Libraries

# For Panel Data Analysis
import pandas as pd
#from pandas_profiling import ProfileReport
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('mode.chained_assignment', None)

# For Numerical Python
import numpy as np

# For Random seed values
from random import randint

# For Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# For Scientific Computation
from scipy import stats

# For Preprocessing & Scaling
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler, StandardScaler

# For Feature Selection
from sklearn.feature_selection import SelectFromModel

# For Data Modeling and Evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost.sklearn import XGBClassifier

# For Machine Learning Model Evaluation
from sklearn.metrics import classification_report
from yellowbrick.classifier import PrecisionRecallCurve
from xgboost import to_graphviz, plot_importance

# To handle class imbalance problem
from imblearn.over_sampling import SMOTE

# To Disable Warnings
import warnings
warnings.filterwarnings(action = "ignore", message = '')


data = pd.read_csv('LoanDefault.csv')
print('Data Shape:', data.shape)
data.head()

Data Description

data.describe()

data.info()

Data Pre-Processing

# Correcting types of features
data['date_issued'] = pd.to_datetime(data['date_issued'])
data['date_final'] = pd.to_datetime(data['date_final'], format = '%d%m%Y')
data['is_default'] = data['is_default'].astype(bool)

# Dropping cust_id as it is unique
data2 = data.copy()
data2 = data2.drop('cust_id', axis = 1)

EDA

Question 1: What is the proportion of customers who are defaulters and who are not?

# is_default is boolean, so summing counts the True values
print('Customers who did not default:', (~data['is_default']).sum())
print('Customers who defaulted:', data['is_default'].sum())

space = np.ones(2)/10
data['is_default'].value_counts().plot(kind = 'pie', explode = space, fontsize = 14, autopct = '%3.1f%%', wedgeprops = dict(width=0.15),
                                      shadow = True, startangle = 160, figsize = [13.66, 7.68], legend = True)
plt.legend(['Not Default', 'Default'])
plt.ylabel('Category')
plt.title('Proportion of default customers', size = 14)
plt.show()
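The same proportions can be read off numerically. The sketch below uses a made-up boolean series as a stand-in for `data['is_default']` and also derives the accuracy of a trivial model that always predicts the majority class, which is a useful reference point when the classes are imbalanced.

```python
import pandas as pd

# Toy stand-in for data['is_default']; the values are made up
is_default = pd.Series([False, False, False, True, False, True, False, False])

n_default = int(is_default.sum())      # True counts as 1
default_rate = is_default.mean()       # share of defaulters
# Accuracy obtained by always predicting the majority class
baseline_accuracy = max(default_rate, 1 - default_rate)

print(n_default, len(is_default) - n_default)
print(round(default_rate, 3), round(baseline_accuracy, 3))
```

Any classifier trained on the real data should be judged against this baseline rather than against 50%.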

Question 2: How does the number of loan defaults change with respect to date_final?

figure = plt.figure(figsize = [15, 8])

# Count defaults per year and sort chronologically so the line follows time
data[data['is_default'] == 1]['date_final'].dt.year.value_counts().sort_index().plot(kind = 'line')

plt.xlabel('Year', size = 14)
plt.ylabel('Frequency', size = 14)
plt.title('Number of loan defaults per year', size = 16)
plt.show()

Question 3: What are the frequency and proportion of each ownership type among the loans?

print(data['own_type'].value_counts())

# Bar Plot
colors_list = ['lightcoral', 'lightgreen', 'mediumturquoise']
figure = plt.figure(figsize = [15, 8])
plt.subplot(1, 2, 1)
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x = data['own_type'].value_counts().index, y = data['own_type'].value_counts(), palette = colors_list)
plt.yticks(range(0, 500000, 20000))
plt.xlabel('Ownership Type')
plt.ylabel('Frequency')
plt.title('Frequency of occurrence of Ownership Type', y=1.05, size = 14)

explode_list = [0, 0 , 0.2]

plt.subplot(1, 2, 2)
# Pie Plot
data['own_type'].value_counts().plot(kind = 'pie', figsize = [10, 5], autopct = '%1.1f%%', startangle = 90,
                                    shadow = True, labels = None, pctdistance = 1.12, colors = colors_list,
                                    explode = explode_list)
plt.title('Proportion of each Ownership Type w.r.t loan', y = 1.05, size = 14)
plt.ylabel('')
plt.axis('equal')
plt.legend(labels = data['own_type'].value_counts().index, loc = 'upper left', frameon = False)
plt.tight_layout(pad=2.0)
plt.show()

data.drop(labels = ['cust_id', 'date_issued', 'date_final', 'State'], axis = 1, inplace = True)

Label Encoding

ordered_labels = ['year', 'income_type', 'app_type', 'interest_payments', 'grade', 'loan_duration']
encode = LabelEncoder()

# Encode only the columns that are not already numeric
for i in ordered_labels:
    if not pd.api.types.is_numeric_dtype(data[i]):
        data[i] = encode.fit_transform(data[i])

data = pd.get_dummies(data = data, columns = ['own_type', 'loan_purpose'])
print('Data Shape:', data.shape)
data.head()
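One point to keep in mind with `LabelEncoder` is that it numbers the classes in sorted (alphabetical) order, which may not match the intended ranking of an ordered feature. The sketch below contrasts it with an explicit mapping; the category values are hypothetical and the real categories in LoanDefault.csv may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values; the real categories in the dataset may differ
income = pd.Series(['Low', 'High', 'Medium', 'Low'])

# LabelEncoder sorts the classes alphabetically: High=0, Low=1, Medium=2
alphabetical = LabelEncoder().fit_transform(income)

# An explicit mapping keeps the intended order: Low < Medium < High
order = {'Low': 0, 'Medium': 1, 'High': 2}
ordered = income.map(order)

print(list(alphabetical))
print(list(ordered))
```

For a feature whose alphabetical order already matches its ranking, both approaches give the same codes.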

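# Separate the features from the target column
X = data.drop('is_default', axis = 1)
y = data['is_default']

std_scale = StandardScaler()
scale_fit = std_scale.fit_transform(X)

X_data = pd.DataFrame(scale_fit, columns = X.columns)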

X_train, X_test, y_train, y_test = train_test_split(X_data, y, test_size = 0.2, random_state = 42, stratify = y)
print('Train Shape:', X_train.shape, y_train.shape)
print('Test Shape:', X_test.shape, y_test.shape)
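The code above fits the scaler on all rows before splitting. A common alternative, sketched below on synthetic data, is to split first and wrap the scaler and the model in a scikit-learn pipeline so that the scaling statistics come from the training rows only. The `_demo` names and the data from `make_classification` are assumptions for illustration, not part of the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features and target
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# The scaler is fitted on the training rows only, then reused on the test rows
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(round(test_accuracy, 3))
```

This keeps information from the test rows out of the preprocessing step, so the test score is a cleaner estimate of performance on unseen data.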

Model Building

log  = LogisticRegression(random_state = 42)
log.fit(X_train, y_train)
y_pred = log.predict(X_test)

# Accuracy Estimation
print('Accuracy Score (Train Data):', np.round(log.score(X_train, y_train), decimals = 3))
print('Accuracy Score (Test Data):', np.round(log.score(X_test, y_test), decimals = 3))

# Classification Report
logistic_report = classification_report(y_test, y_pred)
print(logistic_report)
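To make the numbers in the classification report easier to read, the sketch below computes precision, recall, and F1 for the positive class by hand. The helper `binary_report` and the two label lists are made up for illustration and are not the model's predictions.

```python
def binary_report(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels: 1 = default, 0 = no default
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall, f1 = binary_report(y_true_demo, y_pred_demo)
print(precision, recall, f1)
```

Precision answers "of the customers flagged as defaulters, how many actually defaulted?", while recall answers "of the actual defaulters, how many were flagged?".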

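# Precision Recall Curve
figure = plt.figure(figsize = [10, 8])
pr_curve = PrecisionRecallCurve(log)
pr_curve.fit(X_train, y_train)      # fit the visualizer on the training data
pr_curve.score(X_test, y_test)      # compute the curve on the test data
pr_curve.show()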
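The quantities behind a precision-recall curve can also be inspected numerically with scikit-learn's `precision_recall_curve`. The sketch below uses four made-up scores instead of the model's predicted probabilities; with a fitted model, the scores would typically come from `predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up true labels and predicted probabilities of the positive class
y_true_pr = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pr_precision, pr_recall, thresholds = precision_recall_curve(y_true_pr, scores)

# Each threshold yields one (precision, recall) pair; a final (1, 0) point is appended
for t, p, r in zip(thresholds, pr_precision, pr_recall):
    print(round(float(t), 2), round(float(p), 3), round(float(r), 3))
```

Raising the threshold generally trades recall for precision, which is the trade-off the curve visualizes.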

In the next article, I am going to discuss Univariate, Bivariate, and Multicollinearity Analysis in Machine Learning with Examples. In this article, I tried to explain Logistic Regression in Machine Learning with Examples. I hope you enjoyed this article.

Dot Net Tutorials

About the Author: Pranaya Rout

Pranaya Rout has published more than 3,000 articles in his 11-year career. He has very good experience with Microsoft technologies, including C#, VB, ASP.NET MVC, ASP.NET Web API, EF, EF Core, ADO.NET, LINQ, SQL Server, MySQL, Oracle, ASP.NET Core, Cloud Computing, Microservices, and Design Patterns, and is still learning new technologies.