This repository explores the hyperparameters of the logistic regression algorithm and how they can be fine-tuned to obtain the best possible model for a given dataset.

camparchimedes/Finetuning-Logisitic-Regression-Model

 
 

Fine-tuning Hyperparameters of the Logistic Regression ML Algorithm

The basic logistic regression model is a supervised classification algorithm that is most naturally suited to binary classification problems. Several hyperparameters can be adjusted to fine-tune model performance. The hyperparameters used in the script below are:

  • Penalty: A regularization penalty helps control overfitting and can highlight the important features. Three penalty types are available: Lasso (L1), Ridge (L2) and ElasticNet (a mix of both).
  • Solver: Logistic regression is fitted by numerical optimization, and several optimization methods (solvers) are available. The right solver depends on the penalty, because each solver supports only certain penalty types. For example, liblinear supports L1 and L2, lbfgs and newton-cg support L2 but not L1, and ElasticNet requires saga.
  • C: C is the inverse of the regularization strength. An easier way to understand C is to relate it to model complexity: smaller values of C mean stronger regularization and a simpler model, while larger values mean weaker regularization and a more complex model. Choosing a value of C that balances the two is essential.
  • max_iter: This parameter sets the maximum number of iterations the solver may take to converge. If the solver needs more iterations than max_iter allows, it stops before reaching the minimum of the loss, which can hurt model performance. Including several max_iter values in the search helps ensure that at least one setting lets the solver converge.
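
To make the role of C concrete, the following is a minimal sketch rather than part of the original notebook. It assumes scikit-learn's bundled breast cancer dataset (load_breast_cancer) as a stand-in for the CSV used below, and fits the same model with three values of C to compare the size of the learned coefficients.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled breast cancer data stands in for healthcare_data.csv.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Default penalty (l2) and solver (lbfgs); only C changes between fits.
    clf = LogisticRegression(C=C, max_iter=5000).fit(X, y)
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} L2 norm of coefficients = {coef_norms[C]:.2f}")
```

The coefficient norm should grow as C grows, which is what "larger C means a more complex model" refers to: weaker regularization lets the coefficients move further from zero.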

Importing Dependencies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from sklearn.exceptions import ConvergenceWarning
# Silence convergence warnings raised by solvers that hit max_iter
warnings.filterwarnings("ignore", category=ConvergenceWarning)

Data Ingestion

bio = pd.read_csv("healthcare_data.csv")
bio.head()
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean ... texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst Unnamed: 32
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 NaN
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 NaN
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 NaN
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 NaN
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 NaN

5 rows × 33 columns

bio.shape
(569, 33)
bio.columns
Index(["id", "diagnosis", "radius_mean", "texture_mean", "perimeter_mean",
       "area_mean", "smoothness_mean", "compactness_mean", "concavity_mean",
       "concave points_mean", "symmetry_mean", "fractal_dimension_mean",
       "radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se",
       "compactness_se", "concavity_se", "concave points_se", "symmetry_se",
       "fractal_dimension_se", "radius_worst", "texture_worst",
       "perimeter_worst", "area_worst", "smoothness_worst",
       "compactness_worst", "concavity_worst", "concave points_worst",
       "symmetry_worst", "fractal_dimension_worst", "Unnamed: 32"],
      dtype="object")
bio.info()
<class "pandas.core.frame.DataFrame">
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             569 non-null    float64
 15  area_se                  569 non-null    float64
 16  smoothness_se            569 non-null    float64
 17  compactness_se           569 non-null    float64
 18  concavity_se             569 non-null    float64
 19  concave points_se        569 non-null    float64
 20  symmetry_se              569 non-null    float64
 21  fractal_dimension_se     569 non-null    float64
 22  radius_worst             569 non-null    float64
 23  texture_worst            569 non-null    float64
 24  perimeter_worst          569 non-null    float64
 25  area_worst               569 non-null    float64
 26  smoothness_worst         569 non-null    float64
 27  compactness_worst        569 non-null    float64
 28  concavity_worst          569 non-null    float64
 29  concave points_worst     569 non-null    float64
 30  symmetry_worst           569 non-null    float64
 31  fractal_dimension_worst  569 non-null    float64
 32  Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
bio.diagnosis.unique()
array(["M", "B"], dtype=object)

Brief description of the data

  • 569 patient records and 33 columns in total
  • id is a unique identifier that does not contribute to prediction
  • Unnamed: 32 is an empty column and hence cannot contribute
  • diagnosis is an object-type column with only two values: M and B
  • diagnosis is the target variable, so this is a binary classification problem
  • All remaining columns are of float type and act as predictor variables

Basic Preprocessing

# Remove features "id" & "Unnamed: 32" since they do not help in predicting
bio = bio.drop(["id", "Unnamed: 32"],axis=1)
# Encoding target variable "diagnosis"
bio["diagnosis"] = bio["diagnosis"].map({"M": 0,"B": 1})

# Encoding is a preprocessing technique that converts categorical values to numeric codes.
bio.head()
diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 0 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 0 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 0 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 0 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 31 columns

bio.shape
(569, 31)

The dataset now contains only numerical values: 30 predictor variables and 1 target variable.

Splitting the dataset into X and y

X = bio.drop(["diagnosis"],axis = 1)
y = bio["diagnosis"]
X_train,x_test,y_train,y_test=train_test_split(X,y,test_size=1/3,random_state=32)
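
As a quick sanity check on the split sizes, here is a small sketch. It assumes that train_test_split rounds a fractional test size up to the next whole sample, which is consistent with the accuracies reported later (for example 182/190 = 0.9579).

```python
from math import ceil

n_samples = 569
test_size = 1 / 3

# Assumption: a fractional test size is rounded up to a whole number of samples.
n_test = ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
print(f"one test sample = {100 / n_test:.2f} accuracy points")
```

With a test set this small, each correctly or incorrectly classified sample moves the accuracy by roughly half a percentage point, which is worth keeping in mind when comparing models.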

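Regularized logistic regression is sensitive to feature scale: the penalty acts on the coefficient magnitudes, and the solvers converge faster on standardized data, hence all numerical features should be scaled. The scaler is fitted on the training set only and then applied to the test set.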

ss = StandardScaler()
X_train = ss.fit_transform(X_train)
x_test = ss.transform(x_test)
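
What the scaler does can be sketched by hand for a single column. This is an illustration only: the helper names fit_scaler and transform are hypothetical, and the numbers are toy values rather than the real split. The point is that the mean and standard deviation come from the training data and are reused unchanged for the test data.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Return the mean and population standard deviation of one training column."""
    return fmean(column), pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(x - mean) / std for x in column]

# Toy values, loosely in the range of radius_mean; not the real train/test split.
train_radius = [17.99, 20.57, 19.69, 11.42]
test_radius = [20.29, 12.45]

mean, std = fit_scaler(train_radius)
train_scaled = transform(train_radius, mean, std)
test_scaled = transform(test_radius, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction; the test column generally does not, and that is expected.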

Basic model training and inferencing

Instantiation and training

lr = LogisticRegression(random_state=10)
model = lr.fit(X_train, y_train)

Inferencing

y_pred = model.predict(x_test)

Evaluation

print(accuracy_score(y_test,y_pred))
0.9578947368421052
print(model.get_params)
<bound method BaseEstimator.get_params of LogisticRegression(random_state=10)>
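
Note that get_params is a method, so printing it without parentheses shows the bound method rather than the parameter values. A minimal sketch of calling it, using only the same constructor arguments as above:

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=10)
params = clf.get_params()      # the parentheses call the method
print(type(params).__name__)   # a plain dict of hyperparameter names and values
print(params["C"], params["max_iter"], params["random_state"])
```

Looking at this dict before tuning is a convenient way to see which defaults (such as C and max_iter) the grid search will be competing against.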

Hyperparameters Finetuning

Hyperparameters used:

  • penalty: l1, l2 and elasticnet. The default penalty is l2.
  • C: inverse of the regularization strength; it controls the complexity of the model
  • solver: the optimization algorithm used to fit the model
  • max_iter: the maximum number of iterations the solver may take to converge
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)
param_grid = [    
    {"penalty" : ["l1", "l2", "elasticnet"],
    "C" : np.logspace(-4, 4, 20),
    "solver" : ["lbfgs","newton-cg","liblinear"],
    "max_iter" : [500, 1000, 1500]
    }
]

3*20*3*3   # penalties * C values * solvers * max_iter values
540
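
The size of the grid defined above (3 penalties, 20 values of C, 3 solvers, 3 max_iter values) can be cross-checked by enumeration. The sketch below also counts how many of those candidates pair a penalty with a solver that supports it; the SUPPORTED table is written out by hand from the scikit-learn 1.x documentation and is an assumption of this example, not something the notebook defines.

```python
from itertools import product

penalties = ["l1", "l2", "elasticnet"]
solvers = ["lbfgs", "newton-cg", "liblinear"]
n_C = 20                       # np.logspace(-4, 4, 20)
max_iters = [500, 1000, 1500]

# Penalties each solver accepts (hand-written from the scikit-learn 1.x docs).
SUPPORTED = {
    "lbfgs": {"l2"},
    "newton-cg": {"l2"},
    "liblinear": {"l1", "l2"},
}

pairs = list(product(penalties, solvers))
valid_pairs = [(p, s) for p, s in pairs if p in SUPPORTED[s]]

total = len(pairs) * n_C * len(max_iters)
valid = len(valid_pairs) * n_C * len(max_iters)

print(total)        # candidates in the full grid
print(valid)        # candidates whose penalty/solver pair is supported
print(total * 10)   # individual fits with cv=10
```

Only 4 of the 9 penalty/solver pairs are supported, so a large share of the fits in the full grid cannot succeed; in particular, elasticnet is never actually evaluated because none of the three solvers accepts it.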
np.logspace(-4,4,20)   #.0001 to 10000
array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03,
       4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02,
       2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00,
       1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02,
       5.45586378e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04])

Comments:

  • np.logspace is just a way to generate a range of values for C, evenly spaced on a log scale.
  • A simpler approach is to list the values explicitly, as shown below:
  • "C" : [.0001, .001, .01, .1, 1, 10, 100, 1000, 10000]
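
For readers unfamiliar with log-spaced grids, the values can be reproduced without NumPy. This is a sketch with a hypothetical helper named logspace that mirrors the call np.logspace(-4, 4, 20) used above: the exponents are evenly spaced, not the values themselves.

```python
def logspace(start, stop, num):
    """Return num values 10**e with e evenly spaced from start to stop (inclusive)."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]

c_values = logspace(-4, 4, 20)
print(c_values[0], c_values[-1])   # endpoints: 1e-4 and 1e4, up to floating-point rounding
print(round(c_values[9], 6))       # the value the grid search later selects
```

Because regularization strength matters on a multiplicative scale, a log-spaced grid covers very small and very large values of C with the same number of candidates per decade.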
import warnings
warnings.simplefilter("ignore", category=ConvergenceWarning)
gsc = GridSearchCV(lr, param_grid = param_grid, cv = 10,verbose=True, n_jobs=-1)
tuned_model = gsc.fit(X_train,y_train)
Fitting 10 folds for each of 540 candidates, totalling 5400 fits
tuned_model.best_estimator_
LogisticRegression(C=0.615848211066026, max_iter=500, random_state=10)
print (f"Accuracy - : {tuned_model.score(x_test,y_test):.3f}")
Accuracy - : 0.963
lr_finetuned = lr.set_params(C = 0.615848211066026, 
                             max_iter = 500,
                             random_state = 10)
model_final = lr_finetuned.fit(X_train, y_train)
y_pred = model_final.predict(x_test)
print(accuracy_score(y_test,y_pred))
0.9631578947368421
model_final.get_params
<bound method BaseEstimator.get_params of LogisticRegression(C=0.615848211066026, max_iter=500, random_state=10)>
model_final.penalty
"l2"
model_final.coef_
array([[-0.48507112, -0.39769552, -0.44891167, -0.5166906 , -0.25795931,
         0.22433419, -0.59320734, -0.65494822,  0.14756751,  0.20464987,
        -1.03507872, -0.18521633, -0.58477086, -0.78295621, -0.05402549,
         0.45576588, -0.01313436, -0.05961217,  0.47307332,  0.38539697,
        -0.87864902, -0.94643093, -0.75975075, -0.82213525, -0.58373362,
        -0.03479723, -0.57764982, -0.5404704 , -0.51782045, -0.29460082]])
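
The grid used above pairs every penalty with every solver, including pairs that scikit-learn does not support. One possible alternative, sketched here under assumptions, is to pass GridSearchCV a list of dicts so that each solver is only combined with penalties it accepts. The example uses the bundled load_breast_cancer data as a stand-in for the CSV and a much smaller grid of C values to keep it fast; it is not the search run in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled breast cancer data stands in for healthcare_data.csv.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scaled up front for brevity

# One dict per solver family, so only supported penalty/solver pairs are tried.
valid_grid = [
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "C": [0.1, 1, 10]},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
]

search = GridSearchCV(LogisticRegression(max_iter=1000, random_state=10),
                      param_grid=valid_grid, cv=5)
search.fit(X, y)

scores = search.cv_results_["mean_test_score"]
print(len(scores), "candidates,", int(np.isnan(scores).sum()), "failed")
print(search.best_params_)
```

To include elasticnet in such a grid, a third dict with the saga solver and an l1_ratio value would be needed, since saga is the only solver that supports that penalty.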

Experiment: Working with L1

model_l1 = LogisticRegression(penalty="l1",C=1, max_iter =10, random_state=10, solver = "liblinear" )
model_l1_fit = model_l1.fit(X_train,y_train)
y_pred_l1 = model_l1.predict(x_test)
accuracy_score(y_test,y_pred_l1)
0.9526315789473684
model_l1_fit.coef_
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        , -0.25240274, -0.8403526 ,  0.        ,  0.        ,
        -2.41841316,  0.        ,  0.        , -0.02519683,  0.        ,
         0.27464826,  0.        ,  0.        ,  0.27932633,  0.21351214,
        -1.55360797, -1.56797121, -0.47396785, -3.07790957, -0.76127109,
         0.        , -0.59962018, -0.4315926 , -0.21474998,  0.        ]])

Observations:

  • The highest accuracy is obtained after fine-tuning the model.

  • Base model accuracy: 95.79%; fine-tuned model accuracy: 96.32%. On the 190-sample test set this is a difference of one correctly classified sample (182 vs. 183).

  • A separate experiment with the L1 penalty was run to inspect its effect on the coefficients.

  • The coefficients of the L1 model show that 15 of the 30 feature coefficients are exactly zero.

  • So L1 does help in feature selection: only the features with non-zero coefficients contribute to that model's predictions.
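
The feature-selection effect can be read directly off the L1 coefficients printed above. The sketch below copies those values, pairs them with the column names in their original order (after dropping id, diagnosis and Unnamed: 32), and lists the features that survive.

```python
# Column order of the 30 predictors: 10 measurements x (mean, se, worst).
features = [f"{name}_{stat}"
            for stat in ("mean", "se", "worst")
            for name in ("radius", "texture", "perimeter", "area", "smoothness",
                         "compactness", "concavity", "concave points", "symmetry",
                         "fractal_dimension")]

# Coefficients copied from the model_l1_fit.coef_ output above.
l1_coefs = [0.0, 0.0, 0.0, 0.0, 0.0,
            0.0, -0.25240274, -0.8403526, 0.0, 0.0,
            -2.41841316, 0.0, 0.0, -0.02519683, 0.0,
            0.27464826, 0.0, 0.0, 0.27932633, 0.21351214,
            -1.55360797, -1.56797121, -0.47396785, -3.07790957, -0.76127109,
            0.0, -0.59962018, -0.4315926, -0.21474998, 0.0]

selected = {f: c for f, c in zip(features, l1_coefs) if c != 0.0}
print(f"{len(selected)} of {len(l1_coefs)} features kept")

# Show the five features with the largest coefficient magnitude.
top = sorted(selected.items(), key=lambda kv: abs(kv[1]), reverse=True)[:5]
for name, coef in top:
    print(f"{name:25s} {coef:+.3f}")
```

Since the target was encoded as M = 0 and B = 1, a negative coefficient means that larger values of that feature push the prediction toward malignant.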
