The basic Logistic Regression model is a supervised classification ML algorithm that works best on binary classification problems. Various hyperparameters can be modified in order to fine-tune the model and obtain the best possible results. The hyperparameters used in the script below are listed here, with a minimal usage sketch after the list:
- penalty: Different penalties can be employed to optimize model performance and trace the important features. Lasso (L1), Ridge (L2), and ElasticNet are the three types of penalties that can be used.
- solver: Logistic Regression is fit by numerical optimization, and several solvers are available. The right solver depends on the penalty: each solver is compatible only with certain penalty types, so choosing the two together is important.
- C: the regularization parameter. Technically, C is the inverse of the regularization strength, but an easier way to understand it is in terms of model complexity: smaller values of C give a simpler (more regularized) model, larger values a more complex one. Selecting a value of C that balances the two is a must.
- max_iter: the maximum number of iterations the solver is allowed in order to converge. If convergence needs more iterations than max_iter provides, the model fails to converge, the loss never reaches its minimum, and performance suffers. Providing a range of max_iter values in the search helps the model achieve better performance.
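As a rough illustration (a minimal sketch; the values here are arbitrary placeholders, not tuned choices), all four hyperparameters are passed straight to the LogisticRegression constructor:
from sklearn.linear_model import LogisticRegression
# Placeholder values only: the l1 penalty requires a compatible solver
# such as liblinear; C is the inverse of the regularization strength
clf = LogisticRegression(penalty="l1", C=0.1, solver="liblinear", max_iter=1000)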
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.exceptions import ConvergenceWarning
# Suppress convergence warnings raised while trying out solver settings
warnings.filterwarnings("ignore", category=ConvergenceWarning)
bio = pd.read_csv("healthcare_data.csv")
bio.head()
 | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 842302 | M | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | ... | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | NaN |
1 | 842517 | M | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | NaN |
2 | 84300903 | M | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | ... | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | NaN |
3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | ... | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | NaN |
4 | 84358402 | M | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | ... | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | NaN |
5 rows × 33 columns
bio.shape
(569, 33)
bio.columns
Index(["id", "diagnosis", "radius_mean", "texture_mean", "perimeter_mean",
"area_mean", "smoothness_mean", "compactness_mean", "concavity_mean",
"concave points_mean", "symmetry_mean", "fractal_dimension_mean",
"radius_se", "texture_se", "perimeter_se", "area_se", "smoothness_se",
"compactness_se", "concavity_se", "concave points_se", "symmetry_se",
"fractal_dimension_se", "radius_worst", "texture_worst",
"perimeter_worst", "area_worst", "smoothness_worst",
"compactness_worst", "concavity_worst", "concave points_worst",
"symmetry_worst", "fractal_dimension_worst", "Unnamed: 32"],
dtype="object")
bio.info()
<class "pandas.core.frame.DataFrame">
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 569 non-null int64
1 diagnosis 569 non-null object
2 radius_mean 569 non-null float64
3 texture_mean 569 non-null float64
4 perimeter_mean 569 non-null float64
5 area_mean 569 non-null float64
6 smoothness_mean 569 non-null float64
7 compactness_mean 569 non-null float64
8 concavity_mean 569 non-null float64
9 concave points_mean 569 non-null float64
10 symmetry_mean 569 non-null float64
11 fractal_dimension_mean 569 non-null float64
12 radius_se 569 non-null float64
13 texture_se 569 non-null float64
14 perimeter_se 569 non-null float64
15 area_se 569 non-null float64
16 smoothness_se 569 non-null float64
17 compactness_se 569 non-null float64
18 concavity_se 569 non-null float64
19 concave points_se 569 non-null float64
20 symmetry_se 569 non-null float64
21 fractal_dimension_se 569 non-null float64
22 radius_worst 569 non-null float64
23 texture_worst 569 non-null float64
24 perimeter_worst 569 non-null float64
25 area_worst 569 non-null float64
26 smoothness_worst 569 non-null float64
27 compactness_worst 569 non-null float64
28 concavity_worst 569 non-null float64
29 concave points_worst 569 non-null float64
30 symmetry_worst 569 non-null float64
31 fractal_dimension_worst 569 non-null float64
32 Unnamed: 32 0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
bio.diagnosis.unique()
array(["M", "B"], dtype=object)
- There are 569 patient records and 33 columns in total
- id is a unique identifier that does not contribute to prediction
- Unnamed: 32 contains no values and hence cannot contribute either
- diagnosis is the object-type target variable, with only two categories: M and B
- All remaining columns are float type and act as predictor variables
# Remove features "id" & "Unnamed: 32" since they do not help in predicting
bio = bio.drop(["id", "Unnamed: 32"], axis=1)
# Encoding target variable "diagnosis" (encoding: a preprocessing technique
# used to convert categorical variables to number codes)
bio["diagnosis"] = bio["diagnosis"].map({"M": 0, "B": 1})
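As an aside (a sketch of an alternative, not what this notebook runs): sklearn's LabelEncoder performs the same kind of conversion, but it assigns codes alphabetically, which here would give B -> 0 and M -> 1, the reverse of the mapping above.
from sklearn.preprocessing import LabelEncoder
# Alternative encoding (not run here): alphabetical codes, B -> 0, M -> 1
# bio["diagnosis"] = LabelEncoder().fit_transform(bio["diagnosis"])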
bio.head()
 | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
1 | 0 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 | 0 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
3 | 0 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
4 | 0 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 31 columns
bio.shape
(569, 31)
The dataset is now fully numerical: 30 predictor variables and 1 target variable.
X = bio.drop(["diagnosis"],axis = 1)
y = bio["diagnosis"]
X_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=32)
# Standardize the features: fit the scaler on the training set only, then
# apply the same transformation to the test set to avoid data leakage
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
x_test = ss.transform(x_test)
lr = LogisticRegression(random_state=10)
model = lr.fit(X_train, y_train)
y_pred = model.predict(x_test)
print(accuracy_score(y_test,y_pred))
0.9578947368421052
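Accuracy is the only metric tracked in this walkthrough; for a diagnosis problem it can also be worth inspecting per-class errors (a sketch, not part of the original run):
from sklearn.metrics import classification_report, confusion_matrix
# Raw error counts and per-class precision/recall (output omitted)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))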
print(model.get_params)
<bound method BaseEstimator.get_params of LogisticRegression(random_state=10)>
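Note that get_params was referenced above without parentheses, so Python printed the bound method itself; calling it returns the dictionary of hyperparameter names and values:
model.get_params()  # returns a dict of all hyperparameter names and values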
- penalty: l1, l2 and elasticnet; by default the penalty is l2
- C: the regularization parameter, depicting the complexity of the model
- solver: the optimization algorithm used to solve the fitting problem
- max_iter: the maximum number of steps allowed to reach the minimum of the loss
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)
param_grid = [
    {"penalty" : ["l1", "l2", "elasticnet"],
     "C" : np.logspace(-4, 4, 20),
     # Not every penalty/solver pair is valid: lbfgs and newton-cg support
     # only l2, liblinear supports l1 and l2, and elasticnet needs saga.
     # Invalid combinations fail during the search and are scored as NaN by default.
     "solver" : ["lbfgs", "newton-cg", "liblinear"],
     "max_iter" : [500, 1000, 1500]
    }
]
3*20*3*3  # 3 penalties × 20 C values × 3 solvers × 3 max_iter settings
540
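The same count can also be computed programmatically (a sketch using sklearn's ParameterGrid helper):
from sklearn.model_selection import ParameterGrid
len(ParameterGrid(param_grid))  # 540 combinations, matching the fit log below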
np.logspace(-4,4,20) #.0001 to 10000
array([1.00000000e-04, 2.63665090e-04, 6.95192796e-04, 1.83298071e-03,
4.83293024e-03, 1.27427499e-02, 3.35981829e-02, 8.85866790e-02,
2.33572147e-01, 6.15848211e-01, 1.62377674e+00, 4.28133240e+00,
1.12883789e+01, 2.97635144e+01, 7.84759970e+01, 2.06913808e+02,
5.45586378e+02, 1.43844989e+03, 3.79269019e+03, 1.00000000e+04])
- It is just a way to specify the range of values that C can take.
- You may also use the simpler approach shown below:
- "C" : [.0001, .001, .01, .1, 1, 10, 100, 1000, 10000]
gsc = GridSearchCV(lr, param_grid=param_grid, cv=10, verbose=True, n_jobs=-1)
tuned_model = gsc.fit(X_train,y_train)
Fitting 10 folds for each of 540 candidates, totalling 5400 fits
tuned_model.best_estimator_
LogisticRegression(C=0.615848211066026, max_iter=500, random_state=10)
print(f"Accuracy - : {tuned_model.score(x_test, y_test):.3f}")
Accuracy - : 0.963
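The winning combination can also be read off directly; its values mirror the best_estimator_ shown above, with the defaults penalty="l2" and solver="lbfgs" retained:
tuned_model.best_params_  # {"C": 0.6158..., "max_iter": 500, "penalty": "l2", "solver": "lbfgs"}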
# Re-fit with the best values found; penalty ("l2") and solver ("lbfgs")
# stay at their defaults, matching the best_estimator_ above
lr_finetuned = lr.set_params(C = 0.615848211066026,
                             max_iter = 500,
                             random_state = 10)
model_final = lr_finetuned.fit(X_train, y_train)
y_pred = model_final.predict(x_test)
print(accuracy_score(y_test,y_pred))
0.9631578947368421
model_final.get_params
<bound method BaseEstimator.get_params of LogisticRegression(C=0.615848211066026, max_iter=500, random_state=10)>
model_final.penalty
"l2"
model_final.coef_
array([[-0.48507112, -0.39769552, -0.44891167, -0.5166906 , -0.25795931,
0.22433419, -0.59320734, -0.65494822, 0.14756751, 0.20464987,
-1.03507872, -0.18521633, -0.58477086, -0.78295621, -0.05402549,
0.45576588, -0.01313436, -0.05961217, 0.47307332, 0.38539697,
-0.87864902, -0.94643093, -0.75975075, -0.82213525, -0.58373362,
-0.03479723, -0.57764982, -0.5404704 , -0.51782045, -0.29460082]])
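Because the features were standardized, the coefficient magnitudes are roughly comparable across features; a small sketch to pair them with the column names:
import pandas as pd
# Negative coefficients push a prediction toward M (encoded 0),
# positive ones toward B (encoded 1)
coef = pd.Series(model_final.coef_[0], index=X.columns)
print(coef.sort_values())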
# L1-penalty experiment: liblinear is one of the solvers that supports l1
model_l1 = LogisticRegression(penalty="l1", C=1, max_iter=10, random_state=10, solver="liblinear")
model_l1_fit = model_l1.fit(X_train,y_train)
y_pred_l1 = model_l1.predict(x_test)
accuracy_score(y_test,y_pred_l1)
0.9526315789473684
model_l1_fit.coef_
array([[ 0. , 0. , 0. , 0. , 0. ,
0. , -0.25240274, -0.8403526 , 0. , 0. ,
-2.41841316, 0. , 0. , -0.02519683, 0. ,
0.27464826, 0. , 0. , 0.27932633, 0.21351214,
-1.55360797, -1.56797121, -0.47396785, -3.07790957, -0.76127109,
0. , -0.59962018, -0.4315926 , -0.21474998, 0. ]])
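A sketch of how the zeroed-out coefficients translate into feature selection:
import numpy as np
mask = model_l1_fit.coef_[0] != 0  # True for the features the L1 model kept
print(f"{(~mask).sum()} of {mask.size} coefficients were driven to zero")
print(X.columns[mask])  # candidate features for a final model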
- The highest accuracy is obtained after fine-tuning the model.
- Base model accuracy: 95.79%; fine-tuned model accuracy: 96.32%.
- A separate experiment with the L1 penalty was done just to verify the results.
- The coefficients of the L1 model show how many feature coefficients are driven to zero.
- So L1 does help with feature selection: only the features with non-zero coefficients would be considered for a final model.