CAD Detection

Introduction

Coronary artery disease (CAD) is one of the most lethal diseases in the United States. To build an accurate model and identify the most important risk factors, a classification task involving 5 different methods (logistic regression, LDA, QDA, decision tree, SVM) is applied to the Z-Alizadeh Sani Dataset.

Data Set Information

The Z-Alizadeh Sani Dataset (Z-Ali Dataset) was collected by Dr. Zahra Alizadeh Sani, Associate Professor of Cardiology at Iran University, in November 2017. The dataset has 303 observations and 56 variables and is stored in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php

Variable Selection

Generally, a more complex model has higher variance. According to the correlation matrix, many of the 55 predictors provide little or redundant information.

Correlation matrix for the Z-Ali dataset (top 20 with "Cath")

To find the valuable predictors, 200 random forests are fitted and the top 8 predictors of each forest are recorded.

Variable	Frequency
Typical.Ches.Pain	200
Age	200
Atypical	200
EF.TTE	200
Region.RWMA	179
TG	106
FBS	104
HTN	100
BMI	88
Nonanginal	69
ESR	56
Tinversation	50
BP	48
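
The tallying step behind this frequency table can be sketched in a few lines. This is a minimal Python illustration, not the repository's own code: the helper names `top_k` and `tally_top_k` and the toy importance scores are hypothetical, and in practice each dictionary would hold the variable importances from one fitted random forest.

```python
from collections import Counter


def top_k(importances, k):
    """Return the k variable names with the largest importance scores."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


def tally_top_k(runs, k):
    """Count how often each variable lands in the top k across all runs."""
    counts = Counter()
    for importances in runs:
        counts.update(top_k(importances, k))
    return counts


# Toy importance scores for three hypothetical runs (not real results).
runs = [
    {"Age": 0.30, "EF.TTE": 0.25, "TG": 0.10, "BMI": 0.05},
    {"Age": 0.28, "EF.TTE": 0.22, "TG": 0.04, "BMI": 0.09},
    {"Age": 0.31, "EF.TTE": 0.20, "TG": 0.12, "BMI": 0.06},
]

frequency = tally_top_k(runs, k=3)
for name, count in frequency.most_common():
    print(name, count)
```

With 200 runs and k = 8, the counts must add up to 200 × 8 = 1600, which is exactly the sum of the Frequency column above.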

In the end, the top 8 predictors in the frequency table prove valuable for CAD detection. These 8 variables are used to build the reduced models and to obtain the 10-fold cross-validation error:

Performance with reduced model and full model

Classification Method	10-fold CV Error (Full)	10-fold CV Error (Reduced)
Logistic Regression	15.86%	13.12%
LDA	14.81%	14.46%
QDA	N/A	15.75%
Decision Tree	22.46%	12.21% (Random Forest)
SVM (Radial Kernel)	13.23%	13.24%
SVM (Linear Kernel)	16.47%	13.24%
Average	16.57%	13.67%
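
For readers unfamiliar with how a 10-fold CV error is obtained, the procedure can be sketched as below. This is a hedged Python illustration under stated assumptions, not the code used for the table: `k_fold_indices`, `cv_error`, `fit_majority` and `predict_majority` are hypothetical names, the data are synthetic, and the majority-class rule is only a placeholder for a real classifier.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_error(data, fit, predict, k=10, seed=0):
    """Mean misclassification rate over k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        # Train on everything except the held-out fold.
        train = [row for i, row in enumerate(data) if i not in held]
        model = fit(train)
        wrong = sum(predict(model, data[i][0]) != data[i][1] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k


def fit_majority(train):
    """Placeholder 'model': the most common label in the training rows."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)


def predict_majority(model, x):
    """Always predict the stored majority label, ignoring the features."""
    return model


# Synthetic rows of (feature, label): 20 rows labelled 1 and 10 labelled 0.
data = [(i, 1 if i % 3 else 0) for i in range(30)]
error = cv_error(data, fit_majority, predict_majority, k=10)
print(f"10-fold CV error: {error:.2%}")
```

Swapping `fit_majority` and `predict_majority` for a real fit/predict pair gives the error of that classifier; each observation is used for testing exactly once.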

Conclusions

10-fold cross-validation was performed for both the full and the reduced models. From the results we can conclude that using the reduced model lowers the average misclassification rate by about 3 percentage points (from 16.57% to 13.67%).
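
The averages behind this conclusion can be checked with a short calculation. The sketch below copies the error rates from the performance table; the variable names are illustrative. Note that the full-model average covers 5 methods (QDA has no full-model result), while the reduced-model average covers 6.

```python
# 10-fold CV errors in percent, copied from the performance table.
full = {
    "Logistic Regression": 15.86,
    "LDA": 14.81,
    "Decision Tree": 22.46,
    "SVM (Radial Kernel)": 13.23,
    "SVM (Linear Kernel)": 16.47,
}
reduced = {
    "Logistic Regression": 13.12,
    "LDA": 14.46,
    "QDA": 15.75,
    "Random Forest": 12.21,
    "SVM (Radial Kernel)": 13.24,
    "SVM (Linear Kernel)": 13.24,
}

avg_full = sum(full.values()) / len(full)
avg_reduced = sum(reduced.values()) / len(reduced)

# Like-for-like comparison: leave QDA out of the reduced average as well.
reduced_no_qda = [v for name, v in reduced.items() if name != "QDA"]
avg_reduced_no_qda = sum(reduced_no_qda) / len(reduced_no_qda)

print(round(avg_full, 2), round(avg_reduced, 2))
print(round(avg_full - avg_reduced, 2))
print(round(avg_full - avg_reduced_no_qda, 2))
```

The gap is about 2.9 percentage points with the table's averages and about 3.3 points when QDA is excluded from both columns, so the "about 3%" figure holds either way.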

Comparing the performance of the 5 classification methods, we can also make the following tentative observations:

Logistic Regression is sensitive to predictors that provide redundant information, so the variable selection process noticeably improves its performance (15.86% to 13.12%).
LDA has problems when many of the predictors are categorical, since it assumes the predictors are normally distributed within each class.
QDA is similar to LDA, but its number of parameters grows quickly as predictors are added, because a separate covariance matrix is estimated for each class.
Random Forest (an ensemble of decision trees) is the best classifier for this data set, with a 12.21% error on the reduced model. Tree-based models capture nonlinear relationships and handle categorical predictors effectively.
SVM shows a strong ability to deal with redundant information. In general, SVMs rely on model regularization rather than feature selection. This may be why the error rate of the radial-kernel SVM barely changes after variable selection (13.23% vs. 13.24%), although the linear-kernel SVM does improve (16.47% to 13.24%).
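
The point about QDA's parameter count can be made concrete with a small calculation. This is a rough Python sketch under simplifying assumptions: two classes, each predictor treated as a single numeric column, and class priors not counted. The function names are illustrative.

```python
def lda_param_count(p, n_classes=2):
    """Class means plus one covariance matrix shared by all classes."""
    return n_classes * p + p * (p + 1) // 2


def qda_param_count(p, n_classes=2):
    """Class means plus a separate covariance matrix for each class."""
    return n_classes * p + n_classes * (p * (p + 1) // 2)


for p in (8, 55):
    print(p, lda_param_count(p), qda_param_count(p))
```

Under these assumptions, 8 predictors need 52 parameters for LDA and 88 for QDA, while 55 predictors need 1650 and 3190. With only 303 observations, the latter cannot be estimated reliably, which may be why no full-model QDA result is reported, though the source does not say so.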
