Heart disease occurs when the heart's blood supply is blocked by a build-up of plaque (fatty deposits) in the coronary arteries. It is a leading cause of death worldwide, yet it is also treatable and preventable given appropriate medication and lifestyle changes. Predicting heart disease therefore has great social value and can facilitate early treatment of patients.
The heart disease dataset can be obtained from the UCI Machine Learning Repository. The full dataset has 76 attributes; in this notebook we use the processed version, which contains a subset of 14 attributes, to train our model.
- Age
- Sex
- CP - Chest Pain Types (total of 4 types)
- Trestbps - Resting Blood Pressure (mmHg)
- Chol - Serum Cholesterol Level (mg/dl)
- Fbs - Fasting Blood Sugar > 120 mg/dl (0 for false, 1 for true)
- Restecg - Resting Electrocardiogram results
- Thalach - Maximum Heart Rate Achieved
- Exang - Exercise Induced Angina (0 for false, 1 for true)
- Oldpeak - ST depression induced by exercise relative to rest
- Slope - Slope of peak exercise ST segment
- Ca - Number of major vessels (0-3) colored by fluoroscopy
- Thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
- Num - the predicted attribute (0 for no heart disease, values greater than 0 for heart disease)
We do not necessarily have to understand what every attribute means medically, but it is important to know whether each one is continuous or categorical, as that will influence how we process the data.
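As a quick sketch of that processing step, the data can be loaded and the columns split by type as below. The URL, column names, and missing-value convention are assumptions based on the UCI repository's processed Cleveland file, so adjust them if your copy differs.

```python
import pandas as pd

# Assumed location and layout of the processed Cleveland file (no header,
# comma-separated, "?" marks missing values) -- adjust for your own copy.
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "heart-disease/processed.cleveland.data")
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

df = pd.read_csv(url, header=None, names=columns, na_values="?")

# Split the features by type so categorical columns can be encoded later
categorical = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]
continuous = ["age", "trestbps", "chol", "thalach", "oldpeak"]
```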
Source: (https://medium.com/swlh/decision-tree-classification-de64fc4d5aac)
A decision tree resembles the way humans make decisions with a flowchart. Each internal node holds a criterion or question, and the answer routes the decision process to the left or right child node. The leaves are the bottom layer of the tree, where a decision has been reached; in classification problems the leaves store the class labels (in this example, fit and unfit are the leaf nodes).
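As an illustrative sketch, a basic tree can be trained on this dataset with scikit-learn's `DecisionTreeClassifier`. This assumes the `df` and `categorical` variables from the loading step above, and treats `num > 0` as disease present.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Drop rows with missing values, one-hot encode the categorical features,
# and binarize the target (num > 0 means heart disease is present)
data = df.dropna()
X = pd.get_dummies(data.drop(columns="num"), columns=categorical)
y = (data["num"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))
```

An unconstrained tree like this will typically score close to 100% on the training set while doing noticeably worse on the test set, which is the overfitting problem discussed next.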
Decision trees are notorious for overfitting. Overfitting means a model fits the training data very well but generalizes poorly on test data or other real-world data. One way decision trees deal with overfitting is pruning, the process of removing nodes from the tree. In this notebook, we will explore cost-complexity pruning.
Cost-complexity pruning compares a series of progressively smaller sub-trees, each obtained by removing nodes from the previous one. Each sub-tree is scored by its training error plus a penalty term: alpha multiplied by the number of terminal nodes (leaves). A higher alpha therefore favors a tree with fewer nodes (keep in mind that the goal is to minimize this overall penalized error).
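In scikit-learn, the candidate alpha values can be computed with `cost_complexity_pruning_path`, and a tree can be pruned by passing `ccp_alpha` to the classifier. The rough sketch below reuses the train/test split from above and picks the alpha that scores best on the held-out set; in practice cross-validation is usually preferred for this choice.

```python
from sklearn.tree import DecisionTreeClassifier

# Effective alphas at which the full tree would be pruned, smallest to largest
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train)

# Fit one pruned tree per alpha and keep the one that generalizes best
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best alpha: {best_alpha:.4f}, test accuracy: {best_score:.3f}")
```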
- StatQuest Decision Tree (https://www.youtube.com/watch?v=q90UDEgYqeI)
- Hands-On Machine Learning with Scikit-Learn and TensorFlow - Aurélien Géron