The Heritage Health Prize is a $3 million reward for the team that can best “identify patients who will be admitted to a hospital within the next year, using historical claims data.” The motivation for the competition is clear: over $30 billion was spent on unnecessary hospital admissions in 2006 alone. An accurate model would allow health care providers to administer more personalized care, thereby decreasing both these unnecessary hospital admissions and medical spending as a whole.
The prize is hosted by Kaggle, a website where teams of researchers tackle machine learning problems in a competitive environment. Kaggle provides relevant datasets as well as quantitative feedback for predictions made. The team that produces the most accurate model within the time frame wins the competition and is typically compensated in return for a description of their method.
https://foreverdata.org/1015/index.html
(Due to GitHub's limitations, the archived dataset is stored in data.zip. Please extract it and save the contents in the ./data folder.)
To measure model performance, we use a score called RMSLE (Root Mean Squared Logarithmic Error):

$$\varepsilon = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[\log(p_i + 1) - \log(a_i + 1)\right]^2}$$

Where:
- $i$ is a member;
- $n$ is the total number of members;
- $p_i$ is the predicted number of days spent in hospital for member $i$ in the test period;
- $a_i$ is the actual number of days spent in hospital for member $i$ in the test period.
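
As an illustration of the metric defined above, here is a minimal sketch in plain Python. The function name `rmsle` and its argument names are chosen for this example and are not taken from the project code; the natural logarithm is assumed.

```python
import math


def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error over two equal-length sequences.

    predicted[i] and actual[i] are the predicted and actual number of days
    in hospital for member i (both must be >= 0).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one member is required")
    # log1p(x) computes log(x + 1) accurately for small x.
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))
```

Because the error is taken on `log(x + 1)`, being wrong by one day for a member with zero days costs more than being wrong by one day for a member with ten days.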
- Importing data
- Data cleaning (handling missing values, handling categorical and continuous data, ...)
- Merging files
- Handling outliers
- Exploring data (visualizations, distribution of features, ...)
- Feature scaling (standardisation, mean normalisation, min-max scaling, ...)
- Linear Regression
- Ridge Regression
- Support Vector Regression
- Random Forests
- Logistic Regression
- Neural Networks
- Gradient Boosting Machines
- Model cross-validation (Stratified K-Fold)
- Hyperparameter tuning
- Model selection (observing correlations between models)
- Ensembling
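
The model selection and ensembling steps listed above can be sketched as follows. This is only an assumption about how such a step might look, not the project's actual code: `pearson` measures how correlated two models' predictions are (weakly correlated models tend to add more to an ensemble), and `blend_log_space` averages predictions on the `log(x + 1)` scale used by RMSLE. Both function names and the choice to average in log space are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length prediction vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant predictions")
    return cov / math.sqrt(var_x * var_y)


def blend_log_space(prediction_sets, weights=None):
    """Weighted average of several models' predictions on the log(x + 1) scale.

    prediction_sets is a list with one prediction list per model.
    weights defaults to equal weights; they are expected to sum to 1.
    """
    if weights is None:
        weights = [1.0 / len(prediction_sets)] * len(prediction_sets)
    blended = []
    for member_predictions in zip(*prediction_sets):
        log_average = sum(
            w * math.log1p(p) for w, p in zip(weights, member_predictions)
        )
        blended.append(math.expm1(log_average))  # back to days in hospital
    return blended
```

A typical use would be to compute `pearson` for every pair of models on a validation fold, keep models that score well but are not near-duplicates of each other, and pass their predictions to `blend_log_space`.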
Predictors | Score |
---|---|
Linear Regression | 0.4850 |
Ridge Regression | 0.4844 |
Support Vector Regression | 0.4783 |
Random Forests | 0.4846 |
Logistic Regression | 0.5065 |
Neural Networks | 0.48 |
Gradient Boosting Machines | 0.4790 |
Ensemble result: 0.00
Optimal constant value + Ensemble (Classification predictors): 0.4653
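
The source does not say how the optimal constant value was obtained or how it was combined with the ensemble. As background only: under RMSLE, the single constant prediction that minimises the error is the one whose `log(x + 1)` equals the mean of `log(a + 1)` over all members. A minimal sketch, with the hypothetical function name `optimal_constant`:

```python
import math


def optimal_constant(actual):
    """Constant prediction c that minimises RMSLE against `actual`.

    Minimising the mean of (log(c + 1) - log(a_i + 1))^2 over c gives
    log(c + 1) = mean(log(a_i + 1)).
    """
    if len(actual) == 0:
        raise ValueError("at least one member is required")
    mean_log = sum(math.log1p(a) for a in actual) / len(actual)
    return math.expm1(mean_log)
```

Such a constant is a useful baseline: any model scoring worse than it on RMSLE is not learning anything beyond the average member.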