INT20H 2022 Hackathon: Qualification round

Qualification task solution by Team GARCH for INT20H-2022.

Achieved 2nd place on the private leaderboard with AUC 0.98283.

Task

The competition can be found here: https://www.kaggle.com/c/techuklon-int20h

The task was to build a model that predicts the segment of churning drivers, i.e. drivers who will stop using the service, based on anonymized features collected over 4 weeks.

Solution

Our solution is based on stacking several gradient boosting decision tree models: their out-of-fold predictions are used as additional features for the final model (you can think of this as a kind of mean target encoding).

Submissions

Because the task was to predict churn for each Id regardless of time, we used two ways of preprocessing the weekly data (wide and long format) and trained the following base models:

  • LightGBM: Wide format - Unstack the over-time features to obtain 4 weekly columns from each. Additionally, we created features such as the mean value over weeks, the number of missing values, and the trend (fitted with linear regression), but they didn't help much (see the sketch after this list).

    [ROC AUC] public: 0.978, private: 0.978

  • LightGBM: Long format - Train the model on the data in its original state (one row per week), then average its predictions for each Id.

    [ROC AUC] public: 0.957, private: 0.958

  • CatBoost - trained on the wide data format. Its predictions differed somewhat from LightGBM's because CatBoost uses a different tree-building algorithm.

    [ROC AUC] public: 0.973, private: 0.974
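
Below is a minimal sketch of the wide-format preprocessing described above, assuming the data arrives in long format with an Id column, a week column, and anonymized per-week feature columns. All column names here are illustrative, not the actual competition names:

```python
import numpy as np
import pandas as pd


def to_wide(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Unstack weekly rows into one row per Id with engineered aggregates."""
    # 4 weekly columns per feature: f_w1 ... f_w4.
    wide = df.pivot(index="Id", columns="week", values=feature_cols)
    wide.columns = [f"{feat}_w{week}" for feat, week in wide.columns]

    for feat in feature_cols:
        weekly = df.pivot(index="Id", columns="week", values=feat)
        wide[f"{feat}_mean"] = weekly.mean(axis=1)             # mean value over weeks
        wide[f"{feat}_n_missing"] = weekly.isna().sum(axis=1)  # number of missing values

        # Trend: slope of a linear fit over the observed weeks.
        weeks = np.arange(weekly.shape[1])

        def slope(row: pd.Series) -> float:
            mask = row.notna().to_numpy()
            if mask.sum() < 2:
                return np.nan
            return np.polyfit(weeks[mask], row.to_numpy(dtype=float)[mask], deg=1)[0]

        wide[f"{feat}_trend"] = weekly.apply(slope, axis=1)

    return wide.reset_index()
```

For the long-format model no reshaping is needed: after predicting on one row per week, the per-Id score is simply the mean of that Id's weekly predictions (e.g. a groupby("Id").mean() over the prediction column).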

Finally, we concatenate the out-of-fold predictions of these models to the dataset and train the same model as in the wide-format case.
Score: [ROC AUC] public: 0.98278, private: 0.98283 - 2nd place on both leaderboards.
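
A rough sketch of that stacking step, assuming scikit-learn-style base models and a wide-format feature matrix with churn labels (names and hyperparameters are illustrative; the actual notebooks may differ):

```python
import lightgbm as lgb
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def fit_stacked_model(X_wide: pd.DataFrame, y: pd.Series) -> lgb.LGBMClassifier:
    """Append out-of-fold base-model predictions as features, then fit the final model."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    base_models = {
        "lgbm_wide": lgb.LGBMClassifier(),
        "catboost": CatBoostClassifier(verbose=0),
    }

    X_stacked = X_wide.copy()
    for name, model in base_models.items():
        # Out-of-fold churn probabilities: the final model never sees
        # predictions produced on its own training rows.
        oof = cross_val_predict(model, X_wide, y, cv=cv, method="predict_proba")[:, 1]
        X_stacked[f"oof_{name}"] = oof

    # Same LightGBM setup as the wide-format submission, now trained on the
    # original features plus the out-of-fold prediction columns.
    final_model = lgb.LGBMClassifier()
    final_model.fit(X_stacked, y)
    return final_model
```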

Ideas that didn't work:

  • Additional preprocessing: filling NaNs, replacing zeros with NaNs, creating feature combinations based on their adjacency in the LightGBM trees
  • Feature selection with Recursive Feature Elimination
  • Nearest-neighbor features: the ratio of each class among the k nearest neighbors and the average distance to members of each class (see the sketch after this list)
  • Hyperparameter tuning - the score difference was insignificant
  • Traditional two-level stacking with other first-level models (kNN, Naive Bayes, Random Forest) and different approaches to building the meta-model: linear combinations and decision trees
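
For reference, a minimal sketch of the nearest-neighbor features we tried (class ratio among the k nearest neighbors and mean distance to each class), assuming a numeric wide-format matrix; the helper name and the value of k are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def knn_features(X_train: pd.DataFrame, y_train: pd.Series,
                 X: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Class ratio and per-class mean distance among the k nearest training neighbors."""
    scaler = StandardScaler()
    Xt = scaler.fit_transform(X_train.fillna(0))
    nn = NearestNeighbors(n_neighbors=k).fit(Xt)
    dist, idx = nn.kneighbors(scaler.transform(X.fillna(0)))

    neighbor_labels = y_train.to_numpy()[idx]      # shape (n_samples, k)
    churn_ratio = neighbor_labels.mean(axis=1)     # share of churned neighbors

    # Mean distance to neighbors of each class (NaN if no neighbor of that class).
    dist_churn = np.where(neighbor_labels == 1, dist, np.nan)
    dist_stay = np.where(neighbor_labels == 0, dist, np.nan)
    return pd.DataFrame({
        "knn_churn_ratio": churn_ratio,
        "knn_mean_dist_churn": np.nanmean(dist_churn, axis=1),
        "knn_mean_dist_stay": np.nanmean(dist_stay, axis=1),
    }, index=X.index)
```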

Files:

The final submission is in final-model.ipynb.

Other submissions can be found in the stacking folder.

Additional kernels:
