INT20H 2022 Hackathon: Qualification round

Qualification task solution by Team GARCH for INT20H-2022.

Achieved 2nd place on the private leaderboard with AUC 0.98283.

Task

The competition can be found here: https://www.kaggle.com/c/techuklon-int20h

The task was to build a model that predicts the segment of churning drivers, i.e. drivers who will stop using the service, based on anonymized features collected over 4 weeks.

Solution

Our solution is based on stacking several gradient boosting decision tree models: their out-of-fold predictions are used as additional features for the final model (you can think of this as a kind of mean target encoding).

Submissions

Because the task was to predict churn for each Id regardless of time, we used two ways of preprocessing the weekly data (wide and long format) and trained the following base models:

  • LightGBM: Wide format - Unstack the over-time features to obtain 4 weekly columns from each. Additionally, we created features such as the mean value over weeks, the number of missing values, and the trend (fitted with linear regression), but they didn't help much (see the sketch after this list).

    [ROC AUC] public: 0.978, private: 0.978

  • LightGBM: Long format - Train the model on the data in its original state (one row per week), then average its predictions for each Id.

    [ROC AUC] public: 0.957, private: 0.958

  • CatBoost - trained on the wide data format. Its predictions differed somewhat from LightGBM's because CatBoost uses a different tree-building algorithm.

    [ROC AUC] public: 0.973, private: 0.974
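
Below is a minimal sketch of the wide-format preprocessing described above, assuming the data arrives in long format with an Id column, a week column, and anonymized per-week feature columns. All column names here are illustrative, not the actual competition names:

```python
import numpy as np
import pandas as pd


def to_wide(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Unstack weekly rows into one row per Id with engineered aggregates."""
    # 4 weekly columns per feature: f_w1 ... f_w4.
    wide = df.pivot(index="Id", columns="week", values=feature_cols)
    wide.columns = [f"{feat}_w{week}" for feat, week in wide.columns]

    for feat in feature_cols:
        weekly = df.pivot(index="Id", columns="week", values=feat)
        wide[f"{feat}_mean"] = weekly.mean(axis=1)             # mean value over weeks
        wide[f"{feat}_n_missing"] = weekly.isna().sum(axis=1)  # number of missing values

        # Trend: slope of a linear fit over the observed weeks.
        weeks = np.arange(weekly.shape[1])

        def slope(row: pd.Series) -> float:
            mask = row.notna().to_numpy()
            if mask.sum() < 2:
                return np.nan
            return np.polyfit(weeks[mask], row.to_numpy(dtype=float)[mask], deg=1)[0]

        wide[f"{feat}_trend"] = weekly.apply(slope, axis=1)

    return wide.reset_index()
```

For the long-format model no reshaping is needed: after predicting on one row per week, the per-Id score is simply the mean of that Id's weekly predictions (e.g. a groupby("Id").mean() over the prediction column).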

Finally, we concatenate the out-of-fold predictions of these models to the dataset and train the same model as in the wide-format case.
Score: [ROC AUC] public: 0.98278, private: 0.98283 - 2nd place on both leaderboards.
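
A rough sketch of that stacking step, assuming scikit-learn-style base models and a wide-format feature matrix with churn labels (names and hyperparameters are illustrative; the actual notebooks may differ):

```python
import lightgbm as lgb
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def fit_stacked_model(X_wide: pd.DataFrame, y: pd.Series) -> lgb.LGBMClassifier:
    """Append out-of-fold base-model predictions as features, then fit the final model."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    base_models = {
        "lgbm_wide": lgb.LGBMClassifier(),
        "catboost": CatBoostClassifier(verbose=0),
    }

    X_stacked = X_wide.copy()
    for name, model in base_models.items():
        # Out-of-fold churn probabilities: the final model never sees
        # predictions produced on its own training rows.
        oof = cross_val_predict(model, X_wide, y, cv=cv, method="predict_proba")[:, 1]
        X_stacked[f"oof_{name}"] = oof

    # Same LightGBM setup as the wide-format submission, now trained on the
    # original features plus the out-of-fold prediction columns.
    final_model = lgb.LGBMClassifier()
    final_model.fit(X_stacked, y)
    return final_model
```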

Ideas that didn't work:

  • Additional preprocessing: filling NaNs, replacing zeros with NaNs, creating feature combinations based on their adjacency in the LightGBM trees
  • Feature selection with Recursive Feature Elimination
  • Nearest-neighbor features: the ratio of each class among the k nearest neighbors and the average distance to members of each class (see the sketch after this list)
  • Hyperparameter tuning - the score difference was insignificant
  • Traditional two-level stacking with other first-level models (kNN, Naive Bayes, Random Forest) and different approaches to building the meta-model: linear combinations and decision trees
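
For reference, a minimal sketch of the nearest-neighbor features we tried (class ratio among the k nearest neighbors and mean distance to each class), assuming a numeric wide-format matrix; the helper name and the value of k are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def knn_features(X_train: pd.DataFrame, y_train: pd.Series,
                 X: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Class ratio and per-class mean distance among the k nearest training neighbors."""
    scaler = StandardScaler()
    Xt = scaler.fit_transform(X_train.fillna(0))
    nn = NearestNeighbors(n_neighbors=k).fit(Xt)
    dist, idx = nn.kneighbors(scaler.transform(X.fillna(0)))

    neighbor_labels = y_train.to_numpy()[idx]      # shape (n_samples, k)
    churn_ratio = neighbor_labels.mean(axis=1)     # share of churned neighbors

    # Mean distance to neighbors of each class (NaN if no neighbor of that class).
    dist_churn = np.where(neighbor_labels == 1, dist, np.nan)
    dist_stay = np.where(neighbor_labels == 0, dist, np.nan)
    return pd.DataFrame({
        "knn_churn_ratio": churn_ratio,
        "knn_mean_dist_churn": np.nanmean(dist_churn, axis=1),
        "knn_mean_dist_stay": np.nanmean(dist_stay, axis=1),
    }, index=X.index)
```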

Files:

The final submission is in final-model.ipynb.

Other submissions can be found in the stacking folder.

Additional kernels:
