Track №3 task solution by Team GARCH with RMSE 20.12592
The competition can be found at https://www.kaggle.com/c/pti-hack
The task was to predict the probability of a deal closing successfully, given the history of interactions with clients.
Our approach builds a classifier, then trains LightGBM and CatBoost models separately on the successful and unsuccessful cases and stacks them. Because the train and test data overlap, we had to define a time-aware prediction scheme, so that the target for each new element is predicted from past data only.
Reasons for using this method:
- Avoid overfitting at the intersection of train and test
- Avoid leaks during the generation of new features: prevent situations where information from the future flows into the past
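A minimal sketch of one way to set up such a time-aware scheme, assuming each row carries a timestamp (CreatedDate is used here) and that a row may only be predicted from rows created strictly before it. The helper name `past_only_folds` and the toy data are illustrative assumptions and are not taken from the notebook.

```python
import pandas as pd

def past_only_folds(df, time_col, n_folds):
    """Yield (train_idx, predict_idx) pairs in which every training row
    is strictly older than every row being predicted."""
    ordered = df.sort_values(time_col, kind="stable")
    # Split the distinct timestamps (not the rows) into chronological blocks,
    # so rows sharing a timestamp never land on both sides of a cut.
    times = ordered[time_col].drop_duplicates().to_numpy()
    block_size = -(-len(times) // n_folds)  # ceiling division
    # The first block is used for training only; each later block is predicted
    # from everything that came before it.
    for start in range(block_size, len(times), block_size):
        block = times[start:start + block_size]
        train_idx = ordered.index[ordered[time_col] < block[0]]
        predict_idx = ordered.index[ordered[time_col].isin(block)]
        yield train_idx, predict_idx

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "CreatedDate": pd.to_datetime([
        "2021-01-05", "2021-01-01", "2021-01-03",
        "2021-01-03", "2021-01-09", "2021-01-07",
    ]),
    "target": [10, 20, 30, 40, 50, 60],
})
folds = [(list(tr), list(pr)) for tr, pr in past_only_folds(toy, "CreatedDate", n_folds=3)]
```

Each fold trains on a growing prefix of the timeline. The pointwise variant described above would correspond to the limiting case where every block holds a single timestamp, which is more faithful but also much slower.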
- Dates: CreatedDate, CreatedDateForInsert, ValidThroughDate - differences, quarters, years, sin-cos encoding
- Lags: statistics of previous probabilities grouped by Opportunity, CreatedBy and periods
- Categorical: CreatedById, AccountId, RecordTypeId, Type, LeadSource, CampaignId, etc.
- CountVectorizer: Needs__c - split into multiple features
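The date and lag features can be sketched roughly as follows. This is an illustration under assumptions rather than the notebook's code: the target column name `Probability`, the grouping column name `Opportunity`, the choice of month as the cyclical unit, and both helper names are hypothetical.

```python
import numpy as np
import pandas as pd

def add_date_features(df):
    out = df.copy()
    created = out["CreatedDate"]
    out["created_year"] = created.dt.year
    out["created_quarter"] = created.dt.quarter
    # Difference between two of the date columns, in days.
    out["valid_minus_created_days"] = (out["ValidThroughDate"] - created).dt.days
    # Cyclical (sin-cos) encoding: December and January end up close together.
    angle = 2 * np.pi * (created.dt.month - 1) / 12
    out["created_month_sin"] = np.sin(angle)
    out["created_month_cos"] = np.cos(angle)
    return out

def add_lag_mean(df, group_col, target_col, time_col):
    out = df.sort_values(time_col, kind="stable").copy()
    # shift(1) excludes the current row from its own statistic, so only
    # earlier rows of the same group contribute (no look-ahead).
    out[f"{target_col}_prev_mean_by_{group_col}"] = (
        out.groupby(group_col)[target_col]
        .transform(lambda s: s.shift(1).expanding().mean())
    )
    return out

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "Opportunity": ["a", "a", "b", "a"],
    "CreatedDate": pd.to_datetime(["2021-01-10", "2021-04-10", "2021-04-11", "2021-12-10"]),
    "ValidThroughDate": pd.to_datetime(["2021-01-20", "2021-04-15", "2021-05-11", "2021-12-31"]),
    "Probability": [10.0, 30.0, 50.0, 90.0],
})
features = add_lag_mean(add_date_features(toy), "Opportunity", "Probability", "CreatedDate")
```

The first row of every group has no history, so its lag value is NaN. Gradient-boosting libraries such as LightGBM and CatBoost accept missing values in numeric features, so this can usually be left as is.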
- Divide the target by 100 and build the LightGBM model with the logloss objective.
- Build a classifier model (target - StageName - a forecast of how the deal will ultimately end: 0 - unsuccessfully, 1 - successfully).
- For each point in the dataset, predict the value
- Divide the dataset into 2 parts: successful and unsuccessful cases.
- On each of the parts, build a separate LGBMRegressor and CatBoost to predict the final value of the probability.
- Stacking of CatBoost and LightGBM models (with coefficients 0.4 and 0.6, respectively) in each of the categories.
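The final combination step can be sketched with plain NumPy. The 0.4 / 0.6 weights come from the list above; everything else is an assumption made for illustration: the helper names, the idea that each row is routed to the "successful" or "unsuccessful" models by the classifier's predicted class, and the toy prediction values.

```python
import numpy as np

CATBOOST_WEIGHT = 0.4  # coefficients quoted in the list above
LIGHTGBM_WEIGHT = 0.6

def blend(catboost_pred, lightgbm_pred):
    """Weighted average of the two models' predictions."""
    return (CATBOOST_WEIGHT * np.asarray(catboost_pred, dtype=float)
            + LIGHTGBM_WEIGHT * np.asarray(lightgbm_pred, dtype=float))

def route_and_blend(is_success, success_preds, failure_preds):
    """is_success: 0/1 classifier output per row.
    success_preds / failure_preds: (catboost_pred, lightgbm_pred) pairs from
    the models trained on the successful / unsuccessful part, each
    evaluated on all rows."""
    mask = np.asarray(is_success).astype(bool)
    # Assumption: rows are routed by the predicted class.
    return np.where(mask, blend(*success_preds), blend(*failure_preds))

# Toy predictions (hypothetical values).
final = route_and_blend(
    is_success=[1, 0, 1],
    success_preds=([90, 80, 100], [100, 90, 100]),
    failure_preds=([10, 0, 20], [20, 10, 30]),
)
```

With these toy inputs the first and third rows take the blended "successful" prediction and the second row takes the blended "unsuccessful" one.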
pti_hack.ipynb (nbviewer) - Main notebook used for creating the final stacking model
Because of the exhaustive pointwise, time-respecting predictions used for stacking, the notebook takes approximately 1 hour to run on 16 CPUs with n_jobs=32.
The same code as a .py script, plus the additional files needed to run it within a Docker container
cd PTI-Hack-2022/docker
docker build -t pti_hack .
docker tag pti_hack:latest <your_username>/pti_hack:latest
docker push <your_username>/pti_hack
KAGGLE_USERNAME - username in Kaggle
KAGGLE_TOKEN - Kaggle API token
N_JOBS - number of jobs for parallel execution
curl -H "Content-Type: application/json" \
-H "Authorization: Zod58 {{your_api_key}}" \
-X POST https://offchain.zod.tv/job_new -d @- << EOF
{
"type": "docker",
"path": "docker.io/<your_username>/pti_hack",
"cpu": 32,
"ram": 64,
"disk": 30
}
EOF
The prediction will be submitted automatically after execution ends.
For some reason, the container sometimes fails with joblib.externals.loky.process_executor.TerminatedWorkerError when running on zod.tv. Consider setting N_JOBS small enough to prevent this (although this can significantly slow down training).
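One way to keep N_JOBS within safe bounds is to clamp it when the script starts. The sketch below is a hypothetical helper and not part of the repository; the default value and the cap are assumptions to tune for the machine at hand. A common cause of TerminatedWorkerError is a worker being killed for excessive memory use, so fewer parallel workers may help when RAM is the bottleneck.

```python
import os

def resolve_n_jobs(env=os.environ, default=4, cap=None):
    """Read N_JOBS from the environment and clamp it to the range [1, cap].

    cap defaults to the number of CPUs visible to Python.
    """
    if cap is None:
        cap = os.cpu_count() or 1
    try:
        requested = int(env.get("N_JOBS", ""))
    except ValueError:
        # Missing or non-numeric value: fall back to the default.
        requested = default
    return max(1, min(requested, cap))

# The result would then be passed as n_jobs to the parallel parts of the pipeline.
n_jobs = resolve_n_jobs()
```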