Model Training Guide

Model training and deployment involves three steps:

  1. Environment setup
  2. Train the model
  3. Deploy the model

The model provided in this directory is DNNLinearCombinedClassifier. Refer to Wide & Deep Learning: Better Together with TensorFlow for more details.

NOTE: Before you start training, make sure all features are correctly configured in workflow/dags/tasks/preprocess/features_config.py. Refer to the Configure features step.

Step 1: Environment setup

Before starting training, the following environment variables need to be set:

MODEL_DIR=<model_dir>
TRAIN_DATA=<full_path_to_train_data>
EVAL_DATA=<full_path_to_eval_data>
TRANSFORM_DIR=<path_to_transformer_dir>
SCHEMA=<path_to_schema_proto>
TRAIN_STEPS=<train_steps>
TRAIN_BATCH_SIZE=<train_batch_size>
EVAL_STEPS=<eval_steps>
EVAL_BATCH_SIZE=<eval_batch_size>
NUM_EPOCHS=<num_epochs>
SAVE_CHECKPOINTS_STEP=<save_checkpoints_step>
KEEP_CHECKPOINT_MAX=<keep_checkpoint_max>
FIRST_LAYER_SIZE=<first_layer_size>
NUM_LAYERS=<num_layers>
DNN_OPTIMIZER=<dnn_optimizer>
LINEAR_OPTIMIZER=<linear_optimizer>

DESCRIPTION:

  • MODEL_DIR: Local path or path in Cloud Storage to store checkpoints and save the trained model.
  • TRAIN_DATA: Local path or path in Cloud Storage holding train dataset.
  • EVAL_DATA: Local path or path in Cloud Storage holding eval dataset.
  • TRANSFORM_DIR: Local path or path in Cloud Storage holding the transformer model saved during data transformation.
  • SCHEMA: Local path or path in Cloud Storage to the schema.pbtxt file generated during data transformation.
  • TRAIN_STEPS: Count of steps to run the training job for.
  • TRAIN_BATCH_SIZE: Train batch size.
  • EVAL_STEPS: Number of steps to run evaluation for at each checkpoint.
  • EVAL_BATCH_SIZE: Eval batch size.
  • NUM_EPOCHS: Number of epochs.
  • SAVE_CHECKPOINTS_STEP: Save checkpoints every this many steps.
  • KEEP_CHECKPOINT_MAX: The maximum number of recent checkpoint files to keep.
  • FIRST_LAYER_SIZE: Size of the first layer.
  • NUM_LAYERS: Number of layers.
  • DNN_OPTIMIZER: Optimizer for DNN model.
  • LINEAR_OPTIMIZER: Optimizer for linear model.

EXAMPLE:

export MODEL_DIR=~/sample_output/model
export TRAIN_DATA=~/sample_output/dataset/train-00000-of-00001.tfrecord
export EVAL_DATA=~/sample_output/dataset/eval-00000-of-00001.tfrecord
export TRANSFORM_DIR=~/sample_output/transformer
export SCHEMA=~/sample_output/transformer/schema.pbtxt
export TRAIN_STEPS=100
export TRAIN_BATCH_SIZE=10
export EVAL_STEPS=100
export EVAL_BATCH_SIZE=10
export NUM_EPOCHS=1
export SAVE_CHECKPOINTS_STEP=2
export KEEP_CHECKPOINT_MAX=1
export FIRST_LAYER_SIZE=4
export NUM_LAYERS=2
export DNN_OPTIMIZER=Adam
export LINEAR_OPTIMIZER=Ftrl
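
Before launching training, it can help to sanity-check that the referenced artifacts exist. A minimal check for local paths, using the example values above (use gsutil ls instead for Cloud Storage paths):

# Verify the datasets, schema and transformer output are where the variables point
ls $TRAIN_DATA $EVAL_DATA $SCHEMA
ls $TRANSFORM_DIR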

Step 2: Train the model

The model can be trained locally or on Cloud AI Platform.

Option 1: Train locally

From the root directory of the project, run the following:

NOTE: The hyperparameter values in the example above are arbitrary; tune them for your dataset.

gcloud ai-platform local train \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir $MODEL_DIR \
    -- \
    --train-data $TRAIN_DATA \
    --eval-data $EVAL_DATA \
    --transform-dir $TRANSFORM_DIR \
    --schema-file $SCHEMA \
    --train-steps $TRAIN_STEPS \
    --eval-steps $EVAL_STEPS \
    --eval-batch-size $EVAL_BATCH_SIZE \
    --train-batch-size $TRAIN_BATCH_SIZE \
    --num-epochs $NUM_EPOCHS \
    --save-checkpoints-steps $SAVE_CHECKPOINTS_STEP \
    --keep-checkpoint-max $KEEP_CHECKPOINT_MAX \
    --first-layer-size $FIRST_LAYER_SIZE \
    --num-layers $NUM_LAYERS \
    --dnn-optimizer $DNN_OPTIMIZER \
    --linear-optimizer $LINEAR_OPTIMIZER
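
While training runs (or after it completes), you can inspect loss and evaluation metrics with TensorBoard, which is installed alongside TensorFlow, by pointing it at the model directory:

tensorboard --logdir=$MODEL_DIR

Then open http://localhost:6006 (TensorBoard's default port) in a browser.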

Option 2: Train on the Cloud
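
The command below additionally expects a unique job name, a Cloud Storage output path, and a Cloud region. A minimal setup with illustrative values (the bucket name is a placeholder you must replace):

# JOB_NAME must be unique within the project; a timestamp suffix is one way to ensure that
export JOB_NAME=driblet_training_$(date +%Y%m%d_%H%M%S)
export OUTPUT_PATH=gs://<your_bucket>/model
export REGION=asia-northeast1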

gcloud ml-engine jobs submit training $JOB_NAME \
    --stream-logs \
    --job-dir $OUTPUT_PATH \
    --module-name trainer.task \
    --package-path trainer/ \
    --region $REGION \
    --scale-tier STANDARD_1 \
    -- \
    --train-data $TRAIN_DATA \
    --eval-data $EVAL_DATA \
    --transform-dir $TRANSFORM_DIR \
    --schema-file $SCHEMA \
    --train-steps $TRAIN_STEPS \
    --eval-steps $EVAL_STEPS \
    --eval-batch-size $EVAL_BATCH_SIZE \
    --train-batch-size $TRAIN_BATCH_SIZE \
    --num-epochs $NUM_EPOCHS \
    --save-checkpoints-steps $SAVE_CHECKPOINTS_STEP \
    --keep-checkpoint-max $KEEP_CHECKPOINT_MAX \
    --first-layer-size $FIRST_LAYER_SIZE \
    --num-layers $NUM_LAYERS \
    --dnn-optimizer $DNN_OPTIMIZER \
    --linear-optimizer $LINEAR_OPTIMIZER

Refer to this page for more details on training models on Google AI Platform.
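
If you submit the job without the --stream-logs flag, you can check its status and logs later with standard gcloud commands:

# Inspect job state and configuration
gcloud ml-engine jobs describe $JOB_NAME
# Attach to the job's log stream
gcloud ml-engine jobs stream-logs $JOB_NAME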

Step 3: Deploy the model

The model can be deployed manually using the Cloud SDK or as part of the Driblet setup.

Option 1: Manual deployment

To deploy the saved model on AI Platform, you first need to create a model resource on the Cloud and then deploy a version of it. For further details, check this help page.

First, set up the environment variables:

MODEL_NAME=<model_name>
REGION=<region> # e.g. asia-northeast1
MODEL_VERSION=<model_version>
MODEL_DIR=<path_to_saved_model>

DESCRIPTION:

  • MODEL_NAME: Name of the model.
  • REGION: Cloud region; refer to the list of available regions.
  • MODEL_VERSION: Version of your model.
  • MODEL_DIR: Path to saved model (local directory or Cloud Storage path).
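
EXAMPLE (values are illustrative; the exact SavedModel export path depends on your trainer's export settings):

export MODEL_NAME=driblet_model
export REGION=asia-northeast1
export MODEL_VERSION=v1
export MODEL_DIR=gs://<your_bucket>/model/export/<exported_model_dir>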

Create a model:

gcloud ml-engine models create ${MODEL_NAME} \
    --regions $REGION
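
To confirm the model resource was created, you can list the models in your project:

gcloud ml-engine models list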

Deploy model:

# Deploy model with version specified above
gcloud ml-engine versions create ${MODEL_VERSION} \
    --model ${MODEL_NAME} \
    --origin ${MODEL_DIR}
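
Once the version is deployed, you can verify it with an online prediction request. The instances.json file here is hypothetical; its contents must match the model's serving input signature:

# Send a test prediction; instances.json holds one JSON instance per line
gcloud ml-engine predict \
    --model ${MODEL_NAME} \
    --version ${MODEL_VERSION} \
    --json-instances instances.json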

Option 2: Automatic deployment

For automatic deployment as a part of Cloud environment setup, refer to Cloud environment setup.