Model training and deployment involves three steps: configuring features, training the model, and deploying the model.
The model provided in this directory is DNNLinearCombinedClassifier. Refer to Wide & Deep Learning: Better Together with TensorFlow for more details.
NOTE: Before you start training, make sure all features are correctly configured in workflow/dags/tasks/preprocess/features_config.py. Refer to the Configure features step.
Before starting training, the following environment variables need to be set:
MODEL_DIR=<model_dir>
TRAIN_DATA=<full_path_to_train_data>
EVAL_DATA=<full_path_to_eval_data>
TRANSFORM_DIR=<path_to_transfomer_dir>
SCHEMA=<path_to_schema_proto>
TRAIN_STEPS=<train_steps>
TRAIN_BATCH_SIZE=<train_batch_size>
EVAL_STEPS=<eval_steps>
EVAL_BATCH_SIZE=<eval_batch_size>
NUM_EPOCHS=<num_epochs>
SAVE_CHECKPOINTS_STEP=<save_checkpoints_step>
KEEP_CHECKPOINT_MAX=<keep_checkpoint_max>
FIRST_LAYER_SIZE=<first_layer_size>
NUM_LAYERS=<num_layers>
DNN_OPTIMIZER=<dnn_optimizer>
LINEAR_OPTIMIZER=<linear_optimizer>
DESCRIPTION:
MODEL_DIR: Local path or path in Cloud Storage to store checkpoints and save the trained model.
TRAIN_DATA: Local path or path in Cloud Storage holding the train dataset.
EVAL_DATA: Local path or path in Cloud Storage holding the eval dataset.
TRANSFORM_DIR: Local path or path in Cloud Storage that holds the model saved during data transformation.
SCHEMA: Local path or path in Cloud Storage to the schema.pbtxt file generated during data transformation.
TRAIN_STEPS: Count of steps to run the training job for.
TRAIN_BATCH_SIZE: Train batch size.
EVAL_STEPS: Number of steps to run evaluation for at each checkpoint.
EVAL_BATCH_SIZE: Eval batch size.
NUM_EPOCHS: Number of epochs.
SAVE_CHECKPOINTS_STEP: Save checkpoints every this many steps.
KEEP_CHECKPOINT_MAX: The maximum number of recent checkpoint files to keep.
FIRST_LAYER_SIZE: Size of the first layer.
NUM_LAYERS: Number of layers.
DNN_OPTIMIZER: Optimizer for the DNN model.
LINEAR_OPTIMIZER: Optimizer for the linear model.
EXAMPLE:
export MODEL_DIR=~/sample_output/model
export TRAIN_DATA=~/sample_output/dataset/train-00000-of-00001.tfrecord
export EVAL_DATA=~/sample_output/dataset/eval-00000-of-00001.tfrecord
export TRANSFORM_DIR=~/sample_output/transformer
export SCHEMA=~/sample_output/transformer/schema.pbtxt
export TRAIN_STEPS=100
export TRAIN_BATCH_SIZE=10
export EVAL_STEPS=100
export EVAL_BATCH_SIZE=10
export NUM_EPOCHS=1
export SAVE_CHECKPOINTS_STEP=2
export KEEP_CHECKPOINT_MAX=1
export FIRST_LAYER_SIZE=4
export NUM_LAYERS=2
export DNN_OPTIMIZER=Adam
export LINEAR_OPTIMIZER=Ftrl
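Before launching training, you can optionally check that the transform-step outputs referenced above exist. This is just a quick sanity check, not part of the original workflow:
# Verify that the datasets, schema and transformer artifacts are in place
ls $TRAIN_DATA $EVAL_DATA $SCHEMA
ls $TRANSFORM_DIR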
The model can be trained locally or on Cloud AI Platform.
To train locally, run the following from the root directory of the project.
NOTE: the hyperparameter values below are set arbitrarily.
gcloud ai-platform local train \
--module-name trainer.task \
--package-path trainer/ \
--job-dir $MODEL_DIR \
-- \
--train-data $TRAIN_DATA \
--eval-data $EVAL_DATA \
--transform-dir $TRANSFORM_DIR \
--schema-file $SCHEMA \
--train-steps $TRAIN_STEPS \
--eval-steps $EVAL_STEPS \
--eval-batch-size $EVAL_BATCH_SIZE \
--train-batch-size $TRAIN_BATCH_SIZE \
--num-epochs $NUM_EPOCHS \
--save-checkpoints-steps $SAVE_CHECKPOINTS_STEP \
--keep-checkpoint-max $KEEP_CHECKPOINT_MAX \
--first-layer-size $FIRST_LAYER_SIZE \
--num-layers $NUM_LAYERS \
--dnn-optimizer $DNN_OPTIMIZER \
--linear-optimizer $LINEAR_OPTIMIZER
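While training runs (or after it finishes), you can optionally monitor progress with TensorBoard, which is installed alongside TensorFlow. This is a convenience step, not part of the Driblet workflow itself:
# Point TensorBoard at the model directory to inspect training and eval metrics
tensorboard --logdir=$MODEL_DIR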
To train on AI Platform instead, submit a training job. This assumes JOB_NAME, OUTPUT_PATH and REGION are also set (illustrative values are shown after the command):
gcloud ai-platform jobs submit training $JOB_NAME \
--stream-logs \
--job-dir $OUTPUT_PATH \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
--scale-tier STANDARD_1 \
-- \
--train-data $TRAIN_DATA \
--eval-data $EVAL_DATA \
--transform-dir $TRANSFORM_DIR \
--schema-file $SCHEMA \
--train-steps $TRAIN_STEPS \
--eval-steps $EVAL_STEPS \
--eval-batch-size $EVAL_BATCH_SIZE \
--train-batch-size $TRAIN_BATCH_SIZE \
--num-epochs $NUM_EPOCHS \
--save-checkpoints-steps $SAVE_CHECKPOINTS_STEP \
--keep-checkpoint-max $KEEP_CHECKPOINT_MAX \
--first-layer-size $FIRST_LAYER_SIZE \
--num-layers $NUM_LAYERS \
--dnn-optimizer $DNN_OPTIMIZER \
--linear-optimizer $LINEAR_OPTIMIZER
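The cloud training command above additionally expects JOB_NAME, OUTPUT_PATH and REGION, which are not among the training variables listed earlier. The values below are illustrative placeholders only; the bucket name and region are assumptions, not values required by the project:
export REGION=us-central1
export JOB_NAME=driblet_training_$(date +%Y%m%d_%H%M%S)  # Must be unique per job
export OUTPUT_PATH=gs://your-bucket/driblet/model  # Cloud Storage path for checkpoints and the exported model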
Refer to this page for more details on training a model on Google AI Platform.
The model can be deployed manually using the Cloud SDK or as part of the Driblet setup.
To deploy a saved model on AI Platform, you first need to create a model resource in the cloud and then deploy a version of it. For further details, check this help page.
First, set up the environment variables:
MODEL_NAME=<model_name>
REGION=<region> # E.g. asia-northeast1
MODEL_VERSION=<model_version>
MODEL_DIR=<path_to_saved_model>
DESCRIPTION:
MODEL_NAME: Name of the model.
REGION: Cloud region; refer to the list of available regions.
MODEL_VERSION: Version of your model.
MODEL_DIR: Path to the saved model (local directory or Cloud Storage path).
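For reference, an illustrative set of values. The model name, version and bucket below are placeholders, not values defined by Driblet; note that --origin generally expects a Cloud Storage path unless a staging bucket is provided:
export MODEL_NAME=driblet_model
export MODEL_VERSION=v1
export REGION=asia-northeast1
export MODEL_DIR=gs://your-bucket/driblet/saved_model  # Directory containing the exported SavedModel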
Create a model:
gcloud ai-platform models create ${MODEL_NAME} \
--regions $REGION
Deploy model:
# Deploy model with version specified above
gcloud ai-platform versions create ${MODEL_VERSION} \
--model ${MODEL_NAME} \
--origin ${MODEL_DIR}
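To confirm the deployment, you can optionally describe the new version. This is a quick sanity check, not a required step:
# The version state should eventually report READY once deployment finishes
gcloud ai-platform versions describe ${MODEL_VERSION} --model ${MODEL_NAME}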
For automatic deployment as part of the Cloud environment setup, refer to Cloud environment setup.