In this all-in-one demo on Amazon EKS, we use JupyterHub to guide you through the initial stages of serving a pre-trained model and fine-tuning it with new data. We then shift gears to use the Ray Operator for optimized serving and Karpenter for custom resource allocation. The demo showcases how to achieve cost-effectiveness and multi-tenancy with multiple Ray clusters, all while walking you through each step in a Jupyter notebook.
- AWS Credentials configured
- AWS CLI
- kubectl
- Helm
- Terraform
- Spot Instance Linked Role
aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
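If the role already exists, this command returns an error that can be safely ignored. As a quick sanity check (assuming the default role name AWSServiceRoleForEC2Spot), you can look the role up directly:
# Verify the Spot service-linked role exists (default role name assumed)
aws iam get-role --role-name AWSServiceRoleForEC2Spot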
Training the models in this demonstration requires 8-10 g5.4xlarge Amazon EC2 instances, which together add up to 160 vCPUs of the G5 instance family. Make sure you have enough Service Quota available in your AWS account to accommodate the resources being created.
You can use this deep link to validate your current AWS Service Quota limits. Check the Applied quota value and, if needed, click Request increase at account-level at the top right.
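If you prefer the CLI, the sketch below checks the applied vCPU quota for the G and VT instance family; the quota code L-DB2E81BA is an assumption based on the current Service Quotas catalog, so confirm it in the console if the call fails:
# Applied quota for "Running On-Demand G and VT instances" (vCPU-based)
aws service-quotas get-service-quota --service-code ec2 --quota-code L-DB2E81BA --query 'Quota.Value'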
terraform init
terraform plan
terraform apply --auto-approve
This command provisions an EKS cluster along with the following components:
- JupyterHub: For development and analysis
- Nvidia GPU Operator: Allows Kubernetes cluster administrators to manage GPU nodes just like CPU nodes, without having to provision a special OS image for them
- Ray Operator: To manage Ray clusters
- Karpenter: For just-in-time node provisioning and automatic scaling
- Kube Prometheus Stack: For observability
- Apache Airflow: To automate the e2e ML pipeline, fetching DAGs from this Git repository
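Once the apply completes, a quick way to confirm these add-ons were installed is to list the Helm releases across all namespaces (release and namespace names depend on the blueprint defaults):
# List every Helm release in every namespace
helm list -A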
Since we will be pushing code to Amazon S3, let's export the BUCKET_NAME environment variable:
export BUCKET_NAME=$(terraform output -raw bucket_name)
Keep this bucket name handy; we will use it throughout the demo.
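As an illustration of how the bucket is used, you can verify access and stage a script in it; the file name and scripts/ prefix below are hypothetical and only show the pattern:
# Confirm the bucket is reachable
aws s3 ls s3://${BUCKET_NAME}
# Example only: push a training script to the bucket
aws s3 cp my_training_script.py s3://${BUCKET_NAME}/scripts/my_training_script.py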
terraform output -raw configure_kubectl | bash
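The configure_kubectl output is typically an aws eks update-kubeconfig command. If you prefer to run it manually, it looks roughly like the following; the cluster name is a placeholder and the region should match your Terraform variables:
# Equivalent manual step (substitute your actual cluster name and region)
aws eks update-kubeconfig --region us-west-2 --name <your-cluster-name>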
kubectl get nodes
You should see output similar to:
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-8-10-118.us-west-2.compute.internal   Ready    <none>   50m   v1.27.5-eks-43840fb
ip-10-8-17-213.us-west-2.compute.internal   Ready    <none>   51m   v1.27.5
ip-10-8-17-85.us-west-2.compute.internal    Ready    <none>   51m   v1.27.5
ip-10-8-21-138.us-west-2.compute.internal   Ready    <none>   51m   v1.27.5
ip-10-8-22-37.us-west-2.compute.internal    Ready    <none>   51m   v1.27.5
ip-10-8-30-194.us-west-2.compute.internal   Ready    <none>   22m   v1.27.5-eks-43840fb
ip-10-8-32-55.us-west-2.compute.internal    Ready    <none>   24m   v1.27.5-eks-43840fb
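To see which EC2 instance types back each node, you can surface the well-known instance-type label as an extra column (purely optional):
# Show the instance type of every node
kubectl get nodes -L node.kubernetes.io/instance-type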
Validate that all Pods are in the Running and Ready state.
kubectl get pods -A
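If some Pods are still starting, you can block until everything reports Ready instead of polling manually; note that completed Job Pods never become Ready, so you may need to scope this to specific namespaces, and the 10-minute timeout is arbitrary:
# Wait for all Pods in all namespaces to report Ready
kubectl wait --for=condition=Ready pods --all --all-namespaces --timeout=10m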
You're now ready to proceed with the demonstration.
The demonstration is broken down into two modules, each focusing on a specific aspect of fine-tuning Foundation Models like Falcon 7B on Amazon EKS. By the end of this demonstration, you'll have learned how to use notebooks powered by JupyterHub to craft your training and serving scripts and run them on dedicated Ray clusters.
1. Experimenting with and creating the serving and training scripts for fine-tuning in a Jupyter notebook
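To reach the notebooks, one option is to port-forward to the JupyterHub proxy service; the jupyterhub namespace and proxy-public service name are the defaults from the JupyterHub Helm chart and may differ in this deployment:
# Forward the JupyterHub proxy to http://localhost:8080
kubectl port-forward svc/proxy-public 8080:80 -n jupyterhub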
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
To tear down your environment, run the tear-down.sh script inside the terraform/scripts directory.
sh scripts/tear-down.sh