Help train on tpu v3-32 #74

StableFluffy · 2024-01-04T23:51:46Z

StableFluffy
Jan 4, 2024

#34
I read this issue and tried it. but couldn"t make it work :(

Hi, Thank you for your amazing work.

I"ve been trying few days to make tpu v3-32 to work.

I used tpu VM "tpu-ubuntu2204-base" and tried by following code after installing jax and etc to each tpus

train.py


import jax.numpy
from EasyDel import (
    TrainArguments,
    CausalLanguageModelTrainer,
    AutoEasyDelModelForCausalLM,
    EasyDelOptimizers,
    EasyDelSchedulers,
    EasyDelGradientCheckPointers
)
from datasets import load_dataset
import flax
from jax import numpy as jnp
from transformers import AutoTokenizer

huggingface_repo_id_or_path = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

model, params = AutoEasyDelModelForCausalLM.from_pretrained(huggingface_repo_id_or_path, )

max_length = 2048
tokenizer = AutoTokenizer.from_pretrained(
    huggingface_repo_id_or_path,
    trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token
configs_to_init_model_class = {
    "config": model.config,
    "dtype": jnp.bfloat16,
    "param_dtype": jnp.bfloat16,
    "input_shape": (1, 1)
}

train_arguments = TrainArguments(
    model_class=type(model),
    model_name="my_first_model_to_train_using_easydel",
    num_train_epochs=3,
    learning_rate=5e-5,
    learning_rate_end=1e-6,
    optimizer=EasyDelOptimizers.ADAMW,  # "adamw", "lion", "adafactor" are supported
    scheduler=EasyDelSchedulers.LINEAR,
    # "linear","cosine", "none" ,"warm_up_cosine" and "warm_up_linear"  are supported
    weight_decay=0.01,
    total_batch_size=64,
    max_steps=None,  # None to let trainer Decide
    do_train=True,
    do_eval=False,  # it"s optional but supported 
    backend="tpu",  # default backed is set to cpu, so you must define you want to use tpu cpu or gpu
    max_length=max_length,  # Note that you have to change this in the model config too
    gradient_checkpointing=EasyDelGradientCheckPointers.NOTHING_SAVEABLE,
    sharding_array=(1, -1, 1, 1),  # the way to shard model across gpu,cpu or TPUs using sharding array (1, -1, 1, 1)
    # everything training will be in fully FSDP automatic and share data between devices
    use_pjit_attention_force=False,
    remove_ckpt_after_load=True,
    gradient_accumulation_steps=8,
    loss_remat="",
    dtype=jnp.bfloat16
)


def ultra_chat_prompting_process(
        data_chunk
):
    user_part = [
        chunk["content"] for chunk in data_chunk["messages"] if chunk["role"] == "user"
    ]
    assistant_part = [
        chunk["content"] for chunk in data_chunk["messages"] if chunk["role"] == "assistant"
    ]

    prompt = ""

    for uc, ac in zip(user_part, assistant_part):
        prompt += f"<|user|>\n{uc}</s>\n<|assistant|>\n{ac}</s>\n"

    return {"prompt": prompt}


tokenization_process = lambda data_chunk: tokenizer(
    data_chunk["prompt"],
    add_special_tokens=False,
    max_length=max_length,
    padding="max_length"
)

dataset = load_dataset("HuggingFaceH4/ultrachat_200k")
dataset_train = dataset["train_gen"].map(ultra_chat_prompting_process, num_proc=12)
dataset_train = dataset_train.map(
    tokenization_process,
    num_proc=12,
    remove_columns=dataset_train.column_names
)

# you can do the same for evaluation process dataset

trainer = CausalLanguageModelTrainer(
    train_arguments,
    dataset,
    checkpoint_path=None
)

output = trainer.train(flax.core.FrozenDict({"params": params}))
print(f"Hey ! , here"s where your model saved {output.last_save_file_name}")

Then I sent it to tpus by
sudo gcloud compute tpus tpu-vm scp train.py node-1: --worker=all --zone=europe-west4-a

and ran it
sudo gcloud compute tpus tpu-vm ssh node-1 --zone=europe-west4-a --worker=all --command="python3 train.py"

and got error

SSH: Attempting to connect to worker 0...
SSH: Attempting to connect to worker 1...
SSH: Attempting to connect to worker 2...
SSH: Attempting to connect to worker 3...
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode="default".
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode="default".
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode="default".
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode="default".
  table = cls._concat_blocks(blocks, axis=0)
wandb: Currently logged in as: cyine. Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: cyine. Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: cyine. Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: cyine. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /root/wandb/run-20240104_234925-ri9ge2oc
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run vibrant-sound-5
wandb: ⭐️ View project at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel
wandb: 🚀 View run at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/ri9ge2oc
Warning : In case of using finetune = True and Passing checkpoint_path = None you should pass parametersin train function
Time For configure dataloaders (ms) : 0.2741813659667969
Traceback (most recent call last):
  File "/root/train.py", line 94, in <module>
    trainer = CausalLanguageModelTrainer(
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 233, in __init__
    self.init_functions()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 305, in init_functions
    self.model, self.tx, self.scheduler, self.config = self.configure_model()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 381, in configure_model
    if not hasattr(self.arguments.configs_to_init_model_class["config"], "get_partition_rules"):
TypeError: "NoneType" object is not subscriptable
Traceback (most recent call last):
  File "/root/train.py", line 94, in <module>
    trainer = CausalLanguageModelTrainer(
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 233, in __init__
    self.init_functions()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 305, in init_functions
    self.model, self.tx, self.scheduler, self.config = self.configure_model()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 381, in configure_model
    if not hasattr(self.arguments.configs_to_init_model_class["config"], "get_partition_rules"):
TypeError: "NoneType" object is not subscriptable
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /root/wandb/run-20240104_234925-enyk3wwr
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run vital-sun-5
wandb: ⭐️ View project at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel
wandb: 🚀 View run at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/enyk3wwr
Warning : In case of using finetune = True and Passing checkpoint_path = None you should pass parametersin train function
Time For configure dataloaders (ms) : 0.27751922607421875
Traceback (most recent call last):
  File "/root/train.py", line 94, in <module>
    trainer = CausalLanguageModelTrainer(
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 233, in __init__
    self.init_functions()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 305, in init_functions
    self.model, self.tx, self.scheduler, self.config = self.configure_model()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 381, in configure_model
    if not hasattr(self.arguments.configs_to_init_model_class["config"], "get_partition_rules"):
TypeError: "NoneType" object is not subscriptable
Traceback (most recent call last):
  File "/root/train.py", line 94, in <module>
    trainer = CausalLanguageModelTrainer(
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 233, in __init__
    self.init_functions()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 305, in init_functions
    self.model, self.tx, self.scheduler, self.config = self.configure_model()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 381, in configure_model
    if not hasattr(self.arguments.configs_to_init_model_class["config"], "get_partition_rules"):
TypeError: "NoneType" object is not subscriptable
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /root/wandb/run-20240104_234925-wvn0sguc
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run northern-vortex-5
wandb: ⭐️ View project at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel
wandb: 🚀 View run at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/wvn0sguc
Warning : In case of using finetune = True and Passing checkpoint_path = None you should pass parametersin train function
Time For configure dataloaders (ms) : 0.29778480529785156
Traceback (most recent call last):
  File "/root/train.py", line 94, in <module>
    trainer = CausalLanguageModelTrainer(
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 233, in __init__
    self.init_functions()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 305, in init_functions
    self.model, self.tx, self.scheduler, self.config = self.configure_model()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 381, in configure_model
    if not hasattr(self.arguments.configs_to_init_model_class["config"], "get_partition_rules"):
TypeError: "NoneType" object is not subscriptable
Traceback (most recent call last):
  File "/root/train.py", line 94, in <module>
    trainer = CausalLanguageModelTrainer(
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 233, in __init__
    self.init_functions()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 305, in init_functions
    self.model, self.tx, self.scheduler, self.config = self.configure_model()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 381, in configure_model
    if not hasattr(self.arguments.configs_to_init_model_class["config"], "get_partition_rules"):
TypeError: "NoneType" object is not subscriptable
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /root/wandb/run-20240104_234925-cqws9qy2
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run dazzling-sea-8
wandb: ⭐️ View project at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel
wandb: 🚀 View run at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/cqws9qy2
Warning : In case of using finetune = True and Passing checkpoint_path = None you should pass parametersin train function
Time For configure dataloaders (ms) : 0.278472900390625
Traceback (most recent call last):
  File "/root/train.py", line 94, in <module>
    trainer = CausalLanguageModelTrainer(
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 233, in __init__
    self.init_functions()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 305, in init_functions
    self.model, self.tx, self.scheduler, self.config = self.configure_model()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 381, in configure_model
    if not hasattr(self.arguments.configs_to_init_model_class["config"], "get_partition_rules"):
TypeError: "NoneType" object is not subscriptable
Traceback (most recent call last):
  File "/root/train.py", line 94, in <module>
    trainer = CausalLanguageModelTrainer(
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 233, in __init__
    self.init_functions()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 305, in init_functions
    self.model, self.tx, self.scheduler, self.config = self.configure_model()
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 381, in configure_model
    if not hasattr(self.arguments.configs_to_init_model_class["config"], "get_partition_rules"):
TypeError: "NoneType" object is not subscriptable
wandb: 🚀 View run vibrant-sound-5 at: https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/ri9ge2oc
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240104_234925-ri9ge2oc/logs
wandb: 🚀 View run northern-vortex-5 at: https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/wvn0sguc
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240104_234925-wvn0sguc/logs
wandb: 🚀 View run vital-sun-5 at: https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/enyk3wwr
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240104_234925-enyk3wwr/logs
wandb: 🚀 View run dazzling-sea-8 at: https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/cqws9qy2
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240104_234925-cqws9qy2/logs
##### Command execution on worker 0 failed with exit status 1. Continuing.
##### Command execution on worker 3 failed with exit status 1. Continuing.
##### Command execution on worker 1 failed with exit status 1. Continuing.
##### Command execution on worker 2 failed with exit status 1. Continuing.

The issue is

The training failed.
It seems to be work not parallel but on each pod.

Thank you,

erfanzar · 2024-01-05T08:00:12Z

erfanzar
Jan 5, 2024
Maintainer

can you please pull and re-run the script ?

0 replies

StableFluffy · 2024-01-05T12:28:06Z

StableFluffy
Jan 5, 2024
Author

still same.

tried to change loss_remat to loss_re_mat as your commit

0 replies

erfanzar · 2024-01-05T17:24:34Z

erfanzar
Jan 5, 2024
Maintainer

actually there"s a funny bug in your code that i notice that right now

you haven"t use configs_to_init_model_class in TrainArguments
run this one

import jax.numpy
from EasyDel import (
    TrainArguments,
    CausalLanguageModelTrainer,
    AutoEasyDelModelForCausalLM,
    EasyDelOptimizers,
    EasyDelSchedulers,
    EasyDelGradientCheckPointers
)
from datasets import load_dataset
import flax
from jax import numpy as jnp
from transformers import AutoTokenizer

huggingface_repo_id_or_path = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

model, params = AutoEasyDelModelForCausalLM.from_pretrained(huggingface_repo_id_or_path, )

max_length = 2048
tokenizer = AutoTokenizer.from_pretrained(
    huggingface_repo_id_or_path,
    trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token
configs_to_init_model_class = {
    "config": model.config,
    "dtype": jnp.bfloat16,
    "param_dtype": jnp.bfloat16,
    "input_shape": (1, 1)
}

train_arguments = TrainArguments(
    model_class=type(model),
    model_name="my_first_model_to_train_using_easydel",
    num_train_epochs=3,
    configs_to_init_model_class=configs_to_init_model_class ,
    learning_rate=5e-5,
    learning_rate_end=1e-6,
    optimizer=EasyDelOptimizers.ADAMW,  # "adamw", "lion", "adafactor" are supported
    scheduler=EasyDelSchedulers.LINEAR,
    # "linear","cosine", "none" ,"warm_up_cosine" and "warm_up_linear"  are supported
    weight_decay=0.01,
    total_batch_size=64,
    max_steps=None,  # None to let trainer Decide
    do_train=True,
    do_eval=False,  # it"s optional but supported 
    backend="tpu",  # default backed is set to cpu, so you must define you want to use tpu cpu or gpu
    max_length=max_length,  # Note that you have to change this in the model config too
    gradient_checkpointing=EasyDelGradientCheckPointers.NOTHING_SAVEABLE,
    sharding_array=(1, -1, 1, 1),  # the way to shard model across gpu,cpu or TPUs using sharding array (1, -1, 1, 1)
    # everything training will be in fully FSDP automatic and share data between devices
    use_pjit_attention_force=False,
    remove_ckpt_after_load=True,
    gradient_accumulation_steps=8,
    loss_re_mat="",
    dtype=jnp.bfloat16
)


def ultra_chat_prompting_process(
        data_chunk
):
    user_part = [
        chunk["content"] for chunk in data_chunk["messages"] if chunk["role"] == "user"
    ]
    assistant_part = [
        chunk["content"] for chunk in data_chunk["messages"] if chunk["role"] == "assistant"
    ]

    prompt = ""

    for uc, ac in zip(user_part, assistant_part):
        prompt += f"<|user|>\n{uc}</s>\n<|assistant|>\n{ac}</s>\n"

    return {"prompt": prompt}


tokenization_process = lambda data_chunk: tokenizer(
    data_chunk["prompt"],
    add_special_tokens=False,
    max_length=max_length,
    padding="max_length"
)

dataset = load_dataset("HuggingFaceH4/ultrachat_200k")
dataset_train = dataset["train_gen"].map(ultra_chat_prompting_process, num_proc=12)
dataset_train = dataset_train.map(
    tokenization_process,
    num_proc=12,
    remove_columns=dataset_train.column_names
)

# you can do the same for evaluation process dataset

trainer = CausalLanguageModelTrainer(
    train_arguments,
    dataset,
    checkpoint_path=None
)

output = trainer.train(flax.core.FrozenDict({"params": params}))
print(f"Hey ! , here"s where your model saved {output.last_save_file_name}")

0 replies

StableFluffy · 2024-01-06T00:44:28Z

StableFluffy
Jan 6, 2024
Author

the issue before solved but this occurs.

realmaywell@t1v-n-aadb89b8-w-0:~$ sudo gcloud compute tpus tpu-vm ssh node-1 --zone=europe-west4-a --worker=all --command="python train.py"
SSH: Attempting to connect to worker 0...
SSH: Attempting to connect to worker 1...
SSH: Attempting to connect to worker 2...
SSH: Attempting to connect to worker 3...
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode="default".
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode="default".
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode="default".
  table = cls._concat_blocks(blocks, axis=0)
/usr/local/lib/python3.10/dist-packages/datasets/table.py:1421: FutureWarning: promote has been superseded by mode="default".
  table = cls._concat_blocks(blocks, axis=0)
wandb: Currently logged in as: cyine. Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: cyine. Use `wandb login --relogin` to force relogin
wandb: Currently logged in as: cyine. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /root/wandb/run-20240106_004144-kf1rdvkh
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run stoic-music-21
wandb: ⭐️ View project at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel
wandb: 🚀 View run at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/kf1rdvkh
Warning : In case of using finetune = True and Passing checkpoint_path = None you should pass parametersin train function
Time For configure dataloaders (ms) : 0.27751922607421875
wandb: Currently logged in as: cyine. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /root/wandb/run-20240106_004144-dkprckr7
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run ethereal-bush-22
wandb: ⭐️ View project at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel
wandb: 🚀 View run at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/dkprckr7
Warning : In case of using finetune = True and Passing checkpoint_path = None you should pass parametersin train function
Time For configure dataloaders (ms) : 0.29587745666503906
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /root/wandb/run-20240106_004145-29852m23
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run sparkling-salad-23
wandb: ⭐️ View project at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel
wandb: 🚀 View run at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/29852m23
Warning : In case of using finetune = True and Passing checkpoint_path = None you should pass parametersin train function
Time For configure dataloaders (ms) : 0.26917457580566406
wandb: Tracking run with wandb version 0.16.1
wandb: Run data is saved locally in /root/wandb/run-20240106_004145-dzd3224i
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run rosy-plant-24
wandb: ⭐️ View project at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel
wandb: 🚀 View run at https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/dzd3224i
Warning : In case of using finetune = True and Passing checkpoint_path = None you should pass parametersin train function
Time For configure dataloaders (ms) : 0.2720355987548828
Time For configure Model ,Optimizer ,Scheduler and Config (ms) : 2181.442975997925
Time For configure Model ,Optimizer ,Scheduler and Config (ms) : 2289.212226867676
Time For configure Model ,Optimizer ,Scheduler and Config (ms) : 2198.3790397644043
Time For configure Model ,Optimizer ,Scheduler and Config (ms) : 2131.2127113342285
Time For configure functions and sharding them (ms) : 2847.4793434143066
Time For configure functions and sharding them (ms) : 2840.202569961548
Time For configure functions and sharding them (ms) : 2763.9214992523193
Time For configure functions and sharding them (ms) : 2713.21439743042
Action : Sharding Passed Parameters
Model Contain :  1.100048384  Billion Parameters
0it [00:00, ?it/s]Traceback (most recent call last):
  File "/root/train.py", line 101, in <module>
    output = trainer.train(flax.core.FrozenDict({"params": params}))
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 647, in train
    filename = f"{self.arguments.model_name}-{sum(losses) / len(losses)}-{i}"
ZeroDivisionError: division by zero
Traceback (most recent call last):
  File "/root/train.py", line 101, in <module>
    output = trainer.train(flax.core.FrozenDict({"params": params}))
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 647, in train
    filename = f"{self.arguments.model_name}-{sum(losses) / len(losses)}-{i}"
ZeroDivisionError: division by zero
Action : Sharding Passed Parameters
Model Contain :  1.100048384  Billion Parameters
0it [00:00, ?it/s]Traceback (most recent call last):
  File "/root/train.py", line 101, in <module>
    output = trainer.train(flax.core.FrozenDict({"params": params}))
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 647, in train
    filename = f"{self.arguments.model_name}-{sum(losses) / len(losses)}-{i}"
ZeroDivisionError: division by zero
Traceback (most recent call last):
  File "/root/train.py", line 101, in <module>
    output = trainer.train(flax.core.FrozenDict({"params": params}))
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 647, in train
    filename = f"{self.arguments.model_name}-{sum(losses) / len(losses)}-{i}"
ZeroDivisionError: division by zero
Action : Sharding Passed Parameters
Model Contain :  1.100048384  Billion Parameters
0it [00:00, ?it/s]Traceback (most recent call last):
  File "/root/train.py", line 101, in <module>
    output = trainer.train(flax.core.FrozenDict({"params": params}))
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 647, in train
    filename = f"{self.arguments.model_name}-{sum(losses) / len(losses)}-{i}"
ZeroDivisionError: division by zero
Action : Sharding Passed Parameters
Model Contain :  1.100048384  Billion Parameters
0it [00:00, ?it/s]Traceback (most recent call last):
  File "/root/train.py", line 101, in <module>
    output = trainer.train(flax.core.FrozenDict({"params": params}))
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 647, in train
    filename = f"{self.arguments.model_name}-{sum(losses) / len(losses)}-{i}"
ZeroDivisionError: division by zero
Traceback (most recent call last):
  File "/root/train.py", line 101, in <module>
    output = trainer.train(flax.core.FrozenDict({"params": params}))
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 647, in train
    filename = f"{self.arguments.model_name}-{sum(losses) / len(losses)}-{i}"
ZeroDivisionError: division by zero
Traceback (most recent call last):
  File "/root/train.py", line 101, in <module>
    output = trainer.train(flax.core.FrozenDict({"params": params}))
  File "/usr/local/lib/python3.10/dist-packages/EasyDel/trainer/causal_language_model_trainer.py", line 647, in train
    filename = f"{self.arguments.model_name}-{sum(losses) / len(losses)}-{i}"
ZeroDivisionError: division by zero
wandb: 
wandb: Run history:
wandb: model billion parameters ▁
wandb: 
wandb: Run summary:
wandb: model billion parameters 1.10005
wandb: 
wandb: 🚀 View run stoic-music-21 at: https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/kf1rdvkh
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240106_004144-kf1rdvkh/logs
wandb: 
wandb: Run history:
wandb: model billion parameters ▁
wandb: 
wandb: Run summary:
wandb: model billion parameters 1.10005
wandb: 
wandb: 🚀 View run sparkling-salad-23 at: https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/29852m23
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240106_004145-29852m23/logs
wandb: 
wandb: Run history:
wandb: model billion parameters ▁
wandb: 
wandb: Run summary:
wandb: model billion parameters 1.10005
wandb: 
wandb: 🚀 View run rosy-plant-24 at: https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/dzd3224i
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240106_004145-dzd3224i/logs
wandb: 
wandb: Run history:
wandb: model billion parameters ▁
wandb: 
wandb: Run summary:
wandb: model billion parameters 1.10005
wandb: 
wandb: 🚀 View run ethereal-bush-22 at: https://wandb.ai/cyine/easydel-my_first_model_to_train_using_easydel/runs/dkprckr7
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240106_004144-dkprckr7/logs
##### Command execution on worker 3 failed with exit status 1. Continuing.
##### Command execution on worker 2 failed with exit status 1. Continuing.
##### Command execution on worker 0 failed with exit status 1. Continuing.
##### Command execution on worker 1 failed with exit status 1. Continuing.

Thanks,

0 replies

erfanzar · 2024-01-06T07:06:04Z

erfanzar
Jan 6, 2024
Maintainer

funny bug in my code again

import jax.numpy
from EasyDel import (
    TrainArguments,
    CausalLanguageModelTrainer,
    AutoEasyDelModelForCausalLM,
    EasyDelOptimizers,
    EasyDelSchedulers,
    EasyDelGradientCheckPointers
)
from datasets import load_dataset
import flax
from jax import numpy as jnp
from transformers import AutoTokenizer

huggingface_repo_id_or_path = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"

model, params = AutoEasyDelModelForCausalLM.from_pretrained(huggingface_repo_id_or_path, )

max_length = 2048
tokenizer = AutoTokenizer.from_pretrained(
    huggingface_repo_id_or_path,
    trust_remote_code=True
)
tokenizer.pad_token = tokenizer.eos_token
configs_to_init_model_class = {
    "config": model.config,
    "dtype": jnp.bfloat16,
    "param_dtype": jnp.bfloat16,
    "input_shape": (1, 1)
}

train_arguments = TrainArguments(
    model_class=type(model),
    model_name="my_first_model_to_train_using_easydel",
    num_train_epochs=3,
    configs_to_init_model_class=configs_to_init_model_class ,
    learning_rate=5e-5,
    learning_rate_end=1e-6,
    optimizer=EasyDelOptimizers.ADAMW,  # "adamw", "lion", "adafactor" are supported
    scheduler=EasyDelSchedulers.LINEAR,
    # "linear","cosine", "none" ,"warm_up_cosine" and "warm_up_linear"  are supported
    weight_decay=0.01,
    total_batch_size=64,
    max_steps=None,  # None to let trainer Decide
    do_train=True,
    do_eval=False,  # it"s optional but supported 
    backend="tpu",  # default backed is set to cpu, so you must define you want to use tpu cpu or gpu
    max_length=max_length,  # Note that you have to change this in the model config too
    gradient_checkpointing=EasyDelGradientCheckPointers.NOTHING_SAVEABLE,
    sharding_array=(1, -1, 1, 1),  # the way to shard model across gpu,cpu or TPUs using sharding array (1, -1, 1, 1)
    # everything training will be in fully FSDP automatic and share data between devices
    use_pjit_attention_force=False,
    remove_ckpt_after_load=True,
    gradient_accumulation_steps=8,
    loss_re_mat="",
    dtype=jnp.bfloat16
)


def ultra_chat_prompting_process(
        data_chunk
):
    user_part = [
        chunk["content"] for chunk in data_chunk["messages"] if chunk["role"] == "user"
    ]
    assistant_part = [
        chunk["content"] for chunk in data_chunk["messages"] if chunk["role"] == "assistant"
    ]

    prompt = ""

    for uc, ac in zip(user_part, assistant_part):
        prompt += f"<|user|>\n{uc}</s>\n<|assistant|>\n{ac}</s>\n"

    return {"prompt": prompt}


tokenization_process = lambda data_chunk: tokenizer(
    data_chunk["prompt"],
    add_special_tokens=False,
    max_length=max_length,
    padding="max_length"
)

dataset = load_dataset("HuggingFaceH4/ultrachat_200k")
dataset_train = dataset["train_gen"].map(ultra_chat_prompting_process, num_proc=12)
dataset_train = dataset_train.map(
    tokenization_process,
    num_proc=12,
    remove_columns=dataset_train.column_names
)

# you can do the same for evaluation process dataset

trainer = CausalLanguageModelTrainer(
    train_arguments,
    dataset_train,
    checkpoint_path=None
)

output = trainer.train(flax.core.FrozenDict({"params": params}))
print(f"Hey ! , here"s where your model saved {output.last_save_file_name}")

0 replies

StableFluffy · 2024-01-06T08:45:29Z

StableFluffy
Jan 6, 2024
Author

 device: Total 2.1GB
         266.1MB (12.62%): TPU_4(process=1,(2,0,0,0))
         263.1MB (12.48%): TPU_12(process=1,(2,1,0,0))
         263.1MB (12.48%): TPU_13(process=1,(2,1,0,1))
         263.1MB (12.48%): TPU_14(process=1,(3,1,0,0))
         263.1MB (12.48%): TPU_15(process=1,(3,1,0,1))
         263.1MB (12.48%): TPU_5(process=1,(2,0,0,1))
         263.1MB (12.48%): TPU_6(process=1,(3,0,0,0))
         263.1MB (12.48%): TPU_7(process=1,(3,0,0,1))

 kind: Total 2.4GB
         2.1GB (85.09%): buffer
       369.3MB (14.91%): executable

  2%|▏         | 94/6000 [16:00<15:02:43,  9.17s/it, accuracy=0.6707478, learning_rate=4.92e-5, loss=1.3551037, perplexity=3.88, step=94]                                                                                                                                                                                                                                                              2%|▏         | 94/6000 [15:53<15:02:43,  9.17s/it, accuracy=0.6707478, learning_rate=4.92e-5, loss=1.3551037, perplexity=3.88, step=94]                                                                                                                                                                                                                                                              2%|▏         | 94/6000 [15:59<15:02:41,  9.17s/it, accuracy=0.6707478, learning_rate=4.92e-5, loss=1.3551037, perplexity=3.88, step=94]                                                                                                                                                                                                                                                            
 device: Total 2.1GB
         266.1MB (12.62%): TPU_16(process=2,(0,2,0,0))
         263.1MB (12.48%): TPU_17(process=2,(0,2,0,1))
         263.1MB (12.48%): TPU_18(process=2,(1,2,0,0))
         263.1MB (12.48%): TPU_19(process=2,(1,2,0,1))
         263.1MB (12.48%): TPU_24(process=2,(0,3,0,0))
         263.1MB (12.48%): TPU_25(process=2,(0,3,0,1))
         263.1MB (12.48%): TPU_26(process=2,(1,3,0,0))
         263.1MB (12.48%): TPU_27(process=2,(1,3,0,1))

 kind: Total 2.4GB
         2.1GB (85.09%): buffer
       369.5MB (14.91%): executable

 device: Total 2.1GB
         266.1MB (12.62%): TPU_20(process=3,(2,2,0,0))
         263.1MB (12.48%): TPU_21(process=3,(2,2,0,1))
         263.1MB (12.48%): TPU_22(process=3,(3,2,0,0))
         263.1MB (12.48%): TPU_23(process=3,(3,2,0,1))
         263.1MB (12.48%): TPU_28(process=3,(2,3,0,0))
         263.1MB (12.48%): TPU_29(process=3,(2,3,0,1))
         263.1MB (12.48%): TPU_30(process=3,(3,3,0,0))
         263.1MB (12.48%): TPU_31(process=3,(3,3,0,1))

 kind: Total 2.4GB
         2.1GB (85.09%): buffer
       369.5MB (14.91%): executable

 device: Total 2.1GB
         266.1MB (12.62%): TPU_0(process=0,(0,0,0,0))
         263.1MB (12.48%): TPU_1(process=0,(0,0,0,1))
         263.1MB (12.48%): TPU_10(process=0,(1,1,0,0))
         263.1MB (12.48%): TPU_11(process=0,(1,1,0,1))
         263.1MB (12.48%): TPU_2(process=0,(1,0,0,0))
         263.1MB (12.48%): TPU_3(process=0,(1,0,0,1))
         263.1MB (12.48%): TPU_8(process=0,(0,1,0,0))
         263.1MB (12.48%): TPU_9(process=0,(0,1,0,1))

 kind: Total 2.4GB
         2.1GB (85.09%): buffer
       369.5MB (14.91%): executable

  2%|▏         | 95/6000 [16:10<15:08:50,  9.23s/it, accuracy=0.67807883, learning_rate=4.92e-5, loss=1.3022916, perplexity=3.68, step=95]                                                                                                                                                                                                                                                           
 device: Total 2.1GB
         266.1MB (12.62%): TPU_4(process=1,(2,0,0,0))
         263.1MB (12.48%): TPU_12(process=1,(2,1,0,0))
         263.1MB (12.48%): TPU_13(process=1,(2,1,0,1))
         263.1MB (12.48%): TPU_14(process=1,(3,1,0,0))
         263.1MB (12.48%): TPU_15(process=1,(3,1,0,1))
         263.1MB (12.48%): TPU_5(process=1,(2,0,0,1))
         263.1MB (12.48%): TPU_6(process=1,(3,0,0,0))
         263.1MB (12.48%): TPU_7(process=1,(3,0,0,1))

 kind: Total 2.4GB
         2.1GB (85.09%): buffer
       369.5MB (14.91%): executable

  2%|▏         | 95/6000 [16:03<15:08:49,  9.23s/it, accuracy=0.67807883, learning_rate=4.92e-5, loss=1.3022916, perplexity=3.68, step=95]                                                                                                                                                                                                                                                             2%|▏         | 95/6000 [16:09<15:08:50,  9.23s/it, accuracy=0.67807883, learning_rate=4.92e-5, loss=1.3022916, perplexity=3.68, step=95]                                                                                                                                                                                                                                                             2%|▏         | 95/6000 [16:09<15:08:48,  9.23s/it, accuracy=0.67807883, learning_rate=4.92e-5, loss=1.3022916, perplexity=3.68, step=95]                                                                                                                                                                                                                                                           
 device: Total 2.1GB
         266.1MB (12.62%): TPU_16(process=2,(0,2,0,0))
         263.1MB (12.48%): TPU_17(process=2,(0,2,0,1))
         263.1MB (12.48%): TPU_18(process=2,(1,2,0,0))
         263.1MB (12.48%): TPU_19(process=2,(1,2,0,1))
         263.1MB (12.48%): TPU_24(process=2,(0,3,0,0))
         263.1MB (12.48%): TPU_25(process=2,(0,3,0,1))
         263.1MB (12.48%): TPU_26(process=2,(1,3,0,0))
         263.1MB (12.48%): TPU_27(process=2,(1,3,0,1))

 kind: Total 2.4GB
         2.1GB (85.08%): buffer
       369.6MB (14.92%): executable

 device: Total 2.1GB
         266.1MB (12.62%): TPU_0(process=0,(0,0,0,0))
         263.1MB (12.48%): TPU_1(process=0,(0,0,0,1))
         263.1MB (12.48%): TPU_10(process=0,(1,1,0,0))
         263.1MB (12.48%): TPU_11(process=0,(1,1,0,1))
         263.1MB (12.48%): TPU_2(process=0,(1,0,0,0))
         263.1MB (12.48%): TPU_3(process=0,(1,0,0,1))
         263.1MB (12.48%): TPU_8(process=0,(0,1,0,0))
         263.1MB (12.48%): TPU_9(process=0,(0,1,0,1))

 kind: Total 2.4GB
         2.1GB (85.08%): buffer
       369.6MB (14.92%): executable

 device: Total 2.1GB
         266.1MB (12.62%): TPU_20(process=3,(2,2,0,0))
         263.1MB (12.48%): TPU_21(process=3,(2,2,0,1))
         263.1MB (12.48%): TPU_22(process=3,(3,2,0,0))
         263.1MB (12.48%): TPU_23(process=3,(3,2,0,1))
         263.1MB (12.48%): TPU_28(process=3,(2,3,0,0))
         263.1MB (12.48%): TPU_29(process=3,(2,3,0,1))
         263.1MB (12.48%): TPU_30(process=3,(3,3,0,0))
         263.1MB (12.48%): TPU_31(process=3,(3,3,0,1))

 kind: Total 2.4GB
         2.1GB (85.08%): buffer
       369.6MB (14.92%): executable

 device: Total 2.1GB
         266.1MB (12.62%): TPU_4(process=1,(2,0,0,0))
         263.1MB (12.48%): TPU_12(process=1,(2,1,0,0))
         263.1MB (12.48%): TPU_13(process=1,(2,1,0,1))
         263.1MB (12.48%): TPU_14(process=1,(3,1,0,0))
         263.1MB (12.48%): TPU_15(process=1,(3,1,0,1))
         263.1MB (12.48%): TPU_5(process=1,(2,0,0,1))
         263.1MB (12.48%): TPU_6(process=1,(3,0,0,0))
         263.1MB (12.48%): TPU_7(process=1,(3,0,0,1))

 kind: Total 2.4GB
         2.1GB (85.08%): buffer
       369.6MB (14.92%): executable

  2%|▏         | 96/6000 [16:19<15:08:08,  9.23s/it, accuracy=0.6724632, learning_rate=4.92e-5, loss=1.3242081, perplexity=3.76, step=96]                                                                                                                                                                                                                                                              2%|▏         | 96/6000 [16:12<15:08:09,  9.23s/it, accuracy=0.6724632, learning_rate=4.92e-5, loss=1.3242081, perplexity=3.76, step=96]                                                                                                                                                                                                                                                              2%|▏         | 96/6000 [16:18<15:08:08,  9.23s/it, accuracy=0.6724632, learning_rate=4.92e-5, loss=1.3242081, perplexity=3.76, step=96]                                                                                                                                                                                                                                                              2%|▏         | 96/6000 [16:18<15:08:07,  9.23s/it, accuracy=0.6724632, learning_rate=4.92e-5, loss=1.3242081, perplexity=3.76, step=96]

Now it working like this

Does this look good? cuz this feels bit slow.
and is it normal to having 4 wandb log on this 1 train?

Thank you,

0 replies

erfanzar · 2024-01-06T08:47:17Z

erfanzar
Jan 6, 2024
Maintainer

it"s not normal and library by default will only use one wandb and for second tip that I can give you is
you have 32*16 GB ram you can use higher batch size instead of 32 or 64
try 256

0 replies

StableFluffy · 2024-01-06T08:48:41Z

StableFluffy
Jan 6, 2024
Author

Okay I"ll try increase batch size.
Any workaround i can do to fix wandb thing?

0 replies

erfanzar · 2024-01-06T09:02:16Z

erfanzar
Jan 6, 2024
Maintainer

how you are using EasyDel?
you downloaded that via Pypi or cloned this from github

0 replies

StableFluffy · 2024-01-06T09:07:50Z

StableFluffy
Jan 6, 2024
Author

Pypi

0 replies

erfanzar · 2024-01-06T09:09:46Z

erfanzar
Jan 6, 2024
Maintainer

use GitHub method, that"s fixed after last version

pip install git+https://github.com/erfanzar/EasyDeL.git -U

0 replies

StableFluffy · 2024-01-06T09:10:32Z

StableFluffy
Jan 6, 2024
Author

ah okay thank you
I"ll try it

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Help train on tpu v3-32 #74

{{title}}

Replies: 12 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Help train on tpu v3-32 #74

StableFluffy Jan 4, 2024

Replies: 12 comments

erfanzar Jan 5, 2024 Maintainer

StableFluffy Jan 5, 2024 Author

erfanzar Jan 5, 2024 Maintainer

StableFluffy Jan 6, 2024 Author

erfanzar Jan 6, 2024 Maintainer

StableFluffy Jan 6, 2024 Author

erfanzar Jan 6, 2024 Maintainer

StableFluffy Jan 6, 2024 Author

erfanzar Jan 6, 2024 Maintainer

StableFluffy Jan 6, 2024 Author

erfanzar Jan 6, 2024 Maintainer

StableFluffy Jan 6, 2024 Author

StableFluffy
Jan 4, 2024

erfanzar
Jan 5, 2024
Maintainer

StableFluffy
Jan 5, 2024
Author

erfanzar
Jan 5, 2024
Maintainer

StableFluffy
Jan 6, 2024
Author

erfanzar
Jan 6, 2024
Maintainer

StableFluffy
Jan 6, 2024
Author

erfanzar
Jan 6, 2024
Maintainer

StableFluffy
Jan 6, 2024
Author

erfanzar
Jan 6, 2024
Maintainer

StableFluffy
Jan 6, 2024
Author

erfanzar
Jan 6, 2024
Maintainer

StableFluffy
Jan 6, 2024
Author