How to train multi-node? #67
Replies: 36 comments
-
hello
-
GPU servers and a Slurm cluster.
-
Can you please test this code?

import jax
from EasyDel import TrainArguments, CausalLMTrainer

num_processes = 6
process_id = 0  # number between 0 and num_processes - 1 that says which node is the current node
coordinator_address = "ip:port"  # for example 192.168.1.12:8600 (make sure this port is not blocked by a firewall)

jax.distributed.initialize(
    coordinator_address=coordinator_address,
    num_processes=num_processes,
    process_id=process_id,
)

train_args = TrainArguments(
    backend="gpu",
    sharding_array=(num_processes, -1, 1),
    use_wandb=True,
    use_pjit_attention_force=False,
)

trainer = CausalLMTrainer(
    arguments=train_args,
    dataset_train=...,  # to be passed
    ckpt_path=...,      # to be passed: path to a checkpoint, or None
)

# If you want to fine-tune a model, you can pass parameters to the trainer;
# they should look like frozen({"params": ...}).
parameters = None

trainer.train(model_parameters=parameters or None)
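Since the thread mentions Slurm, here is a minimal sketch of deriving those values from standard Slurm environment variables instead of hard-coding them (SLURM_NTASKS and SLURM_PROCID are standard Slurm variables; COORDINATOR_ADDRESS is a hypothetical variable you would export yourself before launching):

import os
import jax

# One JAX process per Slurm task: total task count and this task's rank.
num_processes = int(os.environ["SLURM_NTASKS"])
process_id = int(os.environ["SLURM_PROCID"])

# Hypothetical variable exported before srun, e.g. "192.168.1.12:8600".
coordinator_address = os.environ["COORDINATOR_ADDRESS"]

jax.distributed.initialize(
    coordinator_address=coordinator_address,
    num_processes=num_processes,
    process_id=process_id,
)

# After initialization every process should see all devices in the cluster.
print(jax.process_index(), jax.local_device_count(), jax.device_count())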
-
I encountered some issues when I tried to run EasyDeL/examples/training/causal-lm/llama.py. I use llama-13b. I first convert hf-llama to flax using this code:

model = AutoModelForCausalLM.from_pretrained(path)

Is my conversion code correct? Then I run llama.py; here is the error:
-
When you try to run it, pass fully_fsdp=False in config.get_partition_rules; that will fix this problem.
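A minimal sketch of that workaround, assuming config is the model config object created in llama.py and that the other trainer arguments stay as in the multi-node example above:

# Hedged sketch: build non-fully-FSDP partition rules and hand them to the trainer.
partition_rules = config.get_partition_rules(fully_fsdp=False)

train_args = TrainArguments(
    custom_rule=partition_rules,
    # ... other arguments unchanged ...
)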
-
This seems to disable fully_fsdp, but I want to use fully_fsdp.
-
Which model are you trying to use?
-
llama-13b v2. But the model's vocabulary size has changed. Does this have any impact?
-
Yes, that's the reason you get this error. Can you tell me your vocab size? Is it the version with EOS/BOS added that has 32002 tokens?
-
Use this:

from jax.sharding import PartitionSpec as PS
from EasyDel import TrainArguments

partition_rules = (
    ("transformer/wte/embedding", PS("dp", "fsdp")),
    ("attention/(wq|wk|wv)/kernel", PS("fsdp")),
    ("attention/wo/kernel", PS("fsdp")),
    ("feed_forward/w1/kernel", PS("fsdp")),
    ("feed_forward/w2/kernel", PS("fsdp")),
    ("feed_forward/w3/kernel", PS("fsdp")),
    ("attention_norm/kernel", PS("fsdp")),
    ("ffn_norm/kernel", PS("fsdp")),
    ("transformer/ln_f/kernel", PS("fsdp")),
    ("lm_head/kernel", PS("fsdp", "dp")),
    (".*", PS("fsdp")),
)

train_args = TrainArguments(
    custom_rule=partition_rules,
    ...
)

This should work fine if you have only changed the vocab size.
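For readers new to these rules: each entry pairs a regex over the flattened parameter path with a PartitionSpec, and the first matching pattern is used, with ".*" as the catch-all (this first-match reading follows EasyLM-style rule matching and is an assumption, not something stated in this thread). A tiny illustration using the partition_rules defined above, with a hypothetical match_rule helper:

import re

def match_rule(rules, param_path):
    # Hypothetical helper: return the PartitionSpec of the first rule whose
    # regex matches the parameter path (not EasyDel's actual function).
    for pattern, spec in rules:
        if re.search(pattern, param_path):
            return spec
    raise ValueError(f"no rule matched {param_path}")

print(match_rule(partition_rules, "params/lm_head/kernel"))   # PartitionSpec('fsdp', 'dp')
print(match_rule(partition_rules, "params/ffn_norm/kernel"))  # PartitionSpec('fsdp')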
-
Yes, it worked!
-
If you have any other issues, please let me know <3
-
The loss calculation has run into an issue. The error is: My vocab size is 55296 and the sequence_length is 1024.
-
Set loss_remat to "":

train_args = TrainArguments(
    custom_rule=partition_rules,
    loss_remat=""
)

This will work. The current error you are getting is because you are trying to use blockwise cross-entropy and your vocab size (55296) is not divisible by 1024, so you can either change loss_remat to "" or change your loss_chunk.
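For context on blockwise cross-entropy: it trades a little extra compute for much lower peak memory by computing the loss over fixed-size chunks of tokens instead of materializing the full (tokens, vocab) log-softmax at once, which is why chunk settings such as loss_chunk have to line up with the tensor shapes. A minimal, illustrative sketch of the idea, chunked over the token axis (this is not EasyDel's actual implementation):

import jax
import jax.numpy as jnp

def chunked_cross_entropy(logits, labels, chunk_size):
    # Illustrative chunked cross-entropy: process chunk_size tokens at a time.
    # For this simple reshape-based version, the token count must be divisible
    # by chunk_size.
    tokens, vocab = logits.shape
    logits = logits.reshape(tokens // chunk_size, chunk_size, vocab)
    labels = labels.reshape(tokens // chunk_size, chunk_size)

    def one_chunk(total_nll, chunk):
        chunk_logits, chunk_labels = chunk
        logp = jax.nn.log_softmax(chunk_logits, axis=-1)
        nll = -jnp.take_along_axis(logp, chunk_labels[:, None], axis=-1).sum()
        return total_nll + nll, None

    total_nll, _ = jax.lax.scan(one_chunk, jnp.zeros((), logits.dtype), (logits, labels))
    return total_nll / tokens  # mean negative log-likelihood

# Toy example: 8 tokens, vocab of 16, chunks of 4 tokens.
logits = jax.random.normal(jax.random.PRNGKey(0), (8, 16))
labels = jnp.arange(8) % 16
print(chunked_cross_entropy(logits, labels, chunk_size=4))

A plain cross-entropy computes the log-softmax over the whole (tokens, vocab) matrix at once; the chunked version only ever holds (chunk_size, vocab) activations at a time.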
-
What is the difference between blockwise_cross and cross_entropy_loss_and_accuracy, i.e. between blockwise cross-entropy and plain cross-entropy? Are there any advantages to using blockwise_cross?
-
Sorry, I explained a part of it wrong:

from flax.traverse_util import unflatten_dict, flatten_dict

flax_params = llama_convert_hf_to_flax(state_dict, num_hidden_layers=40, num_attention_heads=40, hidden_size=5120, device=device)
flax_params = flatten_dict(flax_params)
pt_params = llama_convert_flax_to_pt(flax_params, n_layers=40, dim=5120, num_attention_heads=40)

Use this instead:

from flax.traverse_util import unflatten_dict, flatten_dict

flax_params = llama_convert_hf_to_flax(state_dict, num_hidden_layers=40, num_attention_heads=40, hidden_size=5120, device=device)
flax_params = flatten_dict(flax_params, sep=".")
pt_params = llama_convert_flax_to_pt(flax_params, n_layers=40, dim=5120, num_attention_heads=40)
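For anyone unsure what the sep="." change does: flatten_dict collapses the nested parameter dict into a flat one, and sep controls whether the keys come out as tuples or as dot-joined strings (presumably what llama_convert_flax_to_pt expects, given the fix above). A minimal illustration with a toy dict:

from flax.traverse_util import flatten_dict

nested = {"transformer": {"wte": {"embedding": 0}}}

print(flatten_dict(nested))            # {('transformer', 'wte', 'embedding'): 0}
print(flatten_dict(nested, sep="."))   # {'transformer.wte.embedding': 0}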
-
Docs are available at https://erfanzar.github.io/EasyDeL/docs/
-
Thanks, I will keep testing when I have time.
-
Import error: Traceback (most recent call last):
-
Fixed. I'm sorry for such an error :)
-
If training uses multi-host, does the dataset need any additional processing?
-
Yes. To use EasyDel you should preprocess your dataset: pass a tokenized dataset that contains input_ids and an attention_mask. As for batch size, the batch size you pass per step is multiplied by the number of gradient accumulation steps. For example, if you pass a batch size of 8 to the trainer with gradient accumulation of 8, the total batch size for the data loader becomes 64; if you have 2 hosts, that becomes a batch size of 32 per host, and if you have 8 GPUs per machine, that becomes 4 per GPU.
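A small worked example of that arithmetic (the numbers mirror the description above; the variable names are illustrative, not EasyDel parameters):

batch_size = 8                   # batch size passed to the trainer, per step
gradient_accumulation_steps = 8
num_hosts = 2
gpus_per_host = 8

total_batch = batch_size * gradient_accumulation_steps  # 64 samples per optimizer step
per_host = total_batch // num_hosts                      # 32 samples handled by each host
per_gpu = per_host // gpus_per_host                      # 4 samples per GPU
print(total_batch, per_host, per_gpu)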
-
I get this warning: warning: Linking two modules of different target triples: "LLVMDialectModule" is "nvptx64-nvidia-gpulibs" whereas "" is "nvptx64-nvidia-cuda"

What is use_pjit_attention_force?
-
Also, use_flash_attention seems not to work; I found that the speed is the same whether it is set to true or false.
-
I set max_sequence_length = 10240 and had this error: Also, when using a large sequence_length, the loss becomes NaN.
-
Yes, you are right; you should change your model's max length.
-
I"m really looking forward to Mojo version. Is Mojo a replacement for Jax, or can they work together? |
-
Actually, Mojo is more native, at least the version that I'm creating, and it works without any imported libraries. Since Mojo is a fast, native, compiled language, I coded everything from scratch; the only Python library in use is os, and only to read the size of checkpoints (Mojo doesn't have any built-in I/O library).
-
That is awesome. You want to make a framework like TensorFlow or PyTorch based on Mojo; this requires a significant amount of work.
-
I don't know how to train multi-host. Can you give an example? Thank you.