-
Hello, and thanks! The Flash Attention implementation is adapted from the beta version released by Google itself, and Ring Attention is built on and improved from the "Ring Attention with Blockwise Transformers for Near-Infinite Context" paper. In general, Flash Attention is about 40% more memory efficient and roughly 40-50% faster than the default attention. Modeling in pure JAX is much faster than Flax (about 3x up to 12x faster), but to stay user-friendly and let people actually use EasyDeL, I use Flax. I have no problem releasing a pure JAX version of every model; that doesn't seem hard to do. But since JAX is much harder than PyTorch for newcomers and Flax is kinda GNN-like, I stick to Flax.
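To make the pure-JAX-versus-Flax trade-off above concrete, here is a minimal sketch (not EasyDeL's actual code) of the same dense layer written once as a plain JAX function and once as a Flax `linen` module; the shapes and parameter names are illustrative assumptions only.

```python
# Minimal sketch: plain JAX vs. Flax for the same layer (hypothetical shapes/names).
import jax
import jax.numpy as jnp
import flax.linen as nn


def dense_pure_jax(params, x):
    # Pure JAX: parameters are passed explicitly, nothing is hidden.
    return x @ params["kernel"] + params["bias"]


class DenseFlax(nn.Module):
    features: int

    @nn.compact
    def __call__(self, x):
        # Flax: parameter creation and bookkeeping are handled by the module.
        return nn.Dense(self.features)(x)


key = jax.random.PRNGKey(0)
x = jnp.ones((4, 16))

# Pure JAX usage: build the parameter pytree by hand.
params = {"kernel": jax.random.normal(key, (16, 32)), "bias": jnp.zeros((32,))}
y1 = jax.jit(dense_pure_jax)(params, x)

# Flax usage: init/apply manages the parameter pytree for you.
model = DenseFlax(features=32)
variables = model.init(key, x)
y2 = jax.jit(model.apply)(variables, x)
```

The pure-JAX version has less framework overhead and more room for hand-tuning, while the Flax version is closer to what PyTorch users expect, which is the user-friendliness argument made above.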
-
The work is impressive, but I see that there are no benchmarks. Could you please address some concerns about the speed and performance of this library? Specifically (for TPUs):