Abstract
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
Community
533 authors ...
533 authors is insane 💀
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DataComp-LM: In search of the next generation of training sets for language models (2024)
- Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning (2024)
- VoCo-LLaMA: Towards Vision Compression with Large Language Models (2024)
- LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (2024)
- Evolving Subnetwork Training for Large Language Models (2024)
Hi @meta-llama, we're reviewing the paper and would like a bit of clarification on a passage about the pre-training curriculum.
We're trying to understand the token counts and proportions of the different training stages.
Specifically:
> We pre-train Llama 3 405B using AdamW with a peak learning rate of 8 × 10⁻⁵, a linear warm up of 8,000 steps, and a cosine learning rate schedule decaying to 8 × 10⁻⁷ over 1,200,000 steps. We use a lower batch size early in training to improve training stability, and increase it subsequently to improve efficiency. Specifically, we use an initial batch size of 4M tokens and sequences of length 4,096, and double these values to a batch size of 8M sequences of 8,192 tokens after pre-training 252M tokens. We double the batch size again to 16M after pre-training on 2.87T tokens. We found this training recipe to be very stable: we observed few loss spikes and did not require interventions to correct for model training divergence.
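For reference, this is how we read the learning-rate part of that recipe: a linear warmup to the peak, then a cosine decay down to the final value. The sketch below is only our interpretation of the quoted numbers, not Meta's actual code.

```python
import math

# Our reading of the quoted schedule (an assumption, not Meta's implementation):
# linear warmup over 8,000 steps to 8e-5, then cosine decay to 8e-7 over
# a total of 1,200,000 steps.
PEAK_LR = 8e-5
MIN_LR = 8e-7
WARMUP_STEPS = 8_000
TOTAL_STEPS = 1_200_000

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to the minimum learning rate.
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

print(learning_rate(8_000))      # peak: 8e-5
print(learning_rate(1_200_000))  # end of schedule: 8e-7
```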
Does that mean the following?
- First pretraining stage: 252M tokens ÷ 4M tokens per batch ≈ 63 batches (~1,000 samples × 4k sequence length each) ≈ 63,000 pretraining samples
- Second pretraining stage: 2.87T tokens ÷ 8M tokens per batch ≈ 360k batches (~1,000 samples × 8k sequence length each) ≈ 360,000,000 training samples
- Third (last) pretraining stage: ??? tokens ÷ 16M tokens per batch = ??? batches (~2,000 samples × 8k sequence length each) ≈ ???? samples; how many samples did the model see in this stage? (Our arithmetic is sketched in the code just after this list.)
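To make our assumptions explicit, this is the back-of-the-envelope calculation we are doing. It assumes that "batch size of N tokens" means tokens per global batch (so samples per batch = tokens per batch ÷ sequence length) and that the 252M and 2.87T thresholds are cumulative token counts; the stage names and the script itself are ours, not anything from the paper.

```python
# Back-of-the-envelope arithmetic for the three stages as we read them.
# Assumptions (ours): "batch size of N tokens" = tokens per global batch,
# and the 252M / 2.87T thresholds are cumulative token counts.
stages = [
    # (name, tokens trained in this stage, batch size in tokens, sequence length)
    ("stage 1", 252e6,           4e6,  4_096),
    ("stage 2", 2.87e12 - 252e6, 8e6,  8_192),
    ("stage 3", None,            16e6, 8_192),  # stage-3 token count is not stated in the paper
]

for name, tokens, batch_tokens, seq_len in stages:
    samples_per_batch = batch_tokens / seq_len
    if tokens is None:
        print(f"{name}: ? batches of ~{samples_per_batch:.0f} samples x {seq_len} tokens")
        continue
    batches = tokens / batch_tokens
    samples = batches * samples_per_batch
    print(f"{name}: ~{batches:,.0f} batches, ~{samples:,.0f} samples "
          f"(~{samples_per_batch:.0f} samples x {seq_len} tokens per batch)")
```

With these assumptions the per-batch sample counts come out to roughly 977 and 1,953 rather than the round 1,000 and 2,000 used above, but the orders of magnitude match.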
Or should it be understood in some other way? It looks like either some information is missing or, more likely, we're misreading the statements: only two batch-size increases are described, yet that gives three pretraining stages. Also, the batch size is expressed in tokens ("batch size of X tokens"), whereas batch sizes are usually given as a number of samples per batch, which adds to the confusion.
By "training tokens", do you mean tokens from unique samples, or did you train for more than one epoch?
Would really appreciate clarification, thank you! :)