Abstract
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
Community
533 authors ...
533 authors is insane 💀
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DataComp-LM: In search of the next generation of training sets for language models (2024)
- Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning (2024)
- VoCo-LLaMA: Towards Vision Compression with Large Language Models (2024)
- LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (2024)
- Evolving Subnetwork Training for Large Language Models (2024)
Hi @meta-llama, we're reviewing the paper and would like a bit of clarification on a passage about the pre-training curriculum.
We're trying to understand the token counts and proportions of the different training stages.
Specifically:
> We pre-train Llama 3 405B using AdamW with a peak learning rate of 8 × 10⁻⁵, a linear warm up of 8,000 steps, and a cosine learning rate schedule decaying to 8 × 10⁻⁷ over 1,200,000 steps. We use a lower batch size early in training to improve training stability, and increase it subsequently to improve efficiency. Specifically, we use an initial batch size of 4M tokens and sequences of length 4,096, and double these values to a batch size of 8M sequences of 8,192 tokens after pre-training 252M tokens. We double the batch size again to 16M after pre-training on 2.87T tokens. We found this training recipe to be very stable: we observed few loss spikes and did not require interventions to correct for model training divergence.
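For reference, this is how we read the learning-rate part of that recipe: a linear warmup to the peak, then a cosine decay down to the final value. The sketch below is only our interpretation of the quoted numbers, not Meta's actual code.

```python
import math

# Our reading of the quoted schedule (an assumption, not Meta's implementation):
# linear warmup over 8,000 steps to 8e-5, then cosine decay to 8e-7 over
# a total of 1,200,000 steps.
PEAK_LR = 8e-5
MIN_LR = 8e-7
WARMUP_STEPS = 8_000
TOTAL_STEPS = 1_200_000

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        # Linear warmup from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Cosine decay from the peak down to the minimum learning rate.
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

print(learning_rate(8_000))      # peak: 8e-5
print(learning_rate(1_200_000))  # end of schedule: 8e-7
```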
Does that mean the following?
- First pretraining stage: 252M tokens ÷ 4M tokens per batch ≈ 63 batches (~1,000 samples × 4k sequence length each) ≈ 63,000 pretraining samples
- Second pretraining stage: 2.87T tokens ÷ 8M tokens per batch ≈ 360k batches (~1,000 samples × 8k sequence length each) ≈ 360,000,000 training samples
- Third (last) pretraining stage: ??? tokens ÷ 16M tokens per batch = ??? batches (~2,000 samples × 8k sequence length each) ≈ ???? samples; how many samples did the model see in this stage? (Our arithmetic is sketched in the code just after this list.)
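To make our assumptions explicit, this is the back-of-the-envelope calculation we are doing. It assumes that "batch size of N tokens" means tokens per global batch (so samples per batch = tokens per batch ÷ sequence length) and that the 252M and 2.87T thresholds are cumulative token counts; the stage names and the script itself are ours, not anything from the paper.

```python
# Back-of-the-envelope arithmetic for the three stages as we read them.
# Assumptions (ours): "batch size of N tokens" = tokens per global batch,
# and the 252M / 2.87T thresholds are cumulative token counts.
stages = [
    # (name, tokens trained in this stage, batch size in tokens, sequence length)
    ("stage 1", 252e6,           4e6,  4_096),
    ("stage 2", 2.87e12 - 252e6, 8e6,  8_192),
    ("stage 3", None,            16e6, 8_192),  # stage-3 token count is not stated in the paper
]

for name, tokens, batch_tokens, seq_len in stages:
    samples_per_batch = batch_tokens / seq_len
    if tokens is None:
        print(f"{name}: ? batches of ~{samples_per_batch:.0f} samples x {seq_len} tokens")
        continue
    batches = tokens / batch_tokens
    samples = batches * samples_per_batch
    print(f"{name}: ~{batches:,.0f} batches, ~{samples:,.0f} samples "
          f"(~{samples_per_batch:.0f} samples x {seq_len} tokens per batch)")
```

With these assumptions the per-batch sample counts come out to roughly 977 and 1,953 rather than the round 1,000 and 2,000 used above, but the orders of magnitude match.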
Or should it be understood in some other way? It looks like either some information is missing or, more likely, we're misreading the statements: only two batch-size increases are described, yet that gives three pretraining stages. Also, the batch size is expressed in tokens ("batch size of X tokens"), whereas batch sizes are usually given as a number of samples per batch, which adds to the confusion.
By "training tokens", do you mean tokens from unique samples, or did you train for more than one epoch?
Would really appreciate clarification, thank you! :)