Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
Abstract
Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by ~3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answer accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the re-phrasing style on the performance of the model, offering insights into how the composition of the training data can impact the performance of LLMs in OOD settings. Our gains are attributed to the fact that re-phrased synthetic data has higher utility than just real data because it (i) incorporates style diversity that closely reflects downstream evaluation style, and (ii) has higher 'quality' than web-scraped data.
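To make the recipe in the abstract concrete, here is a minimal sketch of the data-side idea: wrap each web document in a style instruction, send it to an instruction-tuned model, and keep both the original and the rephrase for pre-training. Everything named here is an assumption for illustration: the prompt wording in `STYLE_PROMPTS`, the helper names `build_prompt` and `wrap_corpus`, the one-rephrase-per-document mix, and the `rephrase` callable, which stands in for whatever model call you use. None of it is taken from the paper's implementation.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical style instructions; the wording is illustrative only,
# not the prompts used by the paper's authors.
STYLE_PROMPTS: Dict[str, str] = {
    "wikipedia": "Paraphrase the following text in a high-quality style like Wikipedia:",
    "qa": "Convert the following text into question-answer format:",
}


def build_prompt(document: str, style: str) -> str:
    """Prefix a web document with the instruction for the chosen style."""
    return f"{STYLE_PROMPTS[style]}\n\n{document}"


def wrap_corpus(
    documents: Iterable[str],
    rephrase: Callable[[str], str],
    style: str,
) -> Iterator[Dict[str, str]]:
    """Yield each real document followed by its synthetic rephrase.

    Assumption: one rephrase per document, so real and synthetic text are
    mixed roughly 1:1. `rephrase` is any function that maps a prompt to
    model output (an API call, a local model, and so on).
    """
    for doc in documents:
        yield {"source": "real", "text": doc}
        yield {"source": "synthetic", "text": rephrase(build_prompt(doc, style))}


def echo_rephraser(prompt: str) -> str:
    """Stand-in for a model call so the sketch runs offline.

    It simply returns the document part of the prompt unchanged.
    """
    return prompt.split("\n\n", 1)[1]


if __name__ == "__main__":
    web_docs = ["cheap flights!!! click here 2 learn how planes fly"]
    for example in wrap_corpus(web_docs, echo_rephraser, style="qa"):
        print(example["source"], "->", example["text"])
```

In practice you would replace `echo_rephraser` with a real model call and shuffle the resulting examples before training. The split between real and synthetic text is a tunable choice, not something this sketch settles.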
Community
I love to see research like this, especially given that it used less compute. Intuition would suggest higher compute because of the heavy pre-pre-processing. (Can I call it that?)
I've been doing this for months. Highly effective. It's 10x more effective when reinforced with self-notes tokens containing deductive logic, which I personally find to be more effective alone, since you can retain the original salient representations in the source training samples. We all know OSS models' writing styles are pretty trash and repetitive. Great research though. The era of hybrid/synthetic data has FULLY arrived.
Isn't this just a roundabout way of distilling the LLM used to rephrase the data?