This is the code used for creating GermanRAG, a German dataset for finetuning LLMs on Retrieval Augmented Generation tasks (RAG).
- Install the requirements with `pip install -r requirements.txt`.
- Generate `germandpr_subset.jsonl` with `python germandpr.py`.
- Clone Airoboros, `pip install -e .` there and copy `germandpr_subset.jsonl` as well as `config_germanrag.yaml` into the root directory.
- Copy `airoboros/instructors/germanrag.py` and `airoboros/instructors/prompts/germanrag.txt` from this repo to the respective directories in Airoboros.
- Add `from airoboros.instructors.germanrag import generate as germanrag_generator` here.
- Add `"germanrag": germanrag_generator` here.
- Run `airoboros generate-instructions --config-path config_germanrag.yaml`.
- Copy your generated `instructions.jsonl` back into this repo's root directory.
- Optional: Validate generations with `python validate_generations.py`.
- Run `python germanrag.py` to generate the final dataset.
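
Before copying `.jsonl` files between the two repositories, it can help to confirm that each file parses line by line. The following is a minimal sketch, not the repo's `validate_generations.py`: the helper name `read_jsonl` is hypothetical, and it assumes only that the files contain one JSON object per line, without relying on any particular field names.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines.

    Hypothetical helper for illustration; it is not part of this repo or Airoboros.
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate empty lines, e.g. a trailing newline
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a broken record is easy to locate.
                raise ValueError(
                    f"{path}: invalid JSON on line {line_number}: {err}"
                ) from err
    return records


if __name__ == "__main__":
    # File names taken from the steps above; only checked if they exist.
    for name in ("germandpr_subset.jsonl", "instructions.jsonl"):
        if Path(name).exists():
            print(name, len(read_jsonl(name)), "records")
```

Raising on the first malformed line, rather than skipping it, is a deliberate choice here: a truncated generation run is then noticed before the final dataset is built.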
- Choose how to deduplicate/collapse the contexts in GermanDPR, i.e. on shortest, longest, first/random answer span.
- Fix the function for the three-sentence context window.
- Experimental/Optional: Finish chopping and mixing of contexts on chunk level.
- Add (true) negatives beyond hard negatives, by pairing with random/dissimilar contexts.
- Generalize to more datasets in SQuAD format.
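
Two of the open items above, collapsing duplicate contexts by answer span and adding negatives from random contexts, could look roughly like the sketch below. This is one possible approach under stated assumptions, not the method used in `germandpr.py` or `germanrag.py`: the record keys `"context"`, `"answer"`, and `"negative_contexts"` and both function names are hypothetical, and the negatives are sampled purely at random with no dissimilarity measure.

```python
import random
from collections import defaultdict


def collapse_contexts(records, strategy="shortest", seed=0):
    """Keep one record per distinct context, chosen by the given strategy.

    Assumes each record is a dict with hypothetical keys "context" and "answer".
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec["context"]].append(rec)

    pickers = {
        "shortest": lambda group: min(group, key=lambda r: len(r["answer"])),
        "longest": lambda group: max(group, key=lambda r: len(r["answer"])),
        "first": lambda group: group[0],
        "random": lambda group: rng.choice(group),
    }
    if strategy not in pickers:
        raise ValueError(f"unknown strategy: {strategy!r}")
    pick = pickers[strategy]
    # Dicts preserve insertion order, so contexts keep their first-seen order.
    return [pick(group) for group in groups.values()]


def add_random_negatives(records, k=1, seed=0):
    """Attach up to k randomly chosen other contexts to each record."""
    rng = random.Random(seed)
    contexts = sorted({rec["context"] for rec in records})
    result = []
    for rec in records:
        candidates = [c for c in contexts if c != rec["context"]]
        negatives = rng.sample(candidates, min(k, len(candidates)))
        # Copy the record instead of mutating the caller's data.
        result.append({**rec, "negative_contexts": negatives})
    return result
```

Fixing the seed keeps the `"random"` strategy and the negative sampling reproducible between runs. Replacing the random draw with a similarity-based filter would be the natural next step for truly dissimilar negatives.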
- The GermanRAG dataset is derived from GermanDPR, see 'Acknowledgments' in the dataset card.
- Airoboros by Jon Durbin, consider giving a tip ;)
Feel free to open issues/PRs and come join us in our Discord! 😊
Check out our models at DiscoResearch 🪩🧪.