GermanRAG - a German dataset for finetuning Retrieval Augmented Generation

rasdani/germanrag

GermanRAG 🇩🇪📜🦜

This is the code used for creating GermanRAG, a German dataset for finetuning LLMs on Retrieval Augmented Generation tasks (RAG).

How to use

  • Install the requirements with pip install -r requirements.txt.
  • Generate germandpr_subset.jsonl with python germandpr.py.
  • Clone Airoboros, install it there with pip install -e . and copy germandpr_subset.jsonl as well as config_germanrag.yaml into its root directory.
  • Copy airoboros/instructors/germanrag.py and airoboros/instructors/prompts/germanrag.txt from this repo to the respective directories in Airoboros.
  • Add from airoboros.instructors.germanrag import generate as germanrag_generator here.
  • Add "germanrag": germanrag_generator here.
  • Run airoboros generate-instructions --config-path config_germanrag.yaml.
  • Copy your generated instructions.jsonl back into this repo's root directory.
  • Optional: Validate generations with python validate_generations.py.
  • Run python germanrag.py to generate the final dataset.
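
Before copying instructions.jsonl back into this repo, a quick structural check can catch truncated or malformed lines early. The following is a minimal sketch, not the logic of validate_generations.py: the helper name check_jsonl is hypothetical, and it only assumes that the file holds one JSON object per line, without relying on any particular field names.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return (records, bad_line_numbers) for a JSONL file.

    Hypothetical helper: it only checks that every non-empty line
    parses as a JSON object. It does not inspect the content.
    """
    records, bad_lines = [], []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)
                continue
            if isinstance(obj, dict):
                records.append(obj)
            else:
                # Valid JSON, but not an object (e.g. a bare list)
                bad_lines.append(lineno)
    return records, bad_lines
```

Calling check_jsonl("instructions.jsonl") returns the parsed records together with the 1-based numbers of any lines that failed, so an empty second value suggests the file is at least structurally sound.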

Room for improvement

  • Choose how to deduplicate/collapse the contexts in GermanDPR, i.e. by shortest, longest, first or random answer span.
  • Fix the function for the three-sentence context window.
  • Experimental/Optional: Finish chopping and mixing of contexts on chunk level.
  • Add (true) negatives beyond hard negatives, by pairing with random/dissimilar contexts.
  • Generalize to more datasets in SQuAD format.
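
Two of the items above, collapsing duplicate contexts and adding random negatives, could be sketched as follows. This is an illustration under stated assumptions, not code from this repo: it reads "deduplicate" as keeping one record per context when the same context appears with several answer spans, and the record layout (dicts with "context" and "answer" keys), the function names and the "negatives" key are all hypothetical. Random sampling also does not guarantee that a sampled context is a true negative, so a dissimilarity check would still be needed.

```python
import random
from collections import defaultdict


def collapse_contexts(records, strategy="shortest", seed=0):
    """Keep one record per distinct context.

    Assumes each record is a dict with "context" and "answer" keys
    (hypothetical field names). Groups keep first-seen order.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec["context"]].append(rec)

    pickers = {
        "shortest": lambda g: min(g, key=lambda r: len(r["answer"])),
        "longest": lambda g: max(g, key=lambda r: len(r["answer"])),
        "first": lambda g: g[0],
        "random": lambda g: rng.choice(g),
    }
    if strategy not in pickers:
        raise ValueError(f"unknown strategy: {strategy!r}")
    pick = pickers[strategy]
    return [pick(group) for group in groups.values()]


def add_random_negatives(records, n_negatives=1, seed=0):
    """Attach contexts sampled from other records as candidate negatives.

    Only contexts that differ from the record's own context are sampled.
    Whether they are true negatives is not checked here.
    """
    rng = random.Random(seed)
    contexts = sorted({r["context"] for r in records})
    out = []
    for rec in records:
        candidates = [c for c in contexts if c != rec["context"]]
        k = min(n_negatives, len(candidates))
        out.append({**rec, "negatives": rng.sample(candidates, k)})
    return out
```

Fixing the seed keeps the "random" strategy and the negative sampling reproducible across runs, which makes it easier to compare dataset variants.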

Acknowledgments

Collaborate

Feel free to open issues/PRs and come join us in our Discord! 😊

Check out our models at DiscoResearch 🪩🧪.
