This project implements an experimental framework for evaluating how providing relevant, high-quality context affects the quality of language model responses. It uses a vector database to retrieve similar question-answer pairs and compares model outputs with and without this additional context.
The system:

- Loads reference QA pairs from specified datasets
- Stores them in a vector database for similarity search
- For each experimental question:
  - Retrieves similar QA pairs as context
  - Generates responses both with and without context
  - Evaluates response quality using a reward model
  - Stores results for analysis
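The per-question flow above can be sketched as follows. This is illustrative only: `retrieve_similar`, `generate`, and `score` are hypothetical stand-ins for the retrieval, generation, and reward-scoring steps implemented by the project's modules.

```python
def run_question(question, retrieve_similar, generate, score):
    """Compare a model's answer with and without retrieved context (sketch)."""
    # Retrieve the most similar reference QA pair and its similarity score.
    context_qa, context_score = retrieve_similar(question)

    # Generate one answer with the retrieved context and one without it.
    with_ctx = generate(question, context=context_qa)
    without_ctx = generate(question, context=None)

    # Score both answers once with the reward model.
    with_score = score(question, with_ctx)
    without_score = score(question, without_ctx)

    return {
        "question": question,
        "context_score": context_score,
        "context_qa": context_qa,
        "with_context_answer": with_ctx,
        "without_context_answer": without_ctx,
        "with_context_score": with_score,
        "without_context_score": without_score,
        "with_context_better": with_score > without_score,
    }
```

Each returned dictionary corresponds to one row of the results table described at the end of this README.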
## Features

- Parallel processing for efficient vector database population
- Support for multiple LLM architectures
- Configurable embedding and reward models
- SQLite results storage with comprehensive metrics
- GPU acceleration support
- Batched processing for memory efficiency
## Requirements

- Python 3.8
- PyTorch
- Transformers
- vLLM
- LangChain
- ChromaDB
- SQLAlchemy
- Datasets (HuggingFace)
- tqdm
## Installation

- Clone the repository
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Configuration

Edit `config.py` to customize:

- Dataset sources
- Model selections
- Database paths
- Experiment parameters
Default configuration:

```python
reference_datasets = [
    ("mlabonne/orca-agentinstruct-1M-v1-cleaned", "default"),
]
experiment_dataset = "HuggingFaceTB/smoltalk"
embedding_model = "BAAI/bge-small-en-v1.5"
llm_model = "Qwen/Qwen2.5-7B-Instruct"
reward_model = "internlm/internlm2-7b-reward"
```
## Usage

Run the complete experiment:

```bash
bash run_experiment.sh
```

This will:

- Populate the vector database using parallel processing
- Execute the main experiment
- Store results in an SQLite database
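A minimal version of `run_experiment.sh` might look like the following sketch; the actual script may add logging, environment setup, or extra flags.

```shell
#!/usr/bin/env bash
set -e  # stop at the first failing step

# Step 1: populate the vector database in parallel (GPU-accelerated)
python parallel_insertion.py --use_gpu

# Step 2: run the main experiment; results land in the SQLite database
python main.py
```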
Alternatively, run the two steps manually:

- Populate the vector database:

  ```bash
  python parallel_insertion.py --use_gpu
  ```

- Run the experiment:

  ```bash
  python main.py
  ```
Vector database population options:

```bash
# CPU-only mode
python parallel_insertion.py --num_workers 4

# Specify GPU count
python parallel_insertion.py --use_gpu --num_workers 2
```
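The parallel population strategy can be sketched with Python's `multiprocessing` module. This is a sketch only: `embed_batch` is a placeholder for the real worker, which would load the embedding model and write vectors into ChromaDB, and the actual script also handles GPU assignment per worker.

```python
from multiprocessing import Pool

def chunked(items, size):
    """Split a list into contiguous batches so each worker gets one chunk."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_batch(batch):
    # Placeholder embedding: the real worker loads the embedding model
    # (e.g. BAAI/bge-small-en-v1.5) and inserts the vectors into ChromaDB.
    return [(text, float(len(text))) for text in batch]

def populate(texts, num_workers=4):
    batches = chunked(texts, max(1, len(texts) // num_workers))
    with Pool(num_workers) as pool:
        results = pool.map(embed_batch, batches)
    # Flatten the per-worker results back into one list of (text, vector) pairs.
    return [pair for batch in results for pair in batch]
```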
## Project Structure

- `config.py`: Configuration parameters
- `data_loader.py`: Dataset loading utilities
- `database.py`: Vector and SQL database management
- `experiment.py`: Core experimental logic
- `model_manager.py`: Model loading and inference
- `parallel_insertion.py`: Parallel vector database population
- `main.py`: Experiment entry point
- `run_experiment.sh`: Convenience script
## Components

### `data_loader.py`

Handles loading and preprocessing of reference and experimental datasets.
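Loading a reference dataset and flattening it into retrievable QA text might look like the following sketch. The `question`/`answer` field names are assumptions; adjust them to the actual schema of the dataset you load.

```python
def to_qa_text(example, question_key="question", answer_key="answer"):
    """Render one example as a single retrievable text chunk."""
    return f"Q: {example[question_key]}\nA: {example[answer_key]}"

def load_reference_pairs(dataset_name, config_name="default", limit=None):
    # Imported lazily so the pure helper above stays dependency-free.
    from datasets import load_dataset  # HuggingFace Datasets
    ds = load_dataset(dataset_name, config_name, split="train")
    if limit is not None:
        ds = ds.select(range(limit))
    return [to_qa_text(ex) for ex in ds]
```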
### `database.py`

Manages two database systems:

- ChromaDB for vector similarity search
- SQLite for experimental results storage
### `model_manager.py`

Handles:

- Model loading/unloading
- Response generation
- Response quality evaluation
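Loading and unloading matters because the 7B LLM and the 7B reward model may not both fit in GPU memory at once. A minimal sketch of that keep-one-resident pattern, where the loader callables are placeholders for the real vLLM / Transformers loading code:

```python
class ModelManager:
    """Keep at most one model resident at a time (sketch)."""

    def __init__(self, loaders):
        self.loaders = loaders  # name -> zero-arg callable returning a model
        self.name = None
        self.model = None

    def load(self, name):
        if self.name != name:
            self.unload()  # free the previous model before loading a new one
            self.model = self.loaders[name]()
            self.name = name
        return self.model

    def unload(self):
        # The real manager would also release GPU memory here,
        # e.g. deleting the model and calling torch.cuda.empty_cache().
        self.model = None
        self.name = None
```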
### `experiment.py`

Orchestrates the experimental process:

- Vector database setup
- Batch processing of questions
- Context-based response generation
- Quality evaluation
- Results storage
## Results

Results are stored in SQLite with the following schema:

- `question`: Original question
- `context_score`: Similarity score of retrieved context
- `context_qa`: Retrieved similar QA pair
- `with_context_answer`: Model response with context
- `without_context_answer`: Model response without context
- `with_context_score`: Quality score with context
- `without_context_score`: Quality score without context
- `with_context_better`: Boolean indicating if context improved the response
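The fields above correspond roughly to the following table definition. The column types are assumptions, and the project itself declares the table through SQLAlchemy models; plain `sqlite3` is used here only to keep the sketch self-contained.

```python
import sqlite3

# Illustrative DDL matching the fields listed above (types are assumptions).
SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    question TEXT NOT NULL,
    context_score REAL,
    context_qa TEXT,
    with_context_answer TEXT,
    without_context_answer TEXT,
    with_context_score REAL,
    without_context_score REAL,
    with_context_better BOOLEAN
)
"""

def open_results_db(path=":memory:"):
    """Open (or create) the results database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn
```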