What is this?
Load quantized LLMs and run them on your local devices. Interact with your own notes (markdown notes are currently supported by default): ask questions about what you have written and get relevant answers drawn from your notes. The recommended model is TinyLlama-chat (https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/blob/main/README.md), chosen for its token generation speed and its low system requirements. Start with the 8-bit quantized model; if that is still too demanding, you can always try 5-bit, 4-bit, etc.
Basically, RAGGA is a local, LLM-based AI assistant that cares about your privacy and does not require running anything through untrusted third-party APIs. The one exception is the web search feature, which can be completely disabled so that nothing ever leaves your system.
Download one of the TinyLlama-chat quantized models (e.g. the 8-bit quantized TinyLlama-1.1B-Chat-v1.0-GGUF) and copy it to the models directory.
Install all the prerequisites and then install RAGGA.
Edit the config.yaml file so that the generator: llm_path: key points to the model you downloaded, and update the dataset: path: key to point to your directory containing markdown files.
Run the tinyllama_example.py script to try it out.
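Before running the example script, it can help to confirm that the two paths set in config.yaml actually resolve. The following is a minimal sketch, not part of RAGGA: the function name check_setup and the placeholder paths are hypothetical, and the check for a .gguf extension is an assumption based on the recommended GGUF model.

```python
from pathlib import Path


def check_setup(llm_path: str, notes_dir: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # The value you put under generator: llm_path: in config.yaml.
    model = Path(llm_path)
    if not model.is_file():
        problems.append(f"model file not found: {model}")
    elif model.suffix.lower() != ".gguf":
        # Assumption: the downloaded model is a GGUF file.
        problems.append(f"expected a .gguf file, got: {model.name}")

    # The value you put under dataset: path: in config.yaml.
    notes = Path(notes_dir)
    if not notes.is_dir():
        problems.append(f"notes directory not found: {notes}")
    elif not any(notes.rglob("*.md")):
        problems.append(f"no markdown files under: {notes}")

    return problems


if __name__ == "__main__":
    # Placeholder paths: substitute the values from your own config.yaml.
    for problem in check_setup("models/your-model.gguf", "path/to/notes"):
        print(problem)
```

An empty result means both paths look usable; any message printed points at the config key that needs correcting.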
Because hatch does not fully support passing pip options:
- PyTorch needs to be installed manually
- llama-cpp-python needs to be installed manually
- meaning you need a C compiler (e.g. Visual Studio 2022 Community C++, build-essential, etc.)
Once these issues with hatch are resolved, installation will become significantly easier.
Both CPU and GPU will require llama-cpp-python to be installed. The pip --pre flag is required because the current stable release as of writing does not support phi-2.
- Any Python 3.11 installation (venv, conda, etc.)
- Faiss (CPU) will be installed with the package
- Install PyTorch
- Windows / MacOS
pip install torch
- Linux
pip install torch --index-url https://download.pytorch.org/whl/cpu
- Install llama-cpp-python
- Windows / Linux / MacOS: Default (supports CPU without acceleration)
pip install --pre llama-cpp-python
- or Install llama-cpp-python with OpenBLAS Hardware Acceleration (optional)
- Windows: OpenBLAS
$env:CMAKE_ARGS = "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
pip install --pre llama-cpp-python
- Linux/MacOS: OpenBLAS
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install --pre llama-cpp-python
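Since PyTorch and llama-cpp-python are installed by hand, a quick check that both are importable can save a confusing failure later. This is a minimal sketch using only the standard library; the helper name missing_prerequisites is hypothetical, and the module names torch and llama_cpp are the import names of the two packages.

```python
from importlib.util import find_spec

# Import names (not pip names) of the manually installed prerequisites.
REQUIRED_MODULES = {
    "torch": "PyTorch",
    "llama_cpp": "llama-cpp-python",
}


def missing_prerequisites(modules: dict[str, str] = REQUIRED_MODULES) -> list[str]:
    """Return the package names whose import module cannot be found."""
    return [name for module, name in modules.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Still to install:", ", ".join(missing))
    else:
        print("All manual prerequisites found.")
```

find_spec only locates the module without importing it, so the check stays fast even though PyTorch is a large package.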
Make sure you have CUDA 12.1 or higher installed.
- Install miniforge / miniconda
- Create a new environment with Python 3.11
conda create -n ragga python=3.11
- Activate that environment and install faiss, pytorch and llama-cpp-python
conda activate ragga
conda install faiss-gpu pytorch pytorch-cuda=12.1 -c pytorch -c nvidia -c conda-forge
- Windows: llama-cpp-python with CUBLAS acceleration
$env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"
pip install --pre llama-cpp-python
- Linux: llama-cpp-python with CUBLAS acceleration
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --pre llama-cpp-python
Note: You do not need to use conda/mamba to install faiss-gpu, but as there are no wheels for it, you will need to compile it yourself. This is not covered here, but see the faiss build documentation.
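After the GPU install steps, you may want to confirm that PyTorch can actually see CUDA. The sketch below is an illustration only: the function name gpu_status is hypothetical, and it checks PyTorch alone (it says nothing about whether faiss or llama-cpp-python were built with GPU support).

```python
def gpu_status() -> str:
    """Describe whether the installed PyTorch build can use CUDA."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    if not torch.cuda.is_available():
        return "PyTorch is installed, but CUDA is not available"

    # torch.version.cuda is the CUDA version PyTorch was built against.
    return f"CUDA is available (PyTorch built against CUDA {torch.version.cuda})"


if __name__ == "__main__":
    print(gpu_status())
```

If this reports that CUDA is not available inside the ragga environment, the usual suspects are a CPU-only PyTorch build or a driver that does not match the installed CUDA version.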
pip install 'ragga[cpu] @ https://github.com/zeyus/RAGGA/releases/download/v0.0.6/ragga-0.0.6-py3-none-any.whl'
pip install 'ragga @ https://github.com/zeyus/RAGGA/releases/download/v0.0.5/ragga-0.0.5-py3-none-any.whl'
Subjective evaluation of 3 different models was performed by generating responses to general and dataset-specific questions for 4 publicly available Obsidian notes repositories. Response quality was rated from 0 to 10 for each question. Responses were shown anonymously, so there was no indication of which model had generated each one.
The phi-2 and phi-2 chat configurations used the same base model but different prompt templates. For evaluation and rating details, see model_tests.py and scripts/03_rank_output.py.
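To illustrate the general idea of blind rating, here is a minimal sketch of shuffling responses so that display order reveals nothing, then averaging 0 to 10 scores per model. It is not a reproduction of model_tests.py or scripts/03_rank_output.py; the function names, the model labels, and the example scores are all hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean


def anonymize(responses: dict[str, str], rng: random.Random) -> list[tuple[str, str]]:
    """Shuffle (model, response) pairs; show only the response to the rater."""
    items = list(responses.items())
    rng.shuffle(items)
    return items


def mean_scores(ratings: list[tuple[str, int]]) -> dict[str, float]:
    """Average the 0-10 ratings collected for each model."""
    by_model: dict[str, list[int]] = defaultdict(list)
    for model, score in ratings:
        if not 0 <= score <= 10:
            raise ValueError(f"score out of range 0-10: {score}")
        by_model[model].append(score)
    return {model: mean(scores) for model, scores in by_model.items()}


if __name__ == "__main__":
    # Illustrative data only, not results from the evaluation.
    shuffled = anonymize({"model_a": "answer one", "model_b": "answer two"}, random.Random())
    for _, response in shuffled:
        print(response)  # the rater sees the text, never the model name
    print(mean_scores([("model_a", 4), ("model_a", 6), ("model_b", 9)]))
```

Keeping the model name alongside each shuffled response lets the scores be mapped back to models once rating is finished.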
ragga is distributed under the terms of the MIT license.