ziggy

Embedding, quantization, and vector indexing. Designed for speed.

The documentation is pretty sparse right now, but most of the action is in llm.py, index.py, and database.py. Note that there are custom torch kernels for quantization, dequantization, and quantized matmul, so you'll need to compile and install those first to use the vector index capabilities.

Install

For simple usage, no additional steps are needed beyond cloning the repository. If you want quantization support (optional), you need to build the custom torch extensions in the extension directory:

cd extension
python setup.py install --user

For CUDA builds, the environment variable CUDA_HOME must be set to your CUDA directory. To build in CPU-only mode, set the environment variable ZIGGY_CPU_ONLY=1.

If you need to use an older C compiler with nvcc, set CUDAHOSTCXX to the directory containing that compiler's binary.

To build the standalone testing binaries (very optional), download libtorch for CUDA, set LIBTORCH_PATH to its absolute path, and run:

mkdir build
cmake -B build -DCMAKE_PREFIX_PATH=$LIBTORCH_PATH
cmake --build build --config Release

Usage

To load up a Hugging Face embedding model, just run

import ziggy

model_id = 'TaylorAI/bge-micro-v2'
mod = ziggy.HuggingfaceEmbedding(model_id)

By default it applies no optimization and uses the model's maximum token window length. If you want to, say, use ONNX optimization and set a 256-token window

mod = ziggy.HuggingfaceEmbedding(model_id, onnx=True, max_len=256)

You can also use llama.cpp via llama-cpp-python with

mod = ziggy.LlamaCppPython(path_to_gguf)

Then we can embed text with the mod.embed function, passing either a single string or a list of strings.
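For example (a minimal sketch; the exact return type of embed is not documented here, but it is presumably a tensor or array with one embedding vector per input):

vec = mod.embed('a single passage of text')             # one string
vecs = mod.embed(['first passage', 'second passage'])   # or a list of strings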

To load in big corpora, I first convert them to a single JSONL file. To load that file into an index

db = ziggy.DocumentDatabase.from_jsonl(path, embed=mod, delim='\n', min_len=100)

Here we set the within-record delimiter to a single newline and discard any chunks shorter than 100 characters. This is a pretty good approach for Wikipedia.
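For reference, here is a hypothetical sketch of how such a JSONL file might be produced. The exact record schema that from_jsonl expects is not documented here, so the 'title' and 'text' field names below are assumptions, not the library's contract.

import json

# hypothetical corpus: article titles mapped to article text, with newlines
# separating the chunks within each article (matching delim='\n' above)
articles = {
    'Anarchism': 'Anarchism is a political philosophy...\nIt questions the need for...',
}

# write one JSON record per line; the field names here are assumptions
with open('corpus.jsonl', 'w') as f:
    for title, text in articles.items():
        f.write(json.dumps({'title': title, 'text': text}) + '\n')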

If you want to quantize the stored vectors, you can do something like the following

qspec = ziggy.QuantSpec.from_width(ziggy.QuantType.qint4, 0.0, 0.1)
db = ziggy.DocumentDatabase.from_jsonl(path, embed=mod, qspec=qspec, delim='\n', min_len=100)

Now you can do top-k semantic search on the database with something like

db.search('Anarchism', 5)

By default, the indexing operation will have generated average embeddings at the document level, and these are used as a first pass when searching.
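Putting the pieces above together, an end-to-end run might look something like the following (the corpus path and model choice are just placeholders):

import ziggy

# load an embedding model (the ONNX and window-length settings are optional)
mod = ziggy.HuggingfaceEmbedding('TaylorAI/bge-micro-v2', onnx=True, max_len=256)

# index a JSONL corpus with 4-bit quantized storage
# (requires the compiled torch extensions from the Install section)
qspec = ziggy.QuantSpec.from_width(ziggy.QuantType.qint4, 0.0, 0.1)
db = ziggy.DocumentDatabase.from_jsonl('corpus.jsonl', embed=mod, qspec=qspec, delim='\n', min_len=100)

# top-5 semantic search over the indexed documents
db.search('Anarchism', 5)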
